


Title:
MOLECULE PROJECTION AND ENCODING FOR MACHINE LEARNING
Document Type and Number:
WIPO Patent Application WO/2022/254170
Kind Code:
A1
Abstract:
A computer-implemented method of generating a dataset associated with a molecule is described. The method comprises generating a dataset associated with the molecule by: a) providing a representation of the molecule, the representation comprising a plurality of elements in a three-dimensional arrangement, each element comprising one or more atoms; b) surrounding the representation with a multifaceted shape, the multifaceted shape comprising a plurality of facets, the representation having a position within the multifaceted shape; c) projecting each element of the representation onto one of the facets; and d) using an encoding algorithm to generate a dataset based on the projection of step c). The method also comprises generating a further dataset associated with the molecule by at least one of: (i) performing repeats of steps a)-d), wherein different isotopes of the atoms are provided in the repeat of step a) and the repeat of step d) generates the further dataset associated with the molecule; (ii) performing repeats of steps a)-d), wherein a different three-dimensional arrangement is provided in the repeat of step a) and the repeat of step d) generates the further dataset associated with the molecule; (iii) performing repeats of steps b)-d), wherein a different position is specified in the repeat of step b) and the repeat of step d) generates the further dataset associated with the molecule; and (iv) performing a repeat of step d), wherein a different encoding algorithm is used in the repeat of step d) and the repeat of step d) generates the further dataset associated with the molecule. The generated dataset can be used for training an ML algorithm and also for predicting a property of a molecule.

Inventors:
MATTHEWS ELLA (GB)
Application Number:
PCT/GB2021/051354
Publication Date:
December 08, 2022
Filing Date:
June 02, 2021
Assignee:
UNIV BRISTOL (GB)
International Classes:
G16C20/30; G16C20/50; G16C20/70; G16C20/80; G16C20/90
Other References:
"Handbook of Computational Chemistry", 10 March 2016, SPRINGER INTERNATIONAL PUBLISHING, Cham, ISBN: 978-3-319-27282-5, article POLANSKI JAROSLAW ET AL: "Computer Representation of Chemical Compounds", pages: 1997 - 2039, XP055889895, DOI: 10.1007/978-3-319-27282-5_50
DANIEL C ELTON ET AL: "Deep learning for molecular generation and optimization - a review of the state of the art", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 11 March 2019 (2019-03-11), XP081270483
MOJTABA HAGHIGHATLARI ET AL: "Learning to Make Chemical Predictions: the Interplay of Feature Representation, Data, and Machine Learning Algorithms", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 29 February 2020 (2020-02-29), XP081610884
XU YOUJUN ET AL: "Efficient molecular encoders for virtual screening", DRUG DISCOVERY TODAY: TECHNOLOGIES, ELSEVIER, AMSTERDAM, NL, vol. 32, 1 December 2019 (2019-12-01), pages 19 - 27, XP086429572, ISSN: 1740-6749, [retrieved on 20201004], DOI: 10.1016/J.DDTEC.2020.08.004
SHI TINGTING ET AL: "Molecular image-based convolutional neural network for the prediction of ADMET properties", CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, ELSEVIER SCIENCE PUBLISHERS B.V. AMSTERDAM, NL, vol. 194, 21 September 2019 (2019-09-21), XP085886952, ISSN: 0169-7439, [retrieved on 20190921], DOI: 10.1016/J.CHEMOLAB.2019.103853
Attorney, Agent or Firm:
WITHERS & ROGERS LLP et al. (GB)
Claims:
CLAIMS

1. A computer-implemented method of generating an augmented dataset associated with a molecule, the method comprising: generating a dataset associated with the molecule by: a) providing a representation of the molecule, the representation comprising a plurality of elements in a three-dimensional arrangement, each element comprising one or more atoms; b) surrounding the representation with a multifaceted shape, the multifaceted shape comprising a plurality of facets, the representation having a position within the multifaceted shape; c) projecting each element of the representation onto one of the facets; and d) using an encoding algorithm to generate a dataset based on the projection of step c); and generating a further dataset associated with the molecule by at least one of:

(i) performing repeats of steps a)-d), wherein a different isotope of at least one of the atoms is provided in the repeat of step a) and the repeat of step d) generates the further dataset associated with the molecule;

(ii) performing repeats of steps a)-d), wherein a different three-dimensional arrangement is provided in the repeat of step a) and the repeat of step d) generates the further dataset associated with the molecule;

(iii) performing repeats of steps b)-d), wherein a different position is specified in the repeat of step b) and the repeat of step d) generates the further dataset associated with the molecule; and,

(iv) performing a repeat of step d), wherein a different encoding algorithm is used in the repeat of step d) and the repeat of step d) generates the further dataset associated with the molecule.

2. The computer-implemented method of claim 1, wherein the multifaceted shape is an icosphere.

3. The computer-implemented method of claim 1 or 2, wherein each element of the representation comprises an atomic mass value, and step c) comprises determining a total atomic mass value projected onto each of the facets.

4. The computer-implemented method of claim 3, wherein step c) further comprises determining a multi-dimensional vector comprising: a first dimension representing the atomic mass value of an inner-most element of the representation; a second dimension representing the atomic mass value of an outer-most element of the representation; and, a third dimension representing the total atomic mass value projected onto each of the facets.

5. The computer-implemented method of any preceding claim, wherein the encoding algorithm, to generate the dataset and/or the different encoding algorithm to generate the further dataset, is based on a property of each element of the representation.

6. The computer-implemented method of claim 5, wherein the property of each element of the representation is at least one of: an atomic property; a property of the element relative to the multifaceted shape; and, a quantum mechanical property.

7. A computer-implemented method of training a machine learning (ML) algorithm, the method comprising: generating an augmented dataset associated with the molecule by performing the method of any preceding claim, the augmented dataset comprising a dataset and a further dataset; and inputting the augmented dataset into an ML algorithm to train the ML algorithm.

8. A computer-implemented method of predicting a property of a molecule, the method comprising: generating an augmented dataset associated with the molecule by performing the method of any of claims 1 to 6, the augmented dataset comprising a dataset and a further dataset; inputting the augmented dataset into a machine learning (ML) algorithm; and receiving as an output from the ML algorithm a predicted property of the molecule based on the augmented dataset.

9. A computer-implemented method of predicting a property of a molecule, the method comprising: generating a dataset associated with the molecule by: a) providing a representation of the molecule, the representation comprising a plurality of elements in a three-dimensional arrangement, each element comprising one or more atoms; b) surrounding the representation with a multifaceted shape, the multifaceted shape comprising a plurality of facets, the representation having a position within the multifaceted shape; c) projecting each element of the representation onto one of the facets; and d) using an encoding algorithm to generate a dataset based on the projection of step c); inputting the dataset into a machine learning (ML) algorithm; and receiving as an output from the ML algorithm a predicted property of the molecule based on the dataset.

10. The computer-implemented method of any of claims 7 to 9, wherein the ML algorithm is a neural network.

11. The computer-implemented method of claim 10, wherein the neural network is a spherical or icosahedral neural network.

12. A computer system configured to implement the method of any preceding claim.

13. A computer program comprising instructions which, when the computer program is executed by a computer system, cause the computer system to carry out the method of any of claims 1 to 11.

Description:
Molecule projection and encoding for machine learning

FIELD OF THE INVENTION

[0001] The present invention relates to generating datasets for machine learning algorithms, and more specifically datasets associated with molecules.

BACKGROUND OF THE INVENTION

[0002] Molecules have three spatial dimensions, and the 3D structure is critically important. Molecules are rotationally invariant (rotating a molecular structure describes the same object) and translationally invariant (translating a molecular structure describes the same object), but not inversion or reflection invariant (the mirror image of a molecular structure can be a different object with different chiral properties).

[0003] Chemistry datasets tend to be small, however, large datasets are useful for training neural networks (NN) using machine learning algorithms (ML algorithms). In the field of image recognition, data augmentation is a common technique to increase the effectiveness of small datasets. For example, by augmenting a first image with a mirror image of the first image it is possible to double the size of the dataset, from one image to two images. However, it is not clear how to perform data augmentation on molecular data which preserves symmetry and chiral properties.

SUMMARY OF THE INVENTION

[0004] Various aspects of the invention are set out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] Embodiments of the invention will now be described with reference to the accompanying drawings, in which:

[0006] Figure 1 shows an example implementation of predicting a molecule property using machine learning.

[0007] Figure 2 shows an example of a projection method using an icosahedron.

[0008] Figure 3 shows projections for an icosahedron, a level 1 icosphere and a level 4 icosphere net.

[0009] Figure 4 shows a resulting level 3 icosphere net based on a protein binding pocket.

[0010] Figure 5 shows example representations of an icosphere net after the projection method.

[0011] Figure 6 shows an example computer system for implementing the method.

DETAILED DESCRIPTION OF EMBODIMENT(S)

[0012] Methods are described below for encoding a chemical and/or molecular structure by projecting it onto a multifaceted shape, optionally a sphere/icosphere, which preserves symmetry and chiral properties. This also allows for augmentation of the dataset, a technique that improves machine learning outcomes by simulating the effect of a larger dataset.

[0013] The methods allow a machine learning (ML) algorithm, for example a neural network, to learn about the relationship between molecular structure and molecular properties with fewer presentations of a dataset to the ML algorithm. For simple problems, good results can be achieved after only one or two presentations of the dataset, and the ML algorithm can also be prevented from refining after 10 to 20 presentations of the data. The methods can therefore reduce the time taken to train the ML algorithm. This works because the method makes a small dataset behave more like a large dataset, by effectively increasing the size of the dataset (via the data augmentation described below). ML algorithms learn from data, and more data usually produces better results from the ML algorithm.

[0014] Figure 1 shows an example of how to convert a three-dimensional (3D) representation 11 of a molecule into a dataset 13 which can be used to train an ML algorithm 14 (using the resulting output 15) and/or can be used by the ML algorithm 14 to generate an output 15 representative of a property, which can then be used for other purposes (examples of applications are given below). The ML algorithm 14 can be used both for supervised training and for unsupervised learning.

[0015] As shown in figure 1, a first step a) of the method comprises providing a representation 11 of a molecule. The representation 11 comprises a plurality of elements in a three-dimensional arrangement, each element comprising one or more atoms. In this example the molecule is cubane. A second step b) of the method comprises surrounding the representation with a multifaceted shape 22a. The multifaceted shape comprises a plurality of facets, and the representation 11 has a defined position within the multifaceted shape 22a. In this case the multifaceted shape 22a is an icosphere (more specifically an icosahedron, because it has 20 facets).

[0016] In the next step c) of the method, each element (e.g. atom) of the representation 11 is projected onto one of the facets of the icosphere 22a to generate a projection 22b.

[0017] In the next step d) of the method, an encoding algorithm is used to generate a dataset 13 based on the projection 22b of step c). The encoding algorithm unfolds the projection to generate an icosphere net 12, and then generates the dataset 13 based on the icosphere net 12. The dataset 13 is then input into an ML algorithm 14. The dataset 13 can be input to train the ML algorithm 14, and/or the ML algorithm 14 may provide an output 15. The output 15 may be a predicted property described by a number or a category, and may be used to predict something about the molecule (e.g. the number could be its solubility or 'druglikeness' and the category could be 'active' or 'inactive' (against a target)). In the case where the ML algorithm 14 has previously been trained then the output 15 can be used to screen possible candidate molecules to pick which ones should be synthesized, for example in drug discovery. In the case where the ML algorithm 14 is being trained then the output 15 can be used to improve/train the ML algorithm 14.

[0018] The representation 11 has a plurality of elements (in this case each element is one atom, although it can be a group of atoms) in a three-dimensional arrangement. The representation of the molecule 11 can be mapped by projecting the elements onto the surface of the icosphere 22a (i.e. each of the facets of the icosphere) as 3D coordinates, 2D coordinates, text input (e.g. SMILES, SMIRKS, chemical formulae), or as binary values as shown in figure 1. The 3D coordinates can be derived using standard chemical tools.
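By way of illustration only (the application itself contains no code), the representation of step a) can be sketched in Python as a list of elements, each carrying a chemical symbol and 3D coordinates. The cube-corner coordinates below are illustrative placeholders for cubane's carbon skeleton, not measured values from the application.

```python
from itertools import product

# Step a): a molecular representation as a list of elements, each with a
# chemical symbol and illustrative 3D coordinates (one element per atom here,
# although an element could equally be a group of atoms).
def cubane_carbons():
    return [("C", (float(x), float(y), float(z)))
            for x, y, z in product((-1, 1), repeat=3)]

representation = cubane_carbons()
print(len(representation))  # 8 carbon atoms at the corners of a cube
```

In practice the 3D coordinates would come from standard chemical tools rather than being hard-coded as here.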

[0019] Known algorithms/methods for projecting the molecular structure onto an icosphere are ray tracing and ray casting. The projection 22b can be from a predefined point within the icosphere, which can be the centre of the icosphere 22a.
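A minimal sketch of central ray casting onto an icosahedron is given below; it is illustrative and not taken from the application. It builds the 20 facets from the standard golden-ratio vertex layout, and assigns an atom to the facet whose centroid direction is closest to the atom's direction from the centre, which for a regular icosahedron coincides with the facet a central ray would hit.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio

# The 12 vertices of a regular icosahedron: cyclic permutations of (0, ±1, ±PHI).
VERTS = ([(0.0, s1, s2 * PHI) for s1 in (-1.0, 1.0) for s2 in (-1.0, 1.0)]
         + [(s1, s2 * PHI, 0.0) for s1 in (-1.0, 1.0) for s2 in (-1.0, 1.0)]
         + [(s2 * PHI, 0.0, s1) for s1 in (-1.0, 1.0) for s2 in (-1.0, 1.0)])

def faces():
    """The 20 triangular facets, found as triples of mutually adjacent vertices."""
    n = len(VERTS)
    def d2(i, j):
        return sum((VERTS[i][k] - VERTS[j][k]) ** 2 for k in range(3))
    edge = min(d2(i, j) for i in range(n) for j in range(i + 1, n))
    def adj(i, j):
        return abs(d2(i, j) - edge) < 1e-9
    return [(i, j, k) for i in range(n) for j in range(i + 1, n)
            for k in range(j + 1, n) if adj(i, j) and adj(j, k) and adj(i, k)]

def project(atom_xyz, facets):
    """Step c) by central ray casting: return the facet whose outward centroid
    direction is closest to the atom's direction from the centre (the atom
    must not sit exactly at the centre)."""
    def unit(v):
        m = math.sqrt(sum(c * c for c in v))
        return tuple(c / m for c in v)
    a = unit(atom_xyz)
    def dot_with_centroid(f):
        c = unit(tuple(sum(VERTS[i][k] for i in f) / 3 for k in range(3)))
        return sum(a[k] * c[k] for k in range(3))
    return max(facets, key=dot_with_centroid)

FACETS = faces()
print(len(FACETS))  # 20 facets
```

A production implementation would use proper ray-triangle intersection (ray tracing/casting libraries from computer graphics), as the text notes.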

[0020] The method shown in figure 1 can be fully or partially repeated to improve the training of the ML algorithm 14, and/or the usefulness of the output 15 when used to predict a property of a molecule. The method can be fully or partially repeated without augmenting the dataset 13. The method can also be improved by augmenting the dataset 13 to create an augmented dataset. The augmented dataset can be made up from the dataset 13 and one (or more) further datasets associated with the same molecule. This process is called augmentation and is described below. The ML algorithm 14 may also receive as an input a dataset (or augmented dataset) based on a new molecule - this is useful for training ML algorithms and for predicting properties of new molecules. The ML algorithm 14 can be trained using a dataset, and then used to predict a property of a molecule using another dataset.

[0021] Figure 2 shows the representation 11 placed inside the icosphere 22a, which is an icosahedron (e.g. its centre could be positioned at the centre of the icosphere 22a, or anywhere within the icosphere 22a). The positions of the atoms are projected from the centre of the icosphere 22a onto the facets of the icosphere 22a (defining the surface of the icosphere) using projection lines 23. The icosphere 22a surrounds the representation 11, the icosphere 22a has a plurality of facets, and the representation 11 has a defined position within the icosphere 22a. The icosphere 22a can be scaled to fit the molecule 11, to encode relative angular information. Alternatively, the icosphere 22a need not be scaled, which gives some information on distance but limits the volume of molecule that can be processed. Example icosphere nets 24a-24d represent alternative ways to unfold the projection 22b. Icosphere nets 24a-24d are generated from the projection 22b, where the carbon atoms of the molecule structure 11 are shown by the black triangles. Multiple nets 24a-24d can be created from a single icospherical projection 22b. For example, it is possible to unfold in any direction, and there are multiple different unfoldings for the icosphere 22a (only four of which are shown in figure 2).

[0022] Icospheres are useful because they are a suitable and general shape for a wide selection of 3D molecule structures. For example, it is possible to surround a representation of a long molecule with an icosphere, although this would concentrate a lot of information on one of the axes. In this case, it is possible to use a lozenge shape or a cuboidal box (each with many facets) instead of an icosphere. For a surface, it may be desirable to use a cuboidal box or a flattened lozenge shape and then only perform the rotations around the two axes. The symmetry of the molecule(s) of interest can drive the best shape for the box. These boxes are called skyboxes in computer graphics (ray tracing/casting algorithms are commonly used in computer graphics).

[0023] Figure 3 shows an icosahedron 31, which can also be called a 'level 0 icosphere' or a 'parent icosphere'. A 'level 1 icosphere' 33 (also called a 'first descendent icosphere') can be created from an icosahedron by dividing each of the triangular faces of the icosahedron into four equilateral triangles and projecting the resulting shape outwards to make each face/facet (i.e. triangle) the same distance from the centre. This process of dividing each facet and projecting outwards can be repeated to achieve different levels of icosphere (such as the 'level 4 icosphere' corresponding to net 35), giving approximations to a sphere with an increasing number of facets.
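The subdivision step described above can be sketched as follows; this is an illustrative pure-Python sketch, not code from the application. Each triangular facet (stored as three unit-sphere vertices) is split into four, and the new midpoint vertices are pushed out onto the unit sphere, so a level-n icosphere has 20 * 4**n facets.

```python
import math

PHI = (1 + math.sqrt(5)) / 2

def unit(v):
    m = math.sqrt(sum(c * c for c in v))
    return tuple(c / m for c in v)

def subdivide(facets):
    """One level of subdivision: split every triangular facet into four and
    project the new vertices outwards onto the unit sphere."""
    def mid(a, b):
        return unit(tuple((a[i] + b[i]) / 2 for i in range(3)))
    out = []
    for a, b, c in facets:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        out += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return out

# One facet of a unit-sphere icosahedron (golden-ratio vertex layout).
seed = [(unit((0.0, 1.0, PHI)), unit((0.0, -1.0, PHI)), unit((PHI, 0.0, 1.0)))]
level1 = subdivide(seed)
level2 = subdivide(level1)
print(len(level1), len(level2))  # 4 16
```

Applying `subdivide` once to all 20 facets of the icosahedron would give the level 1 icosphere's 80 facets, twice would give 320, and so on.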

[0024] Although the icospheres shown (and corresponding icosphere nets 32, 34, 35) have faces of equal size, it is possible to unevenly divide the triangular faces of an icosahedron such that there are more triangles on some parts of the sphere (for example to deal with long/thin molecules with a lot of information along certain axes). This makes the surface smoother, with many more facets (closer to a sphere), on some parts of the resulting multifaceted shape. It will still result in a functional system, so long as a consistent multifaceted shape is used for all augmentations and the ML algorithm is coded to account for this.

[0025] Figure 3 shows an icosahedron 31 containing a representation of a molecule, and an icosahedron net 32 based on a projection onto the icosahedron 31. A 'level 1 icosphere' 33 is also shown containing a representation of a molecule, together with an icosphere net 34 based on a projection onto the level 1 icosphere 33. An increasing level of icosphere gives more fine-grained detail about atomic positions. Figure 3 also shows a 'level 4 icosphere' net 35. More fine-grained detail is useful if the molecular structure is large with many atoms, because finer-grained detail can encode the position differences of all the atoms. Some problems need this level of detail; others do not. Figure 4 shows the 'level 3 icosphere' net 42 which results from a representation of a protein binding pocket 41.

[0026] Figure 5 shows a mapping to generate a numerical dataset from the icosphere net 51 (or directly from an icosphere projection), which can be implemented with an appropriate encoding algorithm. Each face (triangle) 52 of the icosphere, and therefore of the net 51, can have a number associated with it. These faces are input to the ML algorithm 14 in a specified order, either as a vector, matrix, or list. Each atom associated with each face can be encoded as a binary string, number, atomic weight, charge, etc. Any numerical encodings may be normalised according to standard ML approaches.

[0027] Some example numerical encodings are shown in figure 5. Net 51 shows an example of a numbering scheme for the faces of the icosphere (there are many other numbering schemes). Figure 5 shows the net 53 which results from the projection of the representation 11 of the molecule in figures 1 and 2 onto an icosahedron. By cross-referencing each triangular face in number order and writing a 1 or 0, depending on whether the triangle is black or white, an encoded dataset 54 is generated. Dataset 54 is a binary encoding representing the presence/absence of a molecule/atom, and is an example of both a symbolic encoding (binary symbols in this instance) and a list or vector encoding. It is also possible to replace the binary encoding with a symbolic encoding based on the heaviest atom that was projected onto each face of the icosphere, where 0 stands for 'no atom' (it is 0, not O for oxygen), as shown in dataset 55. Dataset 55 is also a symbolic encoding as well as an example of a vector or list encoding.
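These two encodings can be sketched as follows; the face-to-atom mapping and the small mass table are illustrative assumptions, not data from the figures.

```python
# Atomic masses for the symbols used below (values as quoted in the text).
MASS = {"H": 1.0078, "C": 12.0107, "O": 15.999}

def binary_encoding(atoms_per_face, n_faces):
    """1 if any atom projects onto the face, else 0 (faces in number order)."""
    return [1 if atoms_per_face.get(i) else 0 for i in range(1, n_faces + 1)]

def heaviest_atom_encoding(atoms_per_face, n_faces):
    """Symbol of the heaviest atom projected onto each face; '0' = no atom."""
    out = []
    for i in range(1, n_faces + 1):
        atoms = atoms_per_face.get(i, [])
        out.append(max(atoms, key=MASS.get) if atoms else "0")
    return out

faces = {2: ["C", "O"], 5: ["H"]}  # illustrative: face number -> projected atoms
print(binary_encoding(faces, 6))         # [0, 1, 0, 0, 1, 0]
print(heaviest_atom_encoding(faces, 6))  # ['0', 'O', '0', '0', 'H', '0']
```

Both outputs are flat lists in face-number order, ready to be fed to an ML algorithm as a vector.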

[0028] In another type of encoding, the dataset 56 is represented by a matrix of N*3 (where N is the number of facets on the icosphere). This can be constructed by taking a three-dimensional vector for each face. For example, the face numbered '1' of icosphere net 51 has the vector [0.0, 0.0, 0.0] and the face numbered '2' of icosphere net 51 has the vector [12.0107, 15.999, 28.0097], and each of these vectors has three dimensions. In this case, we can take the inner-most atom's mass (i.e. an atomic mass value which represents the mass of the atom closest to the centre-point of the icosphere) as the x-dimension, the outer-most atom's mass (the one that is furthest from the centre of the icosphere) as the y-dimension, and finally the sum of the masses of all atoms that are projected onto that face of the icosphere as the z-dimension. For example, a carbon bonded to a hydrogen (C-H) projected onto a face of the icosphere, with C closest to the centre of the icosphere, can be encoded as [12.0107, 1.0078, 13.0185]. Similarly, a carbon bonded to an oxygen (C=O) would be [12.0107, 15.999, 28.0097]. Here the atomic mass of carbon is 12.0107, the atomic mass of hydrogen is 1.0078, and the atomic mass of oxygen is 15.999.

[0029] Therefore, projecting each element (e.g. each atom) of the representation of the molecule onto one of the facets with a ray casting technique can comprise determining a multi-dimensional vector, where a first dimension represents the atomic mass value of an inner-most element of the representation; a second dimension represents the atomic mass value of an outer-most element of the representation; and a third dimension represents the total atomic mass value projected onto each of the facets. Each element of the representation has an atomic mass value.
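The per-face three-dimensional vector can be sketched as below; the (mass, distance) tuples are illustrative assumptions, chosen only to reproduce the C-H and C=O examples quoted above.

```python
def face_vector(atoms):
    """atoms: list of (mass, distance_from_centre) tuples projected onto one
    face. Returns [inner-most mass, outer-most mass, total mass]; all zeros
    for an empty face (compare face '1' of net 51)."""
    if not atoms:
        return [0.0, 0.0, 0.0]
    inner = min(atoms, key=lambda a: a[1])[0]   # closest to the centre
    outer = max(atoms, key=lambda a: a[1])[0]   # furthest from the centre
    return [inner, outer, round(sum(a[0] for a in atoms), 4)]

# C-H with carbon closer to the centre (distances are illustrative):
print(face_vector([(12.0107, 0.5), (1.0078, 1.5)]))   # [12.0107, 1.0078, 13.0185]
# C=O with carbon closer to the centre:
print(face_vector([(12.0107, 0.5), (15.999, 1.6)]))   # [12.0107, 15.999, 28.0097]
```

Stacking one such vector per face in face-number order yields the N*3 matrix of dataset 56.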

[0030] An example dataset 56 is shown in figure 5, generated from the representation 11 of the cubane molecule shown in figures 1 and 2. This dataset 56 is an example of a matrix, or list of lists. It is possible to have more than three dimensions in the vector for each face of the icosphere, leading to a larger input. Any of these suggested encodings can be input into an ML algorithm. An icosahedral or spherical neural network also has the information as to how the faces are related to each other in space; however, these inputs can also be used in any ML algorithm, and the ML algorithm should learn the relationships between the faces if given enough data. The advantage of spherical or icosahedral NNs is that they make the problem easier for the ML algorithm in a situation where a small amount of data (i.e. a small dataset) is available (although the augmentation technique described herein will help to deal with this).

[0031] The symbolic or numerical encoding described above to generate the dataset 13 can also use numerical values relating to the atom or other element of the molecule. These numerical values can be, for example, atomic mass, charge, etc. Distances from the element to the surface can also be encoded into the dataset. For example, the inner elements (closest to the centre of mass), outer elements (closest to the surface) and the sum of the elements can be used for a rank-3 tensor input (standard for image-based ML algorithms). Using the sum of the elements as an input dimension, combined with the actual atomic masses, allows the ML algorithm to differentiate which elements are on a particular projection line (e.g. projection lines 23 in figure 2).

[0032] Augmentation is the process of deriving a multitude of different inputs (to generate a larger, or augmented, dataset) from a single input structure, and is an ML technique common in image processing. The dataset 13 (as described above, this dataset can be a vector, matrix or list) may be augmented by various augmentation methods including:

1. Performing a repeat of step d), wherein a different encoding algorithm is used in the repeat of step d). For instance step d) may be repeated by creating a different net from the same projection 22b, for instance by picking different points on the icosphere and directions for the unfolding into the net. In this case there is no need to repeat steps a)-c) of the method. A different encoding algorithm can comprise the same input projection with a different unfolding to generate a new dataset/net (i.e. the same algorithm steps using a different unfolding number - for example, unfolding from facet '1' of net 51 of figure 5 to generate a first dataset, and unfolding from facet '2' of net 51 in the repeat to generate a second dataset). Alternatively, or in addition, a different encoding algorithm can comprise different algorithm steps to generate a new dataset.

2. Performing repeats of steps b)-d), wherein a different position is specified in the repeat of step b). For example, in the repeat of step b) the representation of the molecule may be offset from the centre of the icosphere by a translation and/or rotated (N.B. different unfoldings into nets can be equivalent to set rotations). Translation can break any rotational symmetry and alignment of the elements of the molecule. In this case there is no need to repeat step a).

3. Performing repeats of steps a)-d), wherein a different three-dimensional arrangement is provided in the repeat of step a). For example this may be achieved by providing different conformers of the same molecule in step a). Conformers are different three dimensional structures that correspond to the same chemical formula and they occur as there are different ways that the same atoms can arrange themselves in 3D space (for example, boat and chair conformers of cyclohexane).

4. Performing repeats of steps a)-d), wherein a different isotope of at least one of the atoms is provided in the repeat of step a). For example, this can be achieved by switching out atoms for alternative isotopes, e.g. C-13 for C-12 (useful for spectroscopy problems).

[0033] In all of the augmentation methods 1-4 above, the repeat of step d) generates a further dataset associated with the molecule, which augments the dataset 13.
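Two of the augmentation methods above (method 2, a different position; method 4, a different isotope) can be sketched on the element-list representation as follows. This is an illustrative sketch, not code from the application, and the isotope labels and coordinates are assumptions.

```python
import math

def rotate_z(rep, angle):
    """Augmentation method 2: give the representation a different position by
    rotating it about the z-axis before repeating steps b)-d)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(sym, (c * x - s * y, s * x + c * y, z)) for sym, (x, y, z) in rep]

def swap_isotopes(rep, old="C-12", new="C-13"):
    """Augmentation method 4: relabel atoms with a different isotope
    (e.g. C-13 for C-12) before repeating steps a)-d)."""
    return [(new if sym == old else sym, xyz) for sym, xyz in rep]

rep = [("C-12", (1.0, 0.0, 0.0)), ("H-1", (0.0, 2.0, 0.0))]
augmented_inputs = [rep, rotate_z(rep, math.pi / 2), swap_isotopes(rep)]
print(len(augmented_inputs))  # 3
```

Each derived representation is then re-projected and re-encoded (the repeat of step d)) to yield a further dataset, and all of the resulting datasets together form the augmented dataset.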

[0034] Depending on which encoding algorithm is used to generate the dataset 13 (or the augmented dataset), it is then input into an appropriate ML algorithm 14. This may be a spherical neural network for spherical inputs, or an icosahedral neural network for the nets. This will work for all augmentation methods 1-4 above. The input into a spherical neural network is the original icospherical net, and the neural network knows which facets are connected to which others (including at the edges of the net), which allows it to analyse the entire icosphere. Icosahedral neural networks convert the triangular faces of the icosphere net into square pixels which also carry the knowledge of which faces are connected to each other - this offers a speed-up of the algorithm whilst also retaining the gauge equivariance (symmetry properties).

[0035] As described above, an augmented dataset can be generated by repeating certain steps of the method using one of the augmentation methods 1-4 described above. It will be understood that not all of the steps a)-d) need to be repeated; this is particularly the case when a different encoding algorithm (e.g. augmentation 1: an alternative 'unfolding' of a net) is used. For example, in this case there is no need to repeat the projection step c). All of the datasets generated by the different augmentation methods can be combined to form the augmented dataset.

[0036] All of these augmentation methods can be combined to create more variation or to augment the dataset further. A larger dataset is more useful for ML algorithms. Each augmentation method can provide more data to the ML algorithm which helps the ML algorithm output more useful results. A combination of augmentation methods can be selected based on the desired output and purpose, for example, conformers can provide extra information about the flexibility of a molecule which may be of more or less benefit depending on the purpose of the ML algorithm (i.e. a molecule property of interest).

[0037] The dataset 13 or the augmented dataset can be input into the ML algorithm 14 for processing. Examples of suitable ML algorithms include, but are not limited to: supervised learning algorithms, such as artificial neural networks, random forests, Gaussian process regression, genetic programming, naive Bayes classifiers, support vector machines and others (regression gives quantitative answers; classification gives qualitative answers); unsupervised learning algorithms, such as clustering algorithms (like k-means and mixture models), latent variable models like principal component analysis (PCA), and the expectation-maximisation algorithm; and reinforcement learning algorithms, such as Monte Carlo methods, Q-learning, deep Q-networks and others.

[0038] Supervised learning can be done by training ML algorithms when you have an answer to train against. Unsupervised learning can be done by training ML algorithms when you do not have an answer. Unsupervised learning finds patterns in the data, which can then be used to predict properties (quantitative) or membership of an algorithm-defined category (classification).

[0039] The datasets fed into the ML algorithm are matched to the input type. For example, spherical neural networks can be matched to the icosphere input type, and icosahedral neural networks can be matched to the icospherical net (unfolded icosphere) input type. ML algorithms and techniques for dealing with data which require global features (e.g. human face recognition) could alternatively be used.

[0040] Datasets for two or more molecules may be input into the ML algorithm 14 at the same time (e.g. one dataset for a representation of a ligand molecule and one dataset for a representation of a protein molecule). This type of solution should be applied to tasks which have two inputs that interact in some way. In the protein and ligand example, it is desirable to know if the ligand molecule would fit well in the protein, and this requires knowledge of both: (i) the shape of both the protein binding pocket and the ligand which will sit in it, and (ii) the chemistry (which atoms are present in which functional groups) of both the protein binding pocket and the ligand.

[0041] In the examples above, each individual atom of the representation of the molecule is projected onto a respective facet in the projection step c). In an alternative example, instead of each atom of the representation being projected onto one of the facets, a group of atoms of the representation (which form a definable sub-structure, e.g. individual amino acid residues, as used to describe proteins) can be projected onto one of the facets. The benefit of this technique is to simplify the problem (by reducing the number of objects to consider) whilst keeping the rough shape needed for the task. This may also be called coarse-graining. In both cases, each element of a molecular representation is projected onto a facet. In the first example, the projected elements are individual atoms, and in the alternative example the projected elements are groups of atoms.

[0042] ML algorithms are typically trained by presenting a large subset of the data (the remaining small subset can be used to test and validate the training). This can involve many presentations of the data (i.e. the data is presented to the ML algorithm many times so that the ML algorithm can extract as many useful learnings as possible), which the ML algorithm uses to refine its solution to the problem. This is done by calculating the value of some learning function (for example the error from the known solution) and using that value to update the values in the ML algorithm so that it gets closer to the hoped-for learning function value. It has been suggested that the ideal is 'one-shot learning', where the ML algorithm requires only one presentation of the data (although a few presentations is acceptable). However, presenting the data many times raises the risk of overtraining. Overtraining is where the ML algorithm learns to memorise the input data instead of learning the general rules. This is not ideal, as the goal is for the ML algorithm to use the general rules so that it can be successfully applied to new data, such as a representation of a new molecule where a particular property is of interest.
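The train/validate split and repeated presentations described above can be sketched as follows. This is an illustrative Python sketch only, not part of the claimed method; the `model` object and its `update(x, y)` and `error(x, y)` methods are hypothetical stand-ins for whichever ML algorithm is chosen.

```python
import random

def train_with_holdout(model, dataset, train_fraction=0.8, max_epochs=20):
    """Train on a large subset of the data; hold the rest back for validation.

    `model` is a hypothetical object exposing `update(x, y)` and
    `error(x, y)` methods; both names are illustrative assumptions.
    """
    data = list(dataset)
    random.shuffle(data)
    split = int(train_fraction * len(data))
    train, holdout = data[:split], data[split:]

    for _ in range(max_epochs):              # repeated presentations of the data
        for x, y in train:
            model.update(x, y)               # refine toward the known answer
        # Rising holdout error while training error falls would signal
        # overtraining (memorising the data rather than the general rules).
        val_error = sum(model.error(x, y) for x, y in holdout) / len(holdout)
    return model, val_error
```

The held-out error gives an early warning of the overtraining behaviour discussed above: a model that memorises the training data will keep improving on the training subset while worsening on the holdout subset.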

[0043] The methods presented herein allow the ML algorithm to learn with fewer presentations (for simple problems, decent results can be achieved after only one or two presentations, and the ML algorithm need not refine further after 10 to 20 presentations of the data). Therefore the method presented herein may reduce the time taken to train the ML algorithm. This is achieved because the method makes a small dataset more like a large dataset, by effectively increasing the size of the dataset (via the data augmentation methods described previously). ML algorithms learn from data, and more data is usually better. The selected data augmentation can also provide the ML algorithm with information about what the signal in the dataset is (the general rules we want it to learn) and what the noise in the dataset is (the unimportant information). By using the augmentation methods described above, we provide the information that the relative arrangement of atoms (the shape) is important but the actual coordinates of those atoms in space are not.

[0044] If the ML algorithm is overtrained, it memorises the coordinates of the atoms and relates them to the molecule; if it is well trained, it learns the relative positions, allowing it to relate the same molecule in a different position in space to the original molecule.

[0045] The output 15 of the ML algorithm depends on the problem being solved. For example, it could be a number relating to some predicted chemical property of the molecule (e.g. the solubility, the energy, etc.), or it could be a predicted property of the molecule defined as a category (e.g. good drug candidate or bad drug candidate). Depending on the output required, alternative ML algorithms may be selected: for outputs represented by a number relating to some predicted property, a regression form of the ML algorithm must be used, whereas for category outputs the classification form is used (the same ML algorithm can often be written with either a regression or a classification output form).

Alternatives

[0046] Various alternatives will now be described.

[0047] Elements (e.g. a single atom, or sub-structures of atoms) of the representation of the molecule can be encoded as:

1. symbols (see figure 5, reference numerals 54, 55)

2. colour values (which are converted to 3D numbers)

3. atomic properties e.g. atomic weight, atomic charge, formal charge (see figure 6, reference numeral 56)

4. properties external to the molecule, e.g. the distance from the atom to the surface of the icosphere

5. quantum mechanical properties, e.g. the density of a molecular orbital on that atom

6. any encoding a computer or hardware chip can use.

[0048] Alternatively, atoms can be encoded as several of these properties (1 to 6) or a combination of these properties (1 to 6).
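As an illustration of encoding an element as a combination of the properties above (options 3 and 4), the following Python sketch builds a per-element feature vector. The property table, function name and feature ordering are illustrative assumptions, not part of the claimed method; the atomic weights shown are standard values.

```python
# Hypothetical per-atom property table (standard atomic weights); the
# table itself is an illustrative assumption for this sketch.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def encode_element(symbol, formal_charge=0, distance_to_surface=None):
    """Encode one element as a combination of properties (options 3 and 4).

    Returns a flat list of numbers suitable as a per-facet feature vector.
    `distance_to_surface` is the optional external property of option 4
    (distance from the atom to the surface of the icosphere).
    """
    features = [ATOMIC_WEIGHT[symbol], float(formal_charge)]
    if distance_to_surface is not None:
        features.append(float(distance_to_surface))
    return features
```

For example, `encode_element("C")` combines options 3 only, while `encode_element("O", -1, 2.5)` combines options 3 and 4 in a single vector, as contemplated in paragraph [0048].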

[0049] Alternatively, the method can be extended from atoms to groups of atoms, e.g. individual amino acid residues (as used to describe proteins). The benefit of this is to simplify the problem (because there are fewer elements to consider) whilst keeping the rough shape needed for the task (this process can be called coarse-graining).

[0050] Any ML algorithm can be used and the choice depends on the encoding chosen, the properties of the ML algorithm required and the task (i.e. the desired output of the ML algorithm).

Example

[0051] An example will now be described, which explains how the method of the present invention could be used to calculate solubility, an important property to consider when developing pharmacologically active molecules (medicines).

[0052] For example: if the task was to calculate the solubility of a molecule from its chemical structure, an option is to create the icospherical nets as a list of lists (similar to figure 5, dataset 56) where each triangle (face) is associated with a list of three numbers. The first number refers to the atomic mass of the inner most atom, the second number refers to the atomic mass of the outer most atom and the third number refers to the sum of the atomic masses of all atoms that project onto that face. This can be done if the inner-most and outer-most atoms are thought to be the most important to the problem. Note that the number of atoms that project onto a face can be more than two.
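The per-face triple described above (inner-most atomic mass, outer-most atomic mass, total projected mass) can be sketched as follows. The `projections` data structure, and the use of distance from the icosphere centre to identify the inner-most and outer-most atoms, are assumptions made for illustration.

```python
def face_triples(projections):
    """Build the per-face list of three numbers described in [0052].

    `projections` maps a face index to a list of (atomic_mass, distance)
    pairs for every atom projected onto that face, where `distance` is the
    atom's distance from the icosphere centre (assumed convention:
    smallest distance = inner-most atom, largest = outer-most atom).
    Faces with no projected atoms get [0.0, 0.0, 0.0].
    """
    triples = {}
    for face, atoms in projections.items():
        if not atoms:
            triples[face] = [0.0, 0.0, 0.0]
            continue
        inner = min(atoms, key=lambda a: a[1])[0]   # mass of inner-most atom
        outer = max(atoms, key=lambda a: a[1])[0]   # mass of outer-most atom
        triples[face] = [inner, outer, sum(m for m, _ in atoms)]
    return triples
```

Note that a face with more than two projected atoms still yields exactly three numbers, with the intermediate atoms contributing only to the summed mass.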

[0053] A list of lists can be converted into a tensor. The facets have a position in space relative to each other (see figure 5, nets 51, 53). For example, in figure 5 we see that face number '12' of net 51 is directly above face number '17'. So, by incorporating the relationship between the faces, we can convert a list or list of lists into a 2-D representation. For example, using the relation between the faces in figure 5, and assuming that the list 54 of binary values in figure 5 is given in ascending numerical order (i.e. 1, 2, 3, ..., 20), the binary list 54 or vector 56 can be converted into the 2D representation of net 53 by colouring the faces with a '1' value black and colouring the faces with a '0' value white. The actual input to the ML algorithm 14 can be the dataset 54 and the relationship between the faces as shown in net 51. The ML algorithm can then learn from that input.
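The conversion of a per-face list into a 2-D representation using the relationship between the faces can be sketched as below. The `layout` mapping from face numbers to grid positions depends on the chosen unfolding (figure 5, net 51) and is supplied here as an assumption rather than derived; a real icosahedral net would have 20 faces.

```python
def net_to_grid(values, layout):
    """Lay the per-face values out as a 2-D array.

    `values` is a list of per-face values in ascending face order (face 1
    first). `layout` maps a face number to a (row, col) position in the
    unfolded net; this mapping is an assumption tied to the chosen
    unfolding. Grid cells not covered by any face are filled with None.
    """
    rows = 1 + max(r for r, _ in layout.values())
    cols = 1 + max(c for _, c in layout.values())
    grid = [[None] * cols for _ in range(rows)]
    for face, value in enumerate(values, start=1):
        r, c = layout[face]
        grid[r][c] = value
    return grid
```

With a binary list of values, the resulting grid corresponds directly to the black/white colouring of net 53 described above, and the grid (rather than the flat list) carries the face-adjacency information into the ML algorithm.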

[0054] In this example, using the encoded dataset 56 in figure 5, the input is converted to three 2D representations (a 4D tensor). This is a common technique in ML, known from convolutional neural networks (CNNs) (the spherical and icospherical neural networks are a type of CNN).

[0055] In addition, other information can optionally be added into the ML algorithm 14 (e.g. neural network) following standard techniques, for example single 1D vector inputs, such as some simple properties of the molecule like atomic weight or charge.

[0056] The ML algorithm 14 then computes on these inputs according to how it works (there are many ML algorithms, which all work in different ways) and provides an output 15.

[0057] In this example, the icosahedral net of one unfolding of a projected molecular structure is input as described, together with the expected solubility value (a number like 1.234) as a single 1D vector input. The ML algorithm 14 predicts the solubility value based on the values within it (weights and biases in NNs). The error is then calculated to determine the output error of the ML algorithm; for example, if the correct answer is known to be 1.234 and the NN gives the answer 1, then the output error is -0.234. The ML algorithm will then update its weights and biases (following an update rule specifying how to do this) to get closer to the correct output, using known techniques.

[0058] One of the augmentation methods 1-4 above is applied. So, instead of inputting one dataset, a multitude of different datasets are input, resulting from an augmentation such as 'unfolding' the icosphere a different way. This is equivalent to rotating the molecule a little and inputting it again. So, a different icosahedral input net is trained against the expected solubility value, and the ML algorithm calculates the answer. Imagine in this case the ML algorithm outputs the answer 2.0; the output error is now +0.766, and the ML algorithm will again update its weights and biases accordingly. With more examples, or more repetitions of the known examples, the ML algorithm learns to relate the set of icosahedral nets to the solubility 1.234. This is the process of training the ML algorithm.
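The error-driven update loop of paragraphs [0057] and [0058] can be sketched with a deliberately simple stand-in model: a single weight vector plus a bias in place of a real network's weights and biases. The function name, learning rate and flat-list input format are illustrative assumptions.

```python
def train_on_augmentations(nets, target, lr=0.1, epochs=200):
    """Sketch of the update loop described in [0057]-[0058].

    `nets` is a list of augmented inputs (flat lists of numbers, one per
    unfolding of the same molecule), all trained against the same
    `target` solubility value. A weight vector plus bias stands in for a
    real network's weights and biases; `lr` is an assumed learning rate.
    """
    weights, bias = [0.0] * len(nets[0]), 0.0
    for _ in range(epochs):
        for net in nets:
            predicted = sum(w * x for w, x in zip(weights, net)) + bias
            error = predicted - target            # e.g. 1.0 - 1.234 = -0.234
            # Update rule: move the weights and bias against the error.
            weights = [w - lr * error * x for w, x in zip(weights, net)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, net):
    """Apply the trained stand-in model to one icosahedral net."""
    return sum(w * x for w, x in zip(weights, net)) + bias
```

After training, the model relates every augmented net of the molecule to the same solubility value, which is the behaviour the augmentation is intended to teach.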

[0059] In one use case, the objective is to create a new molecule, for example a new candidate drug. Solubility would be a property of particular interest because drugs are not useful outside a certain solubility range (as the drugs cannot get to where they need to be). A predicted solubility value is therefore useful when suggesting a particular molecule, as it can avoid the need to spend months making the molecule before its solubility can be measured. Therefore, a ML algorithm trained on icospherical nets of known (made) molecules (optionally using augmentation as described above) and their solubility can be used to suggest whether the solubility of a new molecule is potentially suitable.

[0060] It is possible to predict a property (e.g. solubility) of a new molecule using a ML algorithm 14 trained using the above process. The new molecule's molecular structure can be used to generate an icospherical net as described. It is possible to use one net or a plurality; using multiple nets of a single molecule (or another method of augmenting the dataset) gives more robust answers. The dataset or datasets are input into the ML algorithm, and an estimate or estimates (i.e. a prediction) of the solubility are given. Based on this predicted value, a scientist can decide whether or not to synthesise the molecule in the laboratory.

[0061] If multiple augmentations are used, then a good prediction can be made by accounting for all of the output estimates of solubility (for example, statistical analysis could be undertaken). Using multiple nets, it is possible to apply statistics to make the answer more robust, for example by calculating the mean or median value and confidence intervals from the different estimates, averaging out the error in the ML algorithm itself.
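The statistical combination of multiple estimates can be sketched as follows, using Python's standard `statistics` module. The 95% interval formula (mean plus or minus 1.96 standard errors) is one common statistical choice, assumed here for illustration.

```python
import statistics

def summarise_predictions(estimates):
    """Combine solubility estimates from multiple augmented nets.

    Returns the mean, the median and a rough 95% confidence interval
    (mean +/- 1.96 standard errors), averaging out per-net error as
    described in [0061].
    """
    mean = statistics.mean(estimates)
    median = statistics.median(estimates)
    se = statistics.stdev(estimates) / len(estimates) ** 0.5
    return mean, median, (mean - 1.96 * se, mean + 1.96 * se)
```

A wide interval relative to the decision threshold would suggest running further augmentations before deciding whether to synthesise the molecule.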

Alternative data inputs

[0062] Additional input features can be calculated from the structure of the representation of the molecule and input to improve training. Examples of input features include: physicochemical properties, topological features, connection matrices, other shape-based inputs and so on.

Apparatus

[0063] Example apparatus using the disclosed techniques for ML are mentioned below.

[0064] Example 1. Inputting the nets into a spherical or icosahedral neural network and training against the output. These neural networks have convolutions which can pick up features (for example arrangements of atoms) and combine them at the higher layers to calculate useful properties. This type of apparatus is sensitive to local structure and relative arrangements.

[0065] Example 2. Inputting the nets into locally connected or fully connected neural networks (NNs). These types of NNs are used for facial recognition, where the global arrangement of features is important. Using this type of NN works on the global structure of the molecule.

[0066] Example 3. Examples 1 and 2 can be combined to create a NN that works with both global and local data.

[0067] Example 4. The net inputs can be made from scaled icospheres or unscaled (fixed size) icospheres. These can both be put into a NN (for example those discussed above) to create a NN that is sensitive to both relative angle and shape and to specific distance/size.

[0068] Example 5. Instead of using the projections of a rotating molecule onto the surface of the icosphere as a series of still pictures, the projections can be rendered as a movie and input into ML algorithms that deal with video data. As the molecule is rotated within the icosphere, the atomic projections for an atom will draw out a pattern, and this can be encoded as a list of triangular faces. These patterns allow all atoms to be input. This type of input requires an ML algorithm that can handle sequential data (e.g. recurrent neural networks (RNNs)).

[0069] Example 6. Alternatively, the movement of individual atoms over the projections during the rotation can simply be coded, for each frame of the rotation, as a list of the face numbers of the icosahedron onto which each atom is projected. This type of data can be fed into ML algorithms which deal with sequential or time-series data (for example RNNs, but there are others), and these can be trained in a way known in the art of ML algorithms. This input method offers the advantage of encoding the position of every atom in an efficient manner. Including the distance from atom to face as an input makes the inverse problem of going from ML-encoded data to a structure trivial; this problem must be solved to do de novo molecule suggestion (see example application 3 below).
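The per-frame face-index encoding of Example 6 can be sketched as follows. The `frames` data structure (one dict per rotation frame, mapping atom identifiers to face indices) is an illustrative assumption.

```python
def rotation_sequence(frames):
    """Encode a rotating molecule as per-atom face-index sequences.

    `frames` is a list of dicts, one per frame of the rotation, mapping
    an atom identifier to the icosahedron face (0..19) it currently
    projects onto. The result maps each atom to its list of face indices
    over time, a form directly usable by sequence models such as RNNs.
    """
    atoms = frames[0].keys()
    return {atom: [frame[atom] for frame in frames] for atom in atoms}
```

Each atom's sequence traces the pattern it draws out on the icosphere surface during the rotation, so every atom is encoded even when several project onto the same face in a given frame.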

Example applications

[0070] Example applications using the disclosed techniques for ML are mentioned below.

[0071] Example 1: Predicting molecular properties. The 3D structure of a molecule is encoded as described above and fed into a ML algorithm, for example a neural network. The output would be a quantitative number, such as the solubility or partition coefficient (both used as measures of 'druglikeness' for suggested pharmacological molecules), or a qualitative measure such as 'active' or 'inactive' for treating a specific ailment (e.g. 'active' or 'inactive' at preventing HIV viruses from infecting cells). These types of applications are useful for picking which suggested molecules should be synthesised for further experimental testing in drug development. The same approach can be taken to screen suggested molecules in other areas, for example materials development, dyes, etc.

[0072] Example 2. Protein binding involves taking the structure of an enzyme (a type of protein) binding pocket and a ligand (a molecule which binds in the binding pocket, usually a pharmacological molecule which can then modify the protein's action to treat disease, etc.). In this problem, the 3D structure of both binding pocket and ligand is critical (see the lock-and-key enzyme model). The 3D structure of both protein binding pocket and ligand is encoded into icospherical nets and run through a dual-input (deep) neural network (one input for the ligand, one for the protein), perhaps alongside further input features (see above). For supervised ML, the output can be a quantitative number, such as the binding coefficient of ligand and protein (a measure of how well the ligand binds to the protein, which is a measure of drug efficacy), or a qualitative measure such as 'active' or 'inactive' for binding to a specific protein. Both these measures can be used for screening suggested ligand compounds for efficacy in target identification. A related problem of host-guest complexes (host = man-made non-protein large molecules that also possess a binding pocket, guest = small ligand/molecule) can be solved in exactly the same way, with similar quantitative or qualitative outputs. These systems are currently used for environmental sensors (e.g. detecting pollutants) or chemical sensors (for example, detecting the presence of certain molecules in saliva/blood etc. for medical diagnosis, e.g. blood sugar sensors for diabetics).

[0073] Example 3. De novo drug design is the method of inventing new molecular structures that could be synthesised. There are ML algorithms, such as autoencoders and others, that can convert input into a new space in which the inputs are transformed and organised in a 'sensible' way. This new space can then be explored to generate new versions of the input, which are then read out. After encoding many molecular structures using the method described herein and inputting them into such an ML algorithm (e.g. an autoencoder), novel molecules can be generated. The method can be used to find molecules similar to a set of input molecules, or wildly different from anything seen before. This can be used for inventing new candidate drug structures, new industrial chemicals and so on.

[0074] Example 4. Methods of unsupervised ML can be used with the input nets to map 'chemical space'. These maps can then be used to increase chemical understanding and find new relationships between molecules, which can aid designing molecules for specific uses (e.g. drug design, materials design, finding new 'greener' solvents, etc.).

[0075] Example 5. The size and shape of the skybox (the icosphere in the examples above) can be expanded; a lozenge shape or cuboidal shape could be created as an alternative to the icosphere. These could be used with larger molecules or surfaces. Inputting these molecules into an ML algorithm can be used to design new materials and surfaces (e.g. battery and solar cell materials).

[0076] Example 6. Retrosynthesis is the process of deciding how to make (synthesise) a given molecule (i.e. which reactions to perform in the laboratory). Retrosynthesis algorithms can have difficulty keeping track of chirality (which is critically important). As research in molecules (especially drugs) is moving from simple molecules with one or two chiral centres to molecules with many, it is important to keep track of both chirality and the resulting 3D structure. Using icospherical nets as the input to retrosynthesis algorithms, chirality can be conserved, offering synthetic routes that produce molecules with the correct chirality and shape.

[0077] Example 7. Reaction prediction is the inverse problem to retrosynthesis: here it is desirable to predict what the output of a reaction will be. Again, chirality and 3D shape are important, and again using the nets as input to ML algorithms will preserve this. Inputting the rotating series data (see apparatus example 5) for everything in the flask (reactants, products, solvent, catalyst, etc.) may be the best way to go about this.

[0078] The examples described herein are shown with respect to an icosahedron/icosphere although the techniques are also suitable for any multifaceted shape surrounding the representation of the molecule.

[0079] Figure 6 schematically illustrates a computer system configured to implement the methods described above. The computer system comprises a data processor 61, an input device 62, an output device 63, and a machine readable storage medium 64 such as a hard drive. The storage medium 64 is a non-transitory computer-readable storage medium which stores a computer program 65, comprising instructions which, when the computer program is executed by the computer system, cause the computer system to carry out any of the methods described above. The storage medium 64 also stores a database 66 of augmented datasets generated by any of the methods described above.

[0080] References to the multifaceted shape (such as an icosphere) surrounding the representation of the molecule can be understood to mean either fully enclosing the molecule, or alternatively substantially or partially enclosing the molecule.

[0081] Although the invention has been described above with reference to one or more preferred embodiments, it will be appreciated that various changes or modifications may be made without departing from the scope of the invention as defined in the appended claims.