BEHSAZ BAHAR (US)
GULER MUSTAFA (US)
LEE YI-YUAN (US)
MONGIA MIHIR (US)
YAN DONGHUI (US)
CAO LIU (US)
WHAT IS CLAIMED IS: 1. A method comprising: receiving data representing gene clusters, the gene clusters including one or more genes configured to encode one or more polypeptides or other small molecules; accessing a machine learning model, the machine learning model being trained with a training dataset that associates the gene clusters to structures of one or more small molecules represented in the data; applying the machine learning model to the data representing the gene clusters; identifying, based on applying the machine learning model, one or more monomers associated with at least one gene cluster represented in the data; and determining a structure for a natural product including the one or more monomers. 2. The method of claim 1, wherein the machine learning model is trained by performing operations comprising: accessing a set of hypothetical structures for natural products including the structure for the natural product; generating a set of random structures of molecules, the random structures including small molecules; testing, using mass spectrometry data representing known structures and the set of random structures, the set of hypothetical structures for the natural products including the structure for the natural product; generating a score for the structure, the score indicating a match between the structure and a known structure represented in the mass spectrometry data; filtering, based on the score, one or more hypothetical structures from the set of hypothetical structures to generate a filtered set of hypothetical structures that includes the structure for the natural product; and generating the training dataset for training the machine learning model, the training dataset including the filtered set of hypothetical structures. 3. The method of claim 1, further comprising: determining a class associated with the gene clusters; and accessing, based on the class, a set of training data that is specific to the class associated with the gene clusters. 4. The method of claim 1, further comprising predicting a biological activity of an identified natural product based on the machine learning model that is trained with the training dataset. 5. The method of claim 4, further comprising generating, based on predicting the activity, a data library comprising data that associates a gene cluster with a respective biological activity. 6. The method of claim 1, further comprising purifying the natural product based on the determined structure. 7. The method of claim 1, wherein determining the structure for the natural product including the one or more monomers comprises: predicting, based on the one or more monomers that are identified, a core molecule that is assembled by combining a group of monomers; determining one or more particular gene clusters represented in the data that cause a change of a structure of the core molecule; and identifying an enzyme associated with one or more particular gene clusters that cause the change to the structure of the core molecule. 8. The method of claim 7, wherein the core molecule includes a peptide, and wherein the change comprises an addition of an amino acid. 9. The method of claim 7, wherein the change comprises an addition of a lipid tail to the core molecule. 10. The method of claim 7, wherein the change comprises an addition of a monomer to the core molecule. 11. The method of claim 1, wherein the data representing the gene clusters comprises one or more data signatures, wherein data signatures comprise a location of a gene cluster with respect to one or more other gene clusters; and wherein determining the structure for a natural product including the one or more monomers is based on the data signatures. 12. A system for searching a database to identify structures of molecular compounds from mass spectrometry data, the system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising: receiving data representing gene clusters, the gene clusters including one or more genes configured to encode one or more polypeptides or proteins; accessing a machine learning model, the machine learning model being trained with a training dataset that associates structures of small molecules to one or more of the gene clusters represented in the data; applying the machine learning model to the data representing the gene clusters; identifying, based on applying the machine learning model, one or more monomers associated with at least one gene cluster represented in the data; and determining a structure for a natural product including the one or more monomers. 13. The system of claim 12, wherein the machine learning model is trained by performing operations comprising: accessing a set of hypothetical structures for natural products including the structure for the natural product; generating a set of random structures of molecules, the random structures including small molecules; testing, using mass spectrometry data representing known structures and the set of random structures, the set of hypothetical structures for the natural products including the structure for the natural product; generating a score for the structure, the score indicating a match between the structure and a known structure represented in the mass spectrometry data; filtering, based on the score, one or more hypothetical structures from the set of hypothetical structures to generate a filtered set of hypothetical structures that includes the structure for the natural product; and generating the training dataset for training the machine learning model, the training dataset including the filtered set of hypothetical structures. 14. The system of claim 12, the operations further comprising: determining a class associated with the gene clusters; and accessing, based on the class, a set of training data that is specific to the class associated with the gene clusters. 15. The system of claim 12, the operations further comprising predicting a biological activity of an identified natural product based on the machine learning model that is trained with the training dataset. 16. The system of claim 15, the operations further comprising generating, based on predicting the activity, a data library comprising data that associates a gene cluster with a respective biological activity. 17. The system of claim 12, the operations further comprising purifying the natural product based on the determined structure. 18. The system of claim 12, wherein determining the structure for the natural product including the one or more monomers comprises: predicting, based on the one or more monomers that are identified, a core molecule that is assembled by combining a group of monomers; determining one or more particular gene clusters represented in the data that cause a change of a structure of the core molecule; and identifying an enzyme associated with one or more particular gene clusters that cause the change to the structure of the core molecule. 19. The system of claim 18, wherein the core molecule includes a peptide, and wherein the change comprises an addition of an amino acid. 20. The system of claim 18, wherein the core molecule includes a non- ribosomal peptide. 21. The system of claim 18, wherein the core molecule includes a ribosomally synthesized and post-translationally modified peptide. 22. The system of claim 18, wherein the core molecule includes a polyketide. 23. The system of claim 18, wherein the core molecule includes a saccharide or aminoglycoside. 24. The system of claim 18, wherein the change comprises an addition of a lipid tail to the core molecule. 25. The system of claim 18, wherein the core molecule includes a hybrid of non-ribosomal peptide and/or ribosmally synthesized and post-translationally modified peptide and/or a polyketide and/or a saccharid or aminoglycoside, and wherein the change comprises an addition of a monomer. 26. The system of claim 18, wherein the data representing the gene clusters comprises one or more data signatures, wherein data signatures comprise a location of a gene cluster with respect to one or more other gene clusters; and wherein determining the structure for a natural product including the one or more monomers is based on the data signatures. 27. The system of claim 18, wherein structures of predicted molecules are stored in a computer format that allows for accelerated search against mass spectra. |
y , Streptomyces genome sequences # # [00157] DeepRiPP extracts more ORFs than bgc2orf, while NeuRiPP extracts fewer ORFs. DeepRiPP extracts fewer cores than orf2core leading to a larger number of predicted core peptides by seq2ripp modules compared to DeepRiPP across the sampled genomes. The BLAST method, which finds regions of local similarity between potential and known ORFs, extracts fewest number of ORFs and cores, while the exhaustive method, which extracts all the substrings of the ORFs with length ranging from 3 to 30 amino acids, results in the largest amount of ORFs and cores. [00158] Analyzing 46 PoDP datasets with 1,036 genomes, seq2ripp predicts 17,505 BGCs, 54,605 ORFs, 118,052 cores and 30,687,610 unique RiPPs (Table 4). After searching these RiPPs against corresponding spectra with Dereplicator+, three novel RiPPs are identified, as shown in FIG. 6F. [00159] Lasso-1648 is identified from Streptomyces NRRL B-2660, containing an N-terminal macro- lactam ring between N1 and D6. Based on Seq2ripp predictions, the PTM is applied by Asn-synthase (PF00733.18). Lanthi-1794 is identified from Streptomyces WC-3904. A dehydroalanine (Dha) located at S6, and three dehydrobutyrines (Dhb) located in T2, T10, T15 are potentially connected to one of the cysteines in the core peptide, froming lanthionine (Lan) or methyl-lanthionine (MeLan) rings. The PTM is applied by Lantibiotic dehydratase (PF05147.6 and PF04738.6). Lasso- 1795 is identified from Streptomyces NRRL B-2660 and WC-3560, containing an N- terminal macrolactam ring between Q1 and D8. The PTM is applied by Asn-synthase (PF00733.18). [00160] Mass spectral datasets from the PoDP database were searched against the corresponding RiPP molecules from hypoNPAtlas using Dereplicator+. At a false discovery rate (FDR) of 1% (score threshold of 15), 64 unique RiPPs (131 molecule- spectrum matches) were discovered. [00161] Seq2ripp found 48,542 RiPP BGCs, 86,562 ORFs, and 2,159,946 cores in 2,002 Streptomyces draft microbial genomes (Table 4). [00162] Benchmarking bgc2orf. Bgc2orf is compared with DeepRiPP and NeuRiPP, state-of- the-art tools for RiPP ORF identification. Since DeepRiPP and NeuRiPP are pre- trained with different datasets, two separate experiments were conducted to compare the prediction accuracy between bgc2orf and DeepRiPP/NeuRiPP on the MIBiG dataset. In the first experiment MIBiG RiPPs were discarded that were used in the DeepRiPP training data, and in the second experiment MIBiG RiPPs that were present in the NeuRiPP training data were discarded. In the first experiment bgc2orf achieved 75.0% accuracy in comparison to 70.0% for DeepRiPP, while in the second experiment bgc2orf achieved 83.3% accuracy, in comparison to 68.4% for NeuRiPP. [00163] Orf2core generates top k cleavages for each ORF, where k is a user-adjusted parameter. Supplementary Figure S11 shows the tradeoff between accuracy and number of predicted core peptides for selecting k. Moreover, orf2core includes a repeat-finder module for identification of cores with repetitive patterns (e.g. cyanobactins and plant RiPPs). On a test dataset of 165 cores from MIBiG database (excluding training data of DeepRiPP and orf2core), orf2core correctly identified 48.48% of cores, in comparison to 35.00% for DeepRiPP. When the top 5 pairs of cleavage sites were allowed, orf2core correctly identified 63.03% of cores. [00164] Bioactivity of novel ribosomal peptides from the human microbiota. We further tested the bioactivity of three novel N-formylated ribosomal peptides discovered by HypoN- PAtlas from the human microbiota against human GPCRs. The results show that these peptides have significant agonist activity at formyl peptide receptor 1 (FPR1) by measuring the induction of β-arrestin 2 recruitment. These results demonstrates the potential of hypoNPAtlas in discovering novel bioactive ribosomal peptides. [00165] Plant-seq2ripp is a variant of seq2ripp algorithm tuned for discovering novel BURP-domain-derived ribosomal peptides from plant species, e.g., lyciumin, legumenin, mono- and bicyclic cyclopeptide alkaloids, cercic acid and stephanotic acid. [00166] The algorithm was applied to transcriptomic and metabolomic datasets collected on 62 plant species in order to test its potential for characterizing new RiPP chemistry from plants. Within these datasets, plant-seq2ripp correctly connected known plant RiPPs of known classes with corresponding BURP-domain precursor peptides, for example, legumenin with AhyBURP, lyciumin B with LbaLycA, cercic acid with CcaBURP1, stephanotic acid with CcaBURP2, and cyclopeptide alkaloids selanine A and B with SkrBURP. Moreover, the algorithm discovered novel plant RiPPs of known classes including four new bicyclopeptide alkaloids (cores: ILLYPSY, VLFYRSY, FLLYPY, FLLYPSY) and one new monocyclic cyclopeptide alkaloid (core: ILLY) derived from Selaginella kraussiana precursor SkrBURP, two lyciumins (core: QPFGVFAW, QPFGVFSW) derived from a BURP-domain precursor of Jeffersonia diphylla, and a stephanotic acid (core: QLKVW) derived from Cercis canadensis precursor CcaBURP2. The validation of aforementioned plant RiPPs are conducted by transient expression of the matched BURP-domain precursor gene in N. benthamiana. Among these, Jeffersonia diphylla is a member of the Berberidaceae, and in this genera no lyciumins have previously been reported. For six of the novel plant RiPPs, we conducted transient expression of the matched BURP-domain precursor gene in N. benthamiana via Agrobacterium tumefaciens LBA4404 infiltration with the pEAQ-HT expression system. After collecting LC-MS/MS, spectra corresponding to the observed ORFs were detected, confirming that the predicted spectra are indeed produced by the predicted ORFs. [00167] Plant-seq2ripp predicted a novel BURP-domain-derived cyclic pentapeptide with a mass of 615.773 Da from Elaeagnus pungens with a Pro-Tyr- macrocyclization, shown in FIG. 6F. This predicted crosslink was characterized by NMR and Marfey’s analysis of the purified natural product named elaeagnin. The elaeagnin PTM, a macrocyclic ether bond between the β-carbon of a proline and the phenol-hydroxy- group of a tyrosine, was previously proposed in a plant peptide derived from soybean BURP-domain precursor GLYMA 04G180400 and seq2ripp was able to discover this PTM using variable search of mass spectra, without the need to incorporate the modification. Elaeagnin was verified as a RiPP by transient expression in N. benthamiana and subsequent detection of elaeagnin in transgenic tobacco leaves after six days. 14 tandem mass spectra are included, where 7 of them are the spectra of six confirmed peptides and elaeagnin (on left), and another 7 are their confirmation in tobacco (on right). The discovery of elaeagnin, which belongs to a new class of BURP-domain-derived RiPPs defined by the Pro-Tyr- macrocyclization, showcases the power of seq2ripp in identifying novel classes of RiPPs. [00168] HypoNPAtlas Server. The hypoNPAtlas webserver currently includes hypothetical RiPPs from 22,671 complete microbial genomes from RefSeq. Users can select specific strains / taxonomic clades and download the corresponding BGC, ORF, core and molecular structure data. Additionally, the hypoNPAtlas webserver supports the processing of input genomic data from users. The fragmentation of all the molecular structures are preprocessed and can be searched against mass spectral datasets using Dereplicator+. [00169] FIG. 6D shows an example process 370 for determining a molecular structure based on genomic data. At step 372, the data processing system 102 identifies BGCs through their modification enzymes. For example, dehydration enzymes are frequently found in some classes of RiPPs. At step 374, short ORFs within 10,000 bp of these enzymes are detected by the data processing system 102 as candidate structural ORFs. At step 376, fragments of the structural ORFs with lengths between 5 to 30 amino acids are extracted by the data processing system 102 as candidate precursor peptides. The data processing system 102, depending on the tailoring enzymes in the BGC, optionally applies corresponding modifications to the precursor peptides to form hypothetical molecules [00170] FIG.6E shows an example process 380 for mapping structures of molecules to spectra data. More specifically, FIG.6E shows an annotation of radamycin mass spectra from GNPS actinomycete dataset MSV000078937 using a dereplicator+ model. The fragments at depth 1 are shown in bracket 382, while the fragments at depth 2 are shown in bracket 384. Fragments and peaks that are annotated by dereplicator+ but not by the string scoring method are marked with checkmarks. These can either correspond to depth two fragments or fragments resulting from non-amide bond breakage. Associated spectra 386 are shown below. By switching from a string-based model to this graph based model, the score of correct radamycin match increases from 9 to 25, and the p-value drops from 3 × 10−17 to 3 × 10−46. [00171] The data processing system 102 derives a complete chemical structure graph of RiPPs from their BGC (rather than a precursor peptide along with modification masses). To enable this, the data processing system 102, based on precursor peptides, models tailoring enzymes as graph-modifiers. Each enzyme searches a particular chemical motif in the molecular graph of a RiPP, and whenever it finds the motif, it optionally applies the corresponding tailoring modification. [00172] The data processing system 102 predicts hypothetical molecules by applying modifications corresponding to tailoring enzymes present in the BGC. The data processing system 102 does this by extracting all the information from the known RiPP tailoring enzymes and their corresponding modifications. Given a core sequence and a list of tailoring enzymes the data processing system 102 predicts all the hypothetical RiPP structures using the following steps. First, the chemical structure of the core sequence is represented as a graph, where each vertex is an atom with an index number and the type of atom, and each edge is a bond between two indexed atoms and the type of chemical bonds. Next, for each modification, the location of motifs are collected by subgraph isomorphism. Finally, the putative combinations of modifications are calculated and applied to the core sequence. [00173] Breakthroughs in genome mining and mass spectrometry data collection have revolutionized the field of natural product discovery during the last decade. Development of popular genome mining tools such as antiSMASH has made it possible to quickly profile microbial genomes for detecting natural product BGCs. However, the current state-of-the-art approach for connecting BGCs to molecules is through the expression of the BGC in a heterologous host and subsequent isolation and structure elucidation of the product, which is a slow and expensive process. Therefore 99% of BGCs extracted from microbial genomes and stored in public repositories remain orphan, i.e. they are not linked to any small molecule. [00174] On the mass spectrometry front, development of the GNPS repository along with molecular networking has made it possible to annotate known natural products and discover their novel variants. Currently, GNPS hosts more than a billion mass spectra from more than five hundred laboratories. However, only 2% of GNPS spectra has been annotated as known molecules or their analogs. It has been hypothesized that a portion of the annotated spectra from GNPS is likely to correspond to orphan BGCs from microbial genomes. [00175] To fully utilize the power of recently developed repositories of microbial BGC and mass spectral datasets, computational techniques for high-throughput linking of BGCs to mass spectra are needed. However, in order to link mass spectral datasets to BGCs from microbial genomes, one first needs to predict the hypothetical structure of the molecular product of these BGCs. In order to fill in this gap, we present hypoNPAtlas, a public repository of hypothetical natural products predicted by mining microbial genomes. In the case of RiPP natural products, using the seq2ripp algorithm we populated this Atlas with 31,483,329 RiPPs from 22,671 complete microbial genomes available from RefSeq. [00176] Seq2ripp identified three novel microbial RiPPs from the PoDP datasets, and ten novel plant RiPPs from 62 plant metabolomics and transcriptomic datasets. Through variable mass spectrometry, seq2ripp discovered a plant RiPP with a novel PTM from Elaeagnaceae, highlighting the power of this method for identification of novel classes of natural products that were missed by previous approaches. Bgc2orf and orf2core modules within seq2ripp are capable of identifying RiPP precursors and cores without overfitting. [00177] Hypothetical molecules from hypoNPAtlas can be filtered to user specified taxonomic clades, and the retained molecules can be queried against mass spectral datasets using Dereplicator+, an in silico database search tool available from GNPS. Previous techniques modeled RiPPs as strings of amino acids, along with post-translational modifications. In contrast, hypoNPAtlas models RiPPs as graphs with atoms and chemical bonds, improving accuracy in representations of post-translational modifications (e.g. cyclizations). The graph model also improves the accuracy of in silico methods in predicting fragmentation of RiPPs containing nonstandard amino acids (e.g. oxazole and thiazole). [00178] Another challenge in linking mass spectral datasets to microbial BGCs is that over 99% of spectra from GNPS are collected on strains with unknown DNA sequence from complex environments. HypoNPAtlas infrastructure supports searching mass spectral datasets against taxonomic clades, allowing for novel natural product discovery from datasets without genomic information. This often results in discovery of natural products from mass spectra of one strain against the genome of a different strain. For example, radamycin was identified by searching mass spectra of a marine Streptomyces against the genome of a tomato flower symbiont Streptomyces. [00179] Currently, hypoNPAtlas reports on average 84 ORF, 1605 cores and 1753 molecules per RiPP BGC. Multiple RiPPs have been recorded because (i) many RiPP BGCs have multiple RiPP products, e.g. many cyanobactin and plant RiPPs have multiple cores per each ORF, and (ii) the activity of many RiPP enzymes remain ambiguous. By keeping track of multiple hypothetical RiPPs per BGC, we can increase the chance of capturing the correct RiPP. The natural drawback of this strategy is that majority of RiPPs in the Atlas will be spurious. By searching against mass spectral repositories, hypoNPAtlas enables identification of a large number of novel RiPPs, providing the training data for machine learning approaches to improve the prediction accuracy of post-translational modifications. [00180] The pipeline for natural products discovery by hypoNPAtlas consists of the following steps. Extracting BGCs and predicting hypothetical RiPPs. BGCs are either imported from IMG- ABC, antiSMASH-DB, and BiG-SLiCE, or mined from RefSeq/IMG-M. Hypothetical RiPPs are predicted from BGCs using seq2ripp. Filtering the Atlas using taxonomic information. The entire Atlas is too large for down- stream analysis. Users can filter the Atlas using specific taxa terms, and then download the entire hypothetical molecules in that clade in the SMILES format, along with corresponding BGCs, ORFs and cores. For example, the Atlas contains 20 BGCs, 48 ORFs, 273 cores and 120,701 molecules for the strain Streptomyces globisporus NRRL B-2709. [00181] Predicting spectra of hypothetical molecules from the Atlas. We used Dereplicator+ to pre- calculate the fragmentation graphs for all the molecules from the Atlas. The fragmentation graphs are stored as binary files, and they are also downloadable for user-specified taxonomic clades. It is much faster to search mass spectra against pre- calculated fragmentation graphs using Dereplicator+ than it is to search against raw SMILES structures. [00182] Mass spectral datasets can be searched against the SMILES structures / pre- calculated fragmentation graphs using Dereplicator+ from the GNPS infrastructure. HypoNPAtlas has been designed in a way that its interface is fully compatible with GNPS. [00183] False discovery rate (FDR) based on target decoy analysis reported by Dereplicator+ are used. Molecular networking. Currently, feature-based molecular networking supports annotations from Dereplicator+, and these networks can be visualized through the GNPS infrastructure. [00184] MetaMiner is a computational technique for discovery of novel RiPPs. The MetaMiner pipeline analyzes the paired genome / metagenome assemblies and tandem mass spectra from isolated microbes or bacterial or fungal communities. Starting from the genome assemblies, MetaMiner (i) identifies putative BGCs and their corresponding precursor peptides, (ii) constructs target and decoy putative RiPP structure databases modeling them as strings, and (iii) matches tandem mass spectra against the constructed RiPP strings using Dereplicator. [00185] Given a microbial genome sequence in fasta format, seq2ripp predicts hypothetical RiPP molecules (in the SMILES format) in the following steps. From genome to BGCs. RiPP BGCs are identified by the genome2bgc module, using 152 hidden Markov model (HMM) profiles, corresponding to frequent RiPP tailoring enzymes for various classes of RiPPs. Alternatively, seq2ripp also supports RiPP BGC identification using antiSMASH. [00186] From BGCs to ORFs. ORFs are extracted from BGCs by bgc2orf, a deep neural network (e.g., FIG. 6C). Additionally we also support the exhaustive strategy recruited by MetaMiner a BLAST-based strategy, and strategies from DeepRiPP and NeuRiPP In the exhaustive strategy, we all short ORFs with length longer than 10 aa are considered as feasible ORFs. While this strategy is very sensitive, it can result in a high number of false positives. In BLAST-based strategy, ORFs identified by the exhaustive search strategy are aligned against a database of 525 known RiPP ORFs with blastp, and those with e-value lower than 0.01 are retained. In bgc2orf, a sequence classification model is trained which includes a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) Network, based on 2,726 amino acid sequences of known RiPP ORFs and 19,224 amino acid sequences of non-RiPP ORFs. Then for each short ORF in the BGC, whether the ORF is a structural ORF or not is predicted, and the system only considers those with high probabilities (higher than 50%) as hypothetical structural ORFs. [00187] Bgc2orf is trained as follows: first, all input sequences are padded to a length of 200 amino acids. Each amino acid, including the padding symbol, is tokenized and embedded into a vector of size 100. The model includes two 1D CNNs. One CNN convolves the in- put sequence in terms of its topology, the other CNN convolves the tokenized vectors. The convolution on the sequence helps the model to detect the RiPP features on amino acids; the convolution on the token vector summarizes the embedded character information in high dimensional space. The outputs from the two CNNs and the input embeddings are concatenated and fed into a single layer bidirectional LSTM. The LSTM learns and summarizes the sequential features from the amino acid chain. The output of the LSTM is flattened and converted to a binary output with a flattened layer and a dense layer. The prediction loss is calculated by cross entropy loss during the training of the model. The learning rate begins with 1e-3 and decreases 10% every 40 epochs. [00188] ORFs to cores. Core peptides are detected from ORFs by orf2core, a deep neural network. Similar to the previous step, we also support an exhaustive strategy from MetaMiner, DeepRiPP, and a BLAST-based strategy. In the exhaustive strategy, all the peptide fragments with lengths between 3 to 30 aa are considered as candidate core peptides. In the BLAST-based strategy, the ORFs are aligned with 525 known RiPP cores by blasp, and the part of ORFs aligning with the core sequence with e-value lower than 0.001 are extracted (allowing for an error of up to two bases on each sides). In orf2core, we framed the task as a phoneme discovery problem, where the input is an amino acid sequence, and the output is putative cleavage sites. The system trains a CNN-LSTM- Conditional Random Field (CRF) model on the cleavage sites information of 3169 known RiPP ORFs. [00189] The system includes a discriminatory deep learning model to predict core and non-core frames of sequences given the amino acid sequence of an ORF as the continuous input. All input sequences are padded to a length of 200 amino acids. Each amino acid is tokenized and embedded into a vector of size 25. The model includes two 1D CNNs. One CNN convolves the input sequence in terms of its topology, and the other CNN convolves the tokenized vectors. The convolution on the sequence helps the model to detect amino acids surrounding the enzymatic cleavage site. The convolution on the token vector summarizes the embedded character information in high dimensional space. The outputs from the two CNNs, and the input embeddings are concatenated and fed into a single layer bidirectional LSTM. The LSTM learns the translation from the amino acid chain to core and non-core frames of sequences. The prediction loss is calculated by a conditional random field layer, which calculates the negative log-likelihood during the training of the model, and performs the Viterbi algorithm to optimize labels during prediction. An additional approach will be triggered when repeat patterns are observed in an ORF, by searching repeated prefix and suffix patterns in the sequence. [00190] To use Dereplicator+ for the discovery of RiPPs, the system derives the complete chemical structure graph of RiPPs from their BGC (rather than a precursor peptide along with modification masses, required by Dereplicator). To enable this, the system starts with precursor peptides and model tailoring enzymes as graph-modifiers. Each enzyme searches a particular chemical motif in the molecular graph of a RiPP, and whenever it finds the motif, it optionally applies the corresponding tailoring modification (See FIG.6F). [00191] Core2ripp predicts hypothetical molecules by applying modifications corresponding to tailoring enzymes present in the BGC. We do this by extracting all the information from the known RiPP tailoring enzymes and their corresponding modifications by literature mining and parsing them in a computer-readable format. The format includes a motif (stored as a SMILES string) along with a series of graph modifications (addition/removal of nodes and edges) that are applied to the molecular structure whenever the motif is observed, as shown in FIG.7. [00192] Given a core sequence and a list of tailoring enzymes, core2RiPP predicts all the hypothetical RiPP structures using the following steps. First, the chemical structure of the core sequence is represented as a graph, where each vertex is an atom with an index number and the type of atom, and each edge is a bond between two indexed atoms and the type of chemical bonds. Next, for each modification, the location of motifs are collected by subgraph isomorphism. Finally, the putative combinations of modifications will be calculated and applied to the core sequence (see details below). The final product will be stored in the SMILES format. [00193] Finding motifs by subgraph isomorphism. The system models the problem of searching motifs in the precursor RiPP sequence as a subgraph isomorphism problem. In the subgraph isomorphism problem, given graphs A (query) and B (subject), the problem is to determine if there is a subgraph of graph B that is isomorphic to the graph A. Here, the subject is the chemical structure of the precursor peptide sequences, and the query is the chemical structure of the motifs. For both the subject and the query, vertices are atoms, and the edges are bonds. The subgraph isomorphism problem is shown to be NP-hard. [00194] Ullman’s subgraph isomorphism algorithm is updated for motif finding, which makes it three orders of magnitude faster than the original approach. Ullman’s subgraph isomorphism algorithm builds an m by n binary correspondence matrix, where m is the number of nodes in the query, and n is the number of n does in the subject. The correspondence matrix has a 1 whenever the following two constraints hold: (i) the corresponding query and subject atoms have the same label (e.g. they are both carbon), and (ii) the multi- set of neighboring nodes for the query is a subset of the multiset of neighboring nodes for the subject. Then, for every row in the correspondence matrix, a single column containing a 1 is selected. This results in a mapping from query node indices to subject node indices. Then, the algorithm checks whether this mapping defines an isomorphism, i.e. all the edges in the query are present in the subject. Whenever this constraint is violated, the algorithm backtracks to select a new column with 1 from one of the rows. [00195] The Ullman algorithm is modified by incorporating edge-level labels: the multiset of neighboring node and edge pairs for the query is a subset of the multiset for the subject, where the edge is labelled as a single/double/triple bond. We further remove the hydro- gen atoms from both query and subject to accelerate the algorithm and avoid unnecessary computations. Moreover, by constructing the query and the subject into spanning trees, we are able to enforce that in each iteration, the atom selected from the query/subject be connected to the query/subject atoms added in the previous iterations without additional computations, as shown in FIG. 7 parts a and b. [00196] Motif-based modification. MetaMiner models RiPPs as strings of amino acids, and then modifications are applied to amino acids as mass shifts. In contrast, seq2ripp models RiPPs as graphs, and modifications are applied as graph modifications on the reaction motifs. We define modifications in a computer-readable format, using four actions: add, remove, connect, and disconnect. The add/remove actions are used for adding and removing atoms, while connect/disconnect actions are for adding and removing edges, as shown in FIG.7 parts c and d. [00197] Given a list of tailoring enzymes in the BGC, core2ripp predicts the hypothetical products by adding the different combinations of modifications on the core peptide. If there are 10 feasible modifications at different locations, this procedure produces 1024 possible products. These hypothetical structures are saved in SMILES format. [00198] FIG. 7 shows a process 700. First, at (a), the hydrogens are removed from the query and the subject. Next, the atoms are labeled by a breadth first traversal of each graph from the first non-hydrogen atoms in the input file. Finally, an adjacency matrix for each graph is prepared. At (b), only the mappings that define an isomorphism are kept. Converting known enzymatic modifications of RiPPs into a computer-readable format for dimethylation of N-terminal, e.g. in cypemycin and oxidative decarboxylation, e.g. cypemycin and epidermin. In this format, commands disconnect/connect are used (for bonds) and add/remove (for chemical substructures). For example, in part (c), “disconnect 15” removes the bond between the nitrogen atom with index 1 and the hydrogen atom with index 5, while “remove 5” removes the hydrogen atom with index 5, and “Add 1 CH3” adds a methyl group to the nitrogen atom indexed 1 (methylation). In part (d) “connect 9 142” adds a double bond between atom 9 and atom 14. [00199] Estimating the statistical significance of metabolomics annotations [00200] In case of mass spectrometry base proteomics, modern database search strategies recruit statistical methods to evaluate the matches between peptides and spectra. Currently, the dominant technique for statistical evaluation of a set of PSMs is to estimate false discovery rate (FDR) using the Target Decoy Approach (TDA). In this approach, a database of fake (decoy) peptides are searched against mass spectra in addition to correct (target) database. Then the ratio of high score matches in the decoy database to that of target database is used as an estimate of FDR. Decoy peptides are usually generated by shuffling target peptides. Methods below are discussed to generalize this method to other natural products with more diverse structures. [00201] In proteomics, identified PSMs are statistically evaluated through the computation of p-values. The p-value of a PSM is defined as the fraction of random peptides with a score equal to or exceeding the target PSM. To compute p-value, the data processing system 102 estimates the distribution of the score of random peptides against the spectrum. [00202] Natural product small molecules have diverse graph structures, such as linear, cyclic, branch-cyclic, and multiple ring structures. The process now described is configured for statistical validation of metabolomics identifications by converting match scores to p-values. This process overcomes existing technical hurdles. Current methods (such as Markov Chain Monte Carlo (MCMC)) for estimating p-values do not generalize to other types of natural products since it is not clear how to generate random small molecule structures, and how to randomly modify the structure of small molecules. To overcome this, the data processing system 102 is configured to use constrained graph variational auto- encoders (CGVAE) for random generation and mutations of small molecules. [00203] Matching score and p-values [00204] A scoring function score(M, S) for a match between molecule M and spectrum S. Let M and S denote the set of all possible molecules and spectra. Molecule- based p-value for a pair of molecule and spectrum (M, S) is defined as: [00205] where m is generated from a random uniform distributed on set and = Score (M, S). The probability p defined above depends on the choice of set M. Similarly, spectrum-based p- value is defined as: [00206] ) where s is generated from a random uniform distributed on set. Molecule-based p-values help in reducing bias toward molecular features, while spectrum-based p-values reduce bias toward spectral features. The data processing system 102 uses constrained graph variational auto-encoders (CGVAE) for generating random molecules, and also generates random spectra. [00207] Monte Carlo approach [00208] Monte Carlo simulation is a robust technique for simulating events and estimating their probabilities. Consider the set [00209] [00210] [00211] Define 1M" as an indicator function of the set M* such that 1M"(m) equals 1 if m Î M" (equivalently, Score(m) ³ t) and 0 otherwise. Let m1, . . . , mN be N independent and identically distributed (iid) random variables sampled from a uniform distribution ƒ on set M. The p-value p defined in equation (10) can be written as an expectation: [00212] where r is the sampling distribution over M. According to the strong law of large numbers, since m1, . . . , mN are iid samples, the probability p converges almost surely (e.g., with probability 1) to the following: [00213] [00214] p̂MC is an unbiased and consistent estimator of p-value p. Variance of p̂MC equals to p(1−p), which tends to 0 as N. Therefore Monte Carlo sampling is used to compute the probability p. [00215] Importance sampling [00216] Importance sampling is a general Monte-Carlo approach for reducing the variance of expectations quantities. Importance sampling generates the events of interest more often by sampling from a different distribution and correcting for the bias afterward, which results in a more accurate estimate with a lower number of samples. Formally, assume m1, .. . , mN are sampled from a distribution with density p. Then for a probability distribution q(m), using equation (16): [00217] [00218] The importance sampling estimator of p is defined as: [00219] where v1, .. . , vN are iid samples from q(v), called the importance sampling density, and w (vi) = p(vi) is the importance weight of sample vi. [00220] Markov Chain Monte Carlo [00221] The importance sampling density q has is selected in a way that relative error is minimized. The optimal density is: [00222] [00223] In this case, all generated samples fall in Ui # M", so their importance weights are w (mi) = p, and the IS estimate p̂ IS = p. In this case, just one sample (N = 1) is enough to find the probability p exactly. To estimate the optimal importance sampling density, the data processing system 102 uses an iterative strategy. In the first iteration, a uniform distribution q is used, and in each iteration, the score probability distribution P(Score(m, S) = t) is estimated, and the weight and density are updated as: [00224] [00225] [00226] Here, we assume density q and weight w are uniform for each score. Usually this task is approached by using the Metropolis-Hastings algorithm that allows for the usage of unnormalized densities. The data processing system 102 is configured to generate a Markov Chain having Q as an equilibrium distribution and obtain the desired sequence {νn} of random variables via sampling from this distribution. The ergodicity of the Markov chain ensures that as N approaches infinity, p̂ IS still converges to the target probability p. [00227] Consider a transition kernel γ(x|y) that defines a jump in M. The Metropolis Hastings algorithm realizes the desired sampling strategy as follows. Starting from a random initial point m0, in each step a new point m" is samples from γ(x|y). If w(m") > w(mi), the transition is accepted. Otherwise the transition is accepted with probability w(m")/w(m i ), and rejected otherwise. If the transition is accepted, m i is set to m", otherwise mi is set to mi−1.
[00228] There are two ways for computing the statistical significance of a molecule spectrum match. A first method fixes the molecule and computes the ratio of random spectra getting a score higher than the target match (spectrum-based p-value). A second method fixes the spectrum and computes the ratio of random molecules getting a score higher than the target match (molecule-based p-value). In case of spectrum-based p-values, the data processing system 102 generates (and modifies) random spectra based on the statistics of peaks / intensities learned from the GNPS library. In the case of molecule- based p-values, the data processing system 102 uses constrained graph variational autoencoders (CGVAE) for random generation and modification of molecular structures. [00229] Generating random spectra [00230] The data processing system 102 determines a statistical significance for generated molecule structures by producing a target-decoy database of random MS/MS spectra for small molecules. For this part the set M includes of generated spectra in order to calculate equation (10) for each pair of (molecule, spectrum). Two strategies are used to create the decoy MS/MS database. [00231] A first method to create the decoy MS/MS database is the naïve method. For the naive decoy spectral library, all possible fragment ions from the reference library of spectra are used. The data processing system 102 randomly adds these ions to the decoy spectral library until each decoy spectrum reaches the desired number of fragment ions that mimics the corresponding library spectrum. This method is presented as a baseline evaluation of the other, more intricate method. [00232] The spectrum-based method is similar to the naive method, as the data processing system 102 generates a decoy spectral library through choosing fragment ions that co-appear in the spectra from the target spectral library. In this spectrum-based approach, the data processing system 102 starts with an empty set of fragment ion candidates. First, the data processing system 102 randomly selects the precursor fragment ion of the target spectrum. For each fragment ion added to the decoy spectrum, the data processing system 102 chooses all spectra from the target spectral library which contain this fragment ion, within a threshold (=5 p.p.m). From these spectra, the data processing system 102 uniformly draws (all fragment ions have the same probability to be drawn) five fragment ions that are added to the fragment ion candidate set. The data processing system 102 draws a fragment ion from the fragment ion candidate set and add it to the decoy spectrum, then proceed as described above until we reach the desired number of fragment ions that mimics the corresponding library spectrum. The two-step process of first drawing candidates, then drawing the actual decoy spectrum is introduced to better mimic fragmentation cascades and dependencies between fragments. Furthermore, it prevents that fragment-rich spectra dominate the process. [00233] Generating Random Molecules [00234] A CGVAE model is used by the data processing system 102 to generate random molecules. The process is seeded with N vectors that together form a latent specification for the graph to be generated (N is an upper bound on the number of nodes in the final graph). The model uses a variational autoencoder to design molecules that are close to the real molecules in the nature. [00235] Fast Mass Spectrometry Searches of Untargeted Metabolomics Data using MASST+ [00236] The throughput rate of mass spectrometers and the size of publicly available metabolomics data is growing rapidly. Illuminating the molecules present in untargeted mass spectrometry data that cannot be identified by existing approaches. MASST is configured to search a query spectrum against a database and cluster families of related molecules in untargeted mass spectrometry data. MASST+ is configured to scale to searching and clustering billions of mass spectral data in large metabolomics repositories, e.g. the global natural product social (GNPS) molecular networking infrastructure. By utilizing the strategy to preprocess and sort peaks in mass spectra described in this section of the specification, MASST+ searches a query spectrum against billions of spectra in less than an hour and Networking+ maps the chemical diversity of the entirety of billions of spectra in the entire GNPS in less than a week on a single compute node. [00237] MASST and molecular networking are based on a naive approach for scoring two spectra. MASST compares the query spectrum against all of the reference spectra one by one and computes a similarity score based on the relative intensities of shared and shifted peaks. Therefore, the runtime of MASST grows linearly with the repository size. Molecular networking first uses MS-Clustering28 to cluster identical spectra by calculating the dot-product score (see FIG. 8 part a) between the spectra. Then Spectral Networking29 is used to calculate a dot product score that accounts for peaks that are shared or shifted (see Figure 8 part b) between all pairs of clusters. This latter procedure grows quadratically with the number of clusters. Current trends show that the size of public mass spectral repositories doubles every two to three years. Therefore, the current implementations of MASST and Molecular Networking will not be able to scale with the growth of future repositories. A MASST search for a single spectrum against the clustered global natural product social (GNPS) database (83 million spectra) currently takes about an hour on a single thread and a MASST search against the entire GNPS (717 million spectra) does not complete after being run for three days. A molecular networking job on a million spectra finishes in about 1.5 hours, while a molecular networking job on 20 million spectra does not yield results after running for a month. Similar to computational genomics, handling the exponential growth of repositories requires the development of more efficient and scalable search algorithms. [00238] To overcome these inefficiencies, MASST+ and Networking+, scalable methods are used for querying and clustering billions of spectra that are two to three orders of magnitude more efficient than existing methods. MASST+ and Networking+ preprocess all the peaks in all the spectral datasets and construct an indexing table that for each pair of masses $ and %, records all the spectra with precursor m/z $ that contains peak with m/z %. Then, given each query spectrum, the ExactScore / ShiftedScore against spectra in all datasets can be computed by iterating through each query peak and using the indexing table to efficiently retrieve spectra with similar peaks (Figure 2). Since mass spectra are sparse, only a small fraction of spectra/peaks are retrieved for each query, making these approaches two to three orders of magnitude more efficient than the state of the art. Moreover, these approaches support insertion of new spectra into the table without the need for recalculating from scratch. MASST+ is currently available as a web service from https://masst.ucsd.edu/masstplus/. GNPS supports stand-alone MASST+ and integration with molecular networking. [00239] FIG.8 shows a similarity score process. For part a, in a case of exact search, MASST searches a query spectrum against all database spectra with similar precursor masses, and computes ExactScore as the sum of multiplications between peaks shared by the query and database spectrum (shown in solid grey). In this case the score is 6.2 * 3.2 + 10.2 * 16.3 = 186.1. For part b, in a case of analog search, MASST searches the query spectrum against all database spectra within a specific precursor mass range (e.g. 300 Da) and computes the ShiftedScore as the sum of multiplications between peaks that are shared and Δ-shifted between query and database spectrum. Here there is one shared (solid grey) and two Δ-shifted (dashed grey) peaks, yielding a total score of 6.2 *2.2 + 10.2 * 9.2 + 15.4 * 9.2 = 249.16. Δ denotes the precursor mass difference between query and database spectra. [00240] FIG. 9 shows a MASST+ Exact Search. At step (a), given a database of spectra MASST+ starts with (b) constructing an indexing table, where each row corresponds to a fragment peak mass, and contains a list of tuples of spectra indices that contain the peak, along with the intensity of the peak in these spectra. At step (c), given a query spectrum, MASST+ retrieves all lists corresponding to peaks present in the query. Then, at step (d) MASST+ iterates over each list, and for each tuple in the list, adds the product of the intensity of the query and database peaks to the total ExactScore of query and database spectra. [00241] Given a query spectrum, MASST+ searches a database of reference spectra to find similar entries by creation of an indexing table – a data structure which allows rapid retrieval of similar spectra based on the peaks present in the query spectrum. For each precursor mass $ and each peak mass %, a list of indices of spectra with precursor $ and peak % are stored, along with the intensity of the peaks. In case of exact search, MASST+ iterates through the peaks in query spectrum, and retrieves the lists corresponding to the peaks and precursor masses, within a tolerance threshold (e. g. 0.02Da in case of high- resolution data). The ExactScore is calculating by multiplying and adding up the intensity of each peak in query spectrum and reference spectra, shown in FIG. 9. In case of analog search, MASST+ uses a much larger precursor mass tolerance (e. g.300Da) and computes ShiftedScore that takes into account both shared and Δ-shifted peaks (peaks in reference spectra that are Δ Da larger than peaks in query), where Δ is the mass difference between the precursor of query and reference spectra, shown in FIG. 8. [00242] Networking+ algorithm clusters spectral datasets into families of related molecules by first putting spectra from identical molecules into the same clusters (Clustering+), then forming the centers of each cluster by taking their consensus, and then connecting the clusters that are predicted to be generated from related molecules (Pairing+). Clustering+ iterates over all spectra, and puts each spectrum in the cluster that is most similar. It uses a strategy similar to MASST+ exact search for efficiently calculating the SharedScore between the spectrum and each cluster center. Pairing+ uses a shared and Δ-shifted dot-product as similarity measure for identifying related spectra. It uses a strategy similar to MASST+ analog search to find all pairs of clusters with high ShiftedScore. [00243] Table 5: Benchmarking MASST+ search. [00244] For all queries, MSV000078787 (5.4k spectra), clustered GNPS (83.1M spectra), or entire GNPS (717.4M spectra) are used as the reference database. Search time, search memory consumption, and number of identifications are shown. For MSV000078787, clustered GNPS, and entire GNPS, MASST+ is two orders of magnitude faster than MASST while achieving the same or better memory consumption. MASST search did not yield results for entire in a reasonable time frame (three days threshold). In other cases, MASST+ reports are identical to MASST. [00245] MASST+ is benchmarked on various GNPS datasets including the dataset MSV000078787 collected on Streptomyces cultures (5,433 spectra), clustered GNPS (83,131,248 spectra), and entire GNPS (717,395,473 spectra). While MASST and MASST+ report identical hits, MASST+ is two orders of magnitude faster and as memory efficient (Table 5). In case of the clustered GNPS, MASST+ performs analog search in 54 seconds while MASST takes over one hour. In case of entire GNPS, MASST+ performed analog search in under two hours on average, while MASST search did not finish after three days. [00246] FIG. 10 illustrates the runtime and memory consumption of MASST+ in exact and analog mode for various subsets of the clustered GNPS. Indexing time and memory consumption grows linearly with the size of datasets. MASST+ takes eight hours of compute time and eight gigabytes of memory to index ~83.1 million spectra from the clustered GNPS. MASST+ is used for annotation of mass spectral datasets containing unknown molecules by searching them against 140M annotated spectra in ReDU repository. [00247] FIG. 10 shows graphs. Graph (a) shows MASST+ is two orders of magnitudes faster than MASST in exact and analog search for various database sizes. Graph (b) shows MASST+ outperforms MASST in memory efficiency. [00248] FIG. 11 shows graphs. Graph a) shows clustering+ is over two orders of magnitude faster than MS-Clustering. Graph b) shows pairing+ is over two magnitudes faster than Spectral and Networking. Graph c) shows networking+ is over two magnitudes faster than Molecular Networking. [00249] Lanthipeptides are a biologically important class of natural products that include antibiotics30, antifungals31, antivirals32, and antinociceptives33. Lanthipeptides are structurally defined by the thioether amino acids lanthionine, methyllanthionine and labionin. Lanthionine and methyllanthionine are introduced by dehydration of a serine or threonine to generate a dehydroalanine or dehydrobutyrine and addition of a cysteine thiol, catalyzed by a dehydratase and a cyclase, respectively34. During lanthipeptide biosynthesis, a precursor gene lanA is translated by the ribosome to yield a precursor peptide LanA that consists of a N-terminal leader peptide and a C- terminal core peptide sequence. The core peptide is post-translationally modified by the lanthionine biosynthetic machinery and other enzymes, proteolytically cleaved from the leader peptide to yield the mature lanthipeptide and exported out of the cell by transporters. [00250] Currently methods for high-throughput discovery of lanthipeptides through computational analysis of genomics and metabolomics data suffer from various limitations. Lanthipeptides usually possess specific network motifs that enables mining them in spectral networks. These motifs include mass shifts of -18.01Da (H2O mass) that corresponds to alternative number of dehydrations, and mass shifts equal to amino acid masses that corresponds to promiscuity in N-terminal leader processing. The system clustered the entire GNPS (717 million scans) using Clustering+, and formed the network using Pairing+. This resulted in 8.5 million clusters, and Y connected components with a total of 17 million edges. In order to mine this network for lanthipeptides, we focused on edges from a subset of 500 Streptomyces cultures with known genomes (Supplementary Table S4). 9,410,802 scans clustered into 354,401 nodes, 6,032 connected components, and 1,265,311 edges.29,639 nodes are retained that possess the network motif (connected to an edge with mass difference equal to a loss of H2O, NH3, or an amino acid mass). The system mined for nodes with long amino acid sequence tags of length 1235.There are a total of 2,353 nodes with sequence tags of length 12 or longer, and 285 of these nodes are connected to an edge with mass difference equal to H2O or an amino acid loss. The system inspected these nodes using our in-house software algorithm, Seq2RiPP which given a lanthipeptide precursor, generates all the possible candidate molecules considering different cores and various modifications, and then search the candidate molecular structures against mass spectra using Dereplicator36. This strategy identified three known and 14 novel lanthipeptides with p-values below 1e-15 (Table 6). [00251] Table 6 shows novel and known lanthipeptides discovered by network motif mining. The producer organism, name, sequence, Dereplicator score and p-value, mass and references are shown.
[00252] FIG. 12 shows one of the peptides (CHM-1731 from Streptomyces albus). Part (a) shows a biosynthetic gene cluster of CHM-1731. Part (b) shows an annotation of peaks in mass spectra of CHM-1731. B-ions (prefix fragmentations) are shown, and y-ions (suffix fragmentations) are shown. Part (c) shows a mass accuracy of annotations are shown in parts per million (ppm). Stars stand for dehydrated serine / threonine. [00253] An overview of MASST algorithm is now described. In an exact search mode, MASST performs exact search by retrieving the spectra in the database that have the same precursor mass as the query and computing SharedScore between each retrieved spectrum and the query. Analog search is conducted by retrieving all spectra using a large precursor mass tolerance (e.g.300 Da), and computing the ShiftedScore. To compute these scores, MASST iterates over all the peaks in the query spectrum, and for each peak it explores whether a peak with similar or shifted m/z is present in each database spectrum. Whenever such a peak is present, MASST increments the score between the query and the database spectrum by the product of the intensity of peaks in the query and database spectrum. [00254] MASST+ exact search. Given a query spectrum, MASST+ efficiently searches a database of reference spectra to find similar spectra by creation of an indexing table a data structure which allows rapid retrieval of similar spectra based on the peaks present in the query spectrum. For each precursor mass $ and each peak mass %, a list of indices of spectra with precursor $ and peak % are stored, along with the intensity of the peaks. In case of exact search, MASST+ iterates through the peaks in the query spectrum, and retrieves the lists corresponding to the peaks and precursor masses within a tolerance threshold of query (e. g. 0.02Da in case of high-resolution data). The SharedScore is calculated by multiplying and adding up the intensity of each peak in query spectrum and reference spectra. [00255] MASST+ analog search. In case of analog search, MASST+ use a much larger precursor mass tolerance (e. g. 300Da) and computes ShiftedScore. ShiftedScore takes into account both shared and Δ-shifted peaks, where Δ is the mass difference between the query and reference spectra. In analog mode, MASST+ iterates through each peak % in the query spectra with precursor mass $, and scans lists ($′, %′) where either % = %′ (shared scenario) or $ − % = $′ − %′ (shifted scenario). The ShiftedScore is calculated by multiplying and adding up the intensity of shared and shifted peaks in query spectrum and reference spectra (Supplementary Figure S8). [00256] To find structurally related families of small molecules, molecular networking first clusters spectra from identical molecules using MS- Clustering29, and then connects clusters of related molecules using spectral networking28. MS- Clustering puts two spectra in the same cluster if their precursor mass difference is below a threshold (usually 2 Da) and their cosine dot product (a normalized SharedScore) is above a certain threshold (usually 0.7). Then for each cluster, a consensus spectrum is constructed. In spectral networking, two consensus spectra are connected to each other if the shared-shifted cosine score (a normalized ShiftedScore) is above a threshold (usually 0.7). [00257] Networking+ includes two modules, Clustering+ and Pairing+. Clustering+ is implemented using a greedy procedure, wherein each iteration the similarity score between each spectrum and all the existing cluster centers are calculated (initially there are no clusters). To efficiently calculate the similarity score between the spectrum and all clusters, an indexing table similar to MASST+ exact search is constructed and iteratively updated. The indexing table stores the list of all clusters with a specific precursor mass $ and peak mass %. Whenever, the highest score between the spectrum and clusters is greater than a threshold, the spectrum is added to the maximal cluster, and its center is updated. If the highest score is below the threshold, then a new cluster is created consisting of the current spectrum (set as the center). This procedure continues till all the spectra are processed. [00258] Pairing+ computes a score similar to MASST+ analog search (Supplementary Figure S8) that accounts for Δ-shifted and shared peaks for all pairs of input spectra (e.g. cluster centers from clustering+). To do this, it constructs an indexing table similar to MASST+ analog search. Then the table is used to efficiently compute the score between all pairs of spectra. [00259] FIG. 13 shows process 1300. At step (a), given a spectral dataset, Clustering+ performs the following steps. Starting from (b) the existing cluster centers, (c) an indexing table is constructed. Then, (d) lists corresponding to each peak in the spectrum are retrieved from the indexing table, and (e) the ExactScore between query spectra and each cluster is calculated by adding up the product of intensity of peaks in query and the cluster. (f) If the maximum score is higher than a threshold, the spectra is added to the maximal cluster. (g) Otherwise, a new cluster is created, with the spectra as its center, and then (h) the indexing table is updated. [00260] FIG.14 shows a process 1400. Starting from (a) reference spectra, Pairing+ first (b) constructs the indexing table. Then (c) lists corresponding to each peak are retrieved from the indexing table, and (d) pairwise ShiftedScore between each pair of spectra are calculated. Here, only shared peaks are highlighted for the simplicity. Then (e) a similarity matrix is formed between all pairs of spectra, and (f) molecular network is formed by retaining all the pairs with ShiftedScore exceeding a threshold (e. g. 0.7). [00261] FIG. 15 shows a graph 1500 showing a growth of the GNPS database size since 2015. The size of the public GNPS database is projected to contain a billion spectra by the year 2026. [00262] FIG. 16 shows graphs 1600a and 1600b. MASST + indexing memory is at graph 1600a. Run time is shown at graph 1600b as database size grows. Both runtime and memory grow sub-linearly (linear growth shown on dashed line). On the clustered GNPS, MASST+ requires eight hours of and eight gigabytes of memory. Note that indexing need only be performed once for each database. [00263] FIG.17 shows a process 1700 for MASST+ analog search. At step (a) given a reference mass spectral database, MASST+ (b) constructs multiple indexing tables, each corresponding to spectra of a certain precursor mass. Each indexing table contains rows corresponding to a specific precursor mass, and each row contains a list of spectra in which a specific fragment peak is present, along with the intensity of the fragment peak in those spectra. Then (c) given a query spectrum, MASST+ iterates through each indexing table, and retrieves rows correspond to query peaks and Δ-shifted version of query peaks, where Δ is the precursor mass difference between the query spectrum and the indexing table. Then (d) MASST+ computes the ShiftedScore by adding up the product of the intensity of query peaks and database peaks. [00264] FIG. 18 shows an image 1800 of MASST+ Index Visualization. The MASST+ indexing table corresponds a two-dimensional grid, with precursor mass on the x-axis and peak mass on the y-axis. Each database peak is inserted into a list corresponding to a specific location in the grid, determined by the peak mass and the precursor mass. In exact search, for each query peak only the list in a single cell will be retrieved (highlighted with green circle). For analog search, red cells (corresponding the shared peaks) and blue cells (corresponding to Δ-shifted peaks) are retrieved. [00265] Table 7 shows data for Benchmarking Clustering+ and MS-Clustering runtimes for various sizes of spectral datasets (runtimes are shown in seconds). The cases where the search did not yield results within 24 hours are shown with N/A. [00266] Table 8 shows data for Benchmarking Pairing+ and Spectral Networking runtimes for various sizes of spectral datasets (runtimes are shown in seconds). The cases where the search did not yield results within 24 hours are shown with N/A. [00267] Table 9 shows a comparison of Molecular Networking and Molecular Networking+ runtimes for various sizes of spectral datasets (runtimes are shown in seconds). The cases where the search did not yield results within 24 hours are shown with N/A.
[00268] Table 10 shows a List of MassIVE datasets mined for lanthipeptides.
[00269] Table 11 shows a number of nodes for different tag lengths in the network of microbial datasets with known genomes from GNPS. The second column shows the total number of nodes, while the third columns show the nodes that possess the network motif. [00270] Pathogen-oriented platform for large-scale discovery of drug-like natural products discovers novel antifungal targeting urgent- threat drug-resistant Candidiasis. [00271] More than half of all drugs approved by the Food and Drug Administration are derived from bioactive natural products. Natural products are produced through complex biosynthetic pathways that are optimized through millions of years of natural selection that lead to a vast structural diversity. Due to their massive chemical and functional diversity, natural products are especially an excellent source for finding new treatments for orphan and drug-resistant pathogens. However, their complex biosynthesis and structure present a significant challenge for the discovery efforts in this field and curbed the pace of natural product discovery in the past decades. The urgent demand for new drugs targeting the orphan cancer types and the emerging drug-resistant bacterial, viral, and fungal infections, highlights the need for new discovery platforms that can target such pathogens. In this study we focused on non-ribosomal peptides (NRPs) that include many FDA-approved anti-infective and anti-cancer drugs. [00272] Natural products discovery in omics era. High-throughput omics technologies revolutionized the state of natural product analyses and created new opportunities for drug discovery. Tandem mass spectrometry (high-throughput metabolomics) proved itself as an inexpensive, scalable, and sensitive technology to rapidly analyze small molecules in microbial isolates and (environmental / host-oriented) communities. Currently, public repositories such as Global Natural Product Social (GNPS) molecular networking contain tens of thousands of datasets representing thousands of unknown microbial natural products. On the other hand, advances in (meta)genome sequencing enabled the sequencing of thousands of natural-product-producing biosynthetic gene clusters (BGCs, the chromosomally adjacent set of genes synthesizing microbial natural products), which are available through National Center for Biotechnology Information (NCBI) and the Joint Genome Institute (JGI) repositories. However, only 640 out of 500,000 BGCs in public genome assemblies are connected to their small molecules. Therefore, there exists a large gap between the number of sequenced BGCs and the number of BGCs with characterized natural product molecules that represent a goldmine for discovering novel bioactive molecules. [00273] Challenges of genome mining methods for discovering novel natural products. Currently, several end-to-end discovery platforms exist that acquire genomics data for discovering novel natural product molecules. These technologies all tend to first leverage genomic data rather than analytical chemistry to identify BGCs. Genome mining methods use information from known BGCs with already characterized molecules, to predict the structure of the molecules encoded by novel BGCs. Currently, various methods exist for identifying the BGCs from genomic data and predicting the potential structures of their encoding NRPs. However, the existing genome mining methods are plagued by low accuracy and resulted in less than 4% accuracy for NRP prediction (using Tanimoto coefficient threshold >90% between true and predicted structures). Additionally, large- scale genome mining experiments often identify thousands of BGCs where each BGC can encode various biosynthesis pathways, leading to millions of potential hypothetical molecules. Since only a very small portion of these potential molecules are the bioactive NRPs that are secreted by the organisms, additional analyses are necessary to find the needle in this haystack. Therefore, methods that only rely on genomics data often recruit wet- lab-intensive, cumbersome, costly, and one-off methods that usually are time- and labor- intensive8. In the past decade, several end-to-end platforms have been introduced for discovered novel NRP molecules that are secreted by the organism. [00274] Discovering novel peptides using peptide synthesis. Multiple methods use genome mining for finding BGCs with no tailoring modifications and then synthesizing the peptides predicted by genome mining. This approach is especially useful in the case of BGCs that remain silent in laboratory conditions. However, it requires perfect prediction of the final NRPs from the genomic data therefore cannot be applied to NRPs with unknown tailoring modifications or novel chemistry that cannot be predicted from the genome. [00275] Discovering novel natural products using heterologous expression. In the past few years, several end-to-end discovery platforms have been introduced for characterizing the secreted novel molecules using heterologous expression techniques. These platforms start by selecting a small pool of sequenced BGCs and then expressing them in a host organisms using synthetic biology, followed by isolation and nuclear magnetic resonance (NMR) spectroscopy. While this method allows one to discover novel molecules even when the genome-driven predictions are not perfect, but it faces multiple challenges that hinder its utility for large-scale novel discovery. Although many efforts have been made to develop optimized heterologous hosts for BGC expression, there is still no host that can work for molecules from various producers. Hence, the optimal host for each BGC should be explored using trial and error. Additionally, many unknown BGCs are still silent in heterologous hosts, requiring extensive genetic engineering efforts for their activation. Furthermore, due to large sizes of NRP BGCs spanning 10–200 kb, cloning them is often a technically challenging and time-consuming process, thus forming a major bottleneck in the overall process of heterologous expression. Due to these challenges, heterologous expression for natural product discovery is currently not suitable for large scale discovery of novel molecules. [00276] Existing metabolomics-based methods for NRP discovery. Some of the existing approaches leverage metabolomics data from microbial samples to characterize novel molecules. Goering et al introduced a discovery platform that focuses on analyzing the co-occurrence of isolated microbes and molecules. However, this approach does not leverage the structural information provided by the biosynthetic gene clusters in an automated fashion and is limited to cases when the encoding BGC and molecules co-occur in multiple organisms. [00277] Limitations of NRPminer. In addition to the methods focusing on correlations, metabolomics data can be used to directly characterize the novel molecules by searching the spectral datasets against the genome-driven hypothetical structures. NRPminer4, the only existing method for scalable NRP characterization uses metabolomics data. NRPminer suffers from several limitations and cannot deliver pathogen-specific drug-lead discovery. Firstly, NRPminer relies on antiSMASH for genome mining, leading to low accuracy (<1% accuracy in predicting NRP structures with Tanimoto similarity threshold %90 between true and predicted structures). Moreover, NRPminer models NRPs as strings of amino acids (aa's), significantly limiting its power in handling more complex fragmentation patterns beyond amide bonds in mass spectrometry and reducing sensitivity. Furthermore, while many drug-like NRPs are extensively modified, NRPminer does not support post-assembly modifications. Additionally, NRPminer cannot predict the bioactivity of the identified molecules. Finally, NRPminer is currently slow and cannot search billions of spectra that are present in public datasets or are often generated in large-scale microbial projects. Due to these challenges, NRPminer was not able to find any novel molecules active against any human pathogens. [00278] In silico prediction of natural products' bioactivity from BGCs. Finding molecules active against certain pathogens is an urgent need. Therefore, assessing bioactivity of the identified molecules is an essential step in discovery efforts. However, since the experimental assessment of the bioactivity of the natural products encoded by all sequenced BGCs is infeasible, methods for predicting the activity of natural products from BGCs are necessary. Existing machine learning methods rely on genomic data and the list of enzymes on a given BGC. These methods use BGCs with already characterized molecules as training data. However, since there is a limited number of natural products that are connected to their BGCs, relying only on genomic features result in poor performance (<1% true positives at 10% false positive rate across broad antifungal, antibiotic, and anticancer outcomes). Furthermore, these methods cannot predict bioactivity at pathogen-level, rendering them ineffective for finding molecules active against specific pathogens of interest. Finally, these approaches cannot predict the potential mode of action (MOA), and it is currently not possible to identify the molecules with known mode of actions, to clear the way for focusing on novel MOAs. [00279] NPDiscover platform for discovering bioactive NRPs. To address these challenges, we developed a new platform to identify novel bioactive NRPs. NPDiscover uses novel algorithms to accurately predict the hypothetical structures of mature NRPs, improving the state- of-the-art by three folds. NPDiscover is the first scalable method that can incorporate tailoring enzymes present in the BGCs. NPDiscover uses a novel clustering approach to combinatorically apply the identified tailoring modifications and efficiently predicts possible mature structures. Additionally, NPDiscover is the first tool that can identify active NRPs with a novel chemistry (for example a novel monomer or a novel tailoring modification) by rapidly analyzing mass shifts in the spectral matches across billions of spectra, using a variable search strategy. Afterwards, NPDiscover uses a new Markov Chain Monte Carlo strategy to efficiently calculate statistical significance of the identified NRP-spectrum matches. It then uses a machine learning method that utilizes structural features of statistically significant predictions to predict their potential bioactivity and MOA. After these computational analyses, NPDiscover provides detailed information for each identified NRP, including molecular mass, retention time, structural and activity predictions, predicted MOA, and genomic loci. This information allows the investigators to directly isolate and purify the molecules of interest from using a mass- guided HPLC-purification approach. [00280] Discovery of novel antifungal NRP that is active against urgent-thread multidrug- resistant fungal pathogens. The system demonstrated the power of NPDiscover in identifying novel NRPs that are active against drug-resistant and orphan pathogens, we searched datasets from 119 Actinobacteria strains. Our downstream experimental analyses confirmed the structure and bioactivity against drug-resistant candida auris predicted by NPDiscover. Moreover, as predicted by NPDiscover, the molecule showed no toxicity against healthy human cells. Candida auris is an urgent drug-resistant fungal threat due to its rapid global emergence, high mortality, and persistent transmissions. Infection with Candida auris is associated with high mortality rates, and it is often resistant to multiple classes of antifungal drugs, currently leaving patients with only highly toxic treatment options such as amphotericin. Despite this rapid global spread and resistance, no new antifungal agent has been introduced for this disease since its emergence. These findings signify NPDiscover as a large-scale screening and discovery NRP discovery platform of drug-leads for urgent-threat pathogens and/or orphan diseases. [00281] A machine-learning-empowered platform NPDiscover is generated for discovering novel NRP molecules active against a given set of pathogens of interest. As shown in FIG.19, a process for a NPDiscover pipeline starts from microbial samples, first, (a) genomic DNA is sequenced, and (b) natural product BGCs are extracted. Then (c) Hypothetical chemical structures produced by BGCs are predicted, and (d) their fragmentation pattern in mass spectrometry is predicted. (e) Mass spectra are collected from crude extracts of microbial cultures, and (f) mass spectra are searched against predicted spectra to identify correct structure of natural products. Then the bioactivities of these natural products are predicted. (g) Bioactive molecules with novel chemistries are purified using HPLC- based methods, and (h) their bioactivities and toxicities are characterized. [00282] NPDiscover accurately predicts the hypothetical 2D structure of mature NRPs from their BGCs. NPDiscover starts by automatically annotating the BGCs. It then finds the core NRP products using novel machine learning methods that predict the specificity of amino acids based on identified domains on the BGCs. It then applies enzymatic tailoring modifications (post- assembly modifications) upon the reconstructed core NRPs. We conducted literature mining for 570 NRPs with known gene clusters reported in the MIBiG database30. For each such modification, we aligned all the homologous enzymes (from different gene clusters) that were predicted to be responsible for that modification and constructed a Hidden Markov Model profile. We further stored each such modification in a computer-readable format that includes a chemical motif where the modification occurs (i.e. reaction sites), along with a series of node and bond alterations that tailor the motif to its mature structure. [00283] Using the collected modifications, NPDiscover starts by (i) finding all the graph isomorphisms between the motif of modifications for which the corresponding enzymes are present in the BGC and the generated core NRPs using Ullman algorithm31. Since, NRP BGCs with tailoring enzymes and promiscuous amino can lead to many biosynthesis pathways, NPDiscover uses a novel clustering algorithm to efficiently predict the final structures. After finding the modification enzymes, NPDiscover it clusters the modifications based on the position of the motif on the core NRPs. For each cluster, we use a representative motif to curb the total number of potential molecules without any accuracy, (iii) after this selection, the system iteratively goes through each of these motif sites and consider whether they are altered or not. To evaluate the performance of this novel clustering-based approach in NPDiscover, NRP BGCs in MiBIG database are analyzed using NPDiscover. [00284] To evaluate the performance of NPDiscover for predicting hypothetical NRP structures from NRP BGCs, its genome mining module is compared against antiSMASH and PRISM methods using all NRP BGCs in MiBIG database that resulted in at least one structural prediction by all three programs. Then, for each BGC in this database, we calculated the maximum Tanimoto similarity coefficient between the true and predicted structures. FIG. 22 shows the number of NRP structures predicted by each method at different Tanimoto similarity coefficients. NPDiscover was able to reconstruct 18 molecules out of 145 with Tanimoto coefficient above 90%. In contrast, the PRISM only predicted six NRPs and antiSMASH failed to predict any NRPs, at this threshold. [00285] FIG. 20 shows a graph 2000 comparing accuracy levels in predicting of NRP structure from NRP BGCs by NPDiscover, PRISM, and antiSMASH. The plot shows the distribution of the Tanimoto similarity coefficients between true and predicted structures for the subset of MiBIG NRP BGCs with at least one predicted structure by all three programs (total 145 BGCs). At each point x on X-axis, the Y-axis shows the total number of NRP BGCs for which the maximum Tanimoto coefficient between true and predicted structure exceeds x. NPDiscover shows a three-fold improvement in the number of 2D structures compared to the state-of-the-art (PRISM4) at Tanimoto coefficient threshold 90%. [00286] NPDiscover accurately identifies the mature NRP metabolites with the scalable search of metabolomics data against predicted 2D structures. [00287] NPDiscover's performance is evaluated for matching spectral and genomic data for NRP identification against NRPminer on a dataset of 509 Actinobacteria strains (referred to as Actinobacteria dataset) whose genomics and metabolomics data were available via NCBI and GNPS, respectively. Table 12 shows the list of all known NRP families identified in the metabolomics dataset using the molDiscovery search against PubChem in which BGCs were also identified by antiSMASH analysis against MiBIG BGCs. This table shows that out of six known NRP-BGC pairs in this dataset, NPDiscover, identified five of them correctly without any prior knowledge of them using P-value threshold 10 -15 . NPDiscover failed to identify any of the molecules in Actinomycin NRP family due to its failure in identifying one of the tailoring modifications necessary for predicting its structure (the enzyme involved in this modification was not present in our list of HMM profiles). In contrast, aside from surugamide, NRPminer failed to predict any of the remaining five molecules due to its inability to predict the tailoring modifications. [00288] Table 12 shows six NRP families identified in Actinobacteria dataset. For each NRP family, the identified producer organism and the MiBIG BGC ID is listed. Column "P-value" shows the lowest P-value generated by NPDiscover among all NRP- spectrum matches for that family (without any prior knowledge of them) and "Charge" shows the charge of the spectrum resulting in the listed P-value. Column "Tanimoto" shows the Tanimoto similarity coefficient between the true and the NPDiscover-predicted structure that resulted in the listed P-value. NPDiscover failed to identify any of the NRPs in Actinomycin family. [00289] FIGS. 21, 22 and 23 illustrate the NPDiscover processes 2100, 2200, and 2300 for identifying lipopeptide CDA, glycopeptide Mannopeptimycin, and Cyclomarin NRP families, respectively. FIG. 21 shows process 2100 for identification of CDA in Actinobacteria dataset. (a) CDA BGC annotated by NPDiscover is shown on the top. In additional to the main biosynthesis genes, NPDiscover identified tailoring enzymes corresponding to hydroxylation and macrolactonization in this BGC without any prior knowledge of them. The chemical structures below the BGC track, show how NPDiscover applies the tailoring modifications step-by-step for predicting the final mature CDA molecule. (b) The spectrum matched to the mature CDA structure by NPDiscover. The top ten most intense peaks are shown with their masses. The CDA fragments identified by NPDiscover that are matching these peaks are shown above the spectrum. [00290] FIG. 22 shows process 200 for identification of Mannopeptimycin in Actinobacteria dataset. (left) Mannopeptimycin BGC annotated by NPDiscover is shown at the top. In addition to the main biosynthesis genes, NPDiscover identified genes corresponding to four tailoring enzymes including macrolactonization, methylation, and mannosylation in this BGC. The chemical structures below the BGC track show how NPDiscover applies the tailoring modifications step-by-step for predicting the final mature Mannopeptimycin molecule. (right) Annotation of the spectrum matched to the mature Mannopeptimycin structure by NPDiscover is shown at the bottom. The top ten most intense peaks are shown with their masses. The Mannopeptimycin fragments identified by NPDiscover that are matching these peaks are shown above the spectrum. [00291] FIG. 23 shows a process 2300 for identification of Cyclomarin in Actinobacteria dataset. (left) Cyclomarin BGC annotated by NPDiscover is shown on the top. In addition to the main biosynthesis genes, NPDiscover identified a tailoring enzymes on this BGC shown without any prior knowledge of them. The chemical structures below the BGC track show how NPDiscover applies the tailoring modifications step-by-step for predicting the final mature Cyclomarin molecule. (right) Annotation of the spectrum matched to the mature Cyclomarin structure by NPDiscover. The top ten most intense peaks are shown with their masses. The Cyclomarin fragments identified by NPDiscover that are matching these peaks are shown above the spectrum. [00292] Discovery of novel antifungal natural product CHM-752 active against drug-resistant Candida aureus is performed using the end-to-end platform NPDiscover. In addition to the known NRPs, NPDiscover identified a novel NRP families in Streptomyces sp. NRRL F-2202 in silico search of Actinobacteria metabolomics datasets against 2D structures predicted from novel BGCs in in their paired genomes, with lowest reported P- value 3.1 × 10 -15 across the family. NPDiscover further analyzed the identified NRPs to find those with potential activity against seven microbial pathogens of interest. To do so a database of minimum inhibitory concentration (MIC) levels is collected across 4,803 small molecules against seven bacterial and fungal pathogens, which include Staphylococcus aureus, Pseudomonas aeruginosa, Acinetobacter baumannii, Candida albicans, Klebsiella pneumoniae, Cryptococcus neoformans, and Escheria coli by from various sources37,38. NPDiscover identified one of the predicted structures as potentially active against the Candida pathogen. It identified 13 moieties in this NRP, including one 2,3- dihydroxybenzoyl, three L-Gly, two L-Thr, one D-Ser, one D-Leu, one L-Val, one D-His, one D-Tyr, one D-hydroxy- formyl-ornithine, and one cyclic hydroxy ornithine. Additionally, NPDiscover predicted a blind modification (not predictable from BGC) in this structure with mass 57.02 Da. NPDiscover's machine-learning-based MOA prediction showed that CHM-752 does not represent any of the essential substructures appearing in the known molecules active against Candida, potentially suggesting a novel MOA. [00293] As part of the platform, the m/z and retention time of the matched spectrum are used as reported by NPDiscover to directly isolate and purify the identified molecule. 12mg of CHM-752 is isolated. NMR structure elucidation confirmed that NPDiscover predicted all monomers aside from a single amino acid that was out of 13 moieties (Tyr instead of Gln). FIG. 24 shows the structure elucidated by NMR experiments. The NMR experiments confirmed presence of an additional Gly in the final structure of the molecule compared to the structure predicted by NPDiscover which matched the addition of 57.02 at the modification site predicted by NPDiscover. We hypothesize that one of the Gly- specific adenylation domains is responsible for the activation of two consecutive Gly units in the final mature NRP accounting for the 57.02 mass shift predicted by NPDiscover, suggesting an iterative use of the Gly-incorporating module (similar to stuttering observed in polyketide synthases. [00294] Direct bioactivity screening experiments of CHM-752 are performed against Candida pathogens (Table 13). These experiments showed that CHM-752 exhibits strong antifungal activity against multidrug-resistant Candida pathogens while no toxicity was observed against healthy human cell lines at concentrations up to 64 ug/ml. [00295] Table 13 shows minimum inhibitory concentration (MIC) of CHM-752 against a panel of Candida pathogens (ug/ml). CHM-752 is capable of inhibiting Candida strains that are resistant against Azoles, Echinocandins, and Amphotericin B. CH, CAS, ANI, MCF, FLC, VRC and AMP represent CHM-752, Caspofungin, Anidulafungin, Micafungin, Fluconazole, Voriconazole and Amphotericin B. CHM-752 showed no toxicity against human cell lines at concentrations up to 64ug/ml. [00296] An interpretable machine learning approach to identify mechanism of action of antibiotics [00297] With hundreds of millions of known molecular structures available in molecular libraries, methods for prediction of bioactivity solely based on chemical structure can aid in selecting promising molecules active against targets of interest for downstream bioactivity testing. [00298] One of the main bottlenecks of the existing approaches is that they usually report hundreds/thousands of molecules, where the majority of them possess known mechanisms of action. Overcoming antibiotic resistant pathogens crucially depends on finding small molecules with novel mechanism of action, and currently determining the mechanism of action for small molecules remains an expensive and time-consuming effort. Therefore, it is crucial to develop computational methods for determining molecules with known mechanisms of action and prioritize molecules with novel mechanisms. [00299] Mechanism of action of small molecules are usually linked to their bioactive moieties. One way to extract these moieties would be to find features of the molecule graph that correlate to bioactivity. Methods such as recursive feature elimination, boruta, and lasso have been developed for this purpose, but they are limited to cases where a feature set is available. [00300] Another method to find bioactive moieties is to determine the portion of a molecular graph that a D-MPNN uses to make a prediction. Several heuristic approaches have been developed in order to interpret graph neural networks. One approach is to take the gradient of neural networks with respect to the atoms in the molecular graph and to attribute atoms with more importance if the gradient value for an atom is large. The set of atoms determined to be important by this approach, however, are not necessarily biologically relevant as often a large portion of the molecular graph is flagged as important. Furthermore although gradient methods have had some empirical success, the gradient only represents how the model changes with small perturbations, and high gradient values for atoms do not necessarily mean those atoms are important for classification by a neural network. Another approach for interpreting graph neural networks is to exhaustively search all subgraphs of a molecular graph and find those subgraphs that are either subsets of important nodes as determined by the gradient method or those that do not change the output of the neural network significantly. These methods again often fail in capturing reasonable bioactive moieties as they highlight subgraphs that are common in the molecular space. [00301] Interpreting which substructures are responsible for bioactivity is a challenging problem for the existing algorithms, as there are an exponential number of substructures of molecular graphs, and it is impossible to correctly infer which of these millions of substructures are responsible for activity from a few thousand training points. One way to overcome this issue is to limit the candidate substructures to those that are biologically important, including simple ring structures and functional groups. This knowledge however has rarely been integrated in machine learning methods for drug discovery. [00302] This section of the specification describes an interpretable machine learning model by first identifying the simple ring structures and functional groups in the training data and using them to create binary feature vectors for each molecule where zeros and ones indicate absence/presence of rings and functional groups. Using simple rings and structures as features is advantageous since it is easier to interpret the correlations between these features and mechanism of action in the downstream analysis. Then a logistic regression or extra trees model is trained with balanced scoring on these features in order to create a low complexity model that accounts for imbalanced data. The machine learning model clusters molecules based on their mechanism of action. Moreover, the method can associate a bioactive molecule with its bioactive moiety, providing a strategy for prioritizing molecules with novel mechanism of action. Application of our method to the Community for Open Antimicrobial Drug Discovery (CO-ADD) and a FDA-approved dataset of antibacterial and antifungal bioactivites of several thousand molecules assigned five known mechanism of actions to their moieties. [00303] FIG. 25 shows a process for predicting bioactivity of small molecules. Given (a) a collection of molecules (b) all unique simple ring structures and functional groups are extracted into binary vectors where 0/1 indicates absence/presence of a substructure. Then, (c) extra trees/logistic regression classifier with ℓ1 regularization is trained on the extracted binary features using balanced scoring. Given (d) a query molecule, (e) binary features are extracted, and (f) the trained model is used for predicting bioactivity. [00304] FIG. 26 shows a process 2600 for grouping molecules with similar mechanism of action (MOA) in the following steps. Given (a) a collection of molecules, MOACluster extracts their binary features. Then a logistic regression classifier with ℓ1 regularization and balanced scoring is trained to predict bioactivity, and the model parameters are extracted. (c) MOACluster finds the indices of the top k coefficients and reduces molecule binary features to those k indices. (d) The molecules are clustered according to the reduced binary features. [00305] The machine learning model is trained on two datasets. The first dataset contains molecules from a US Food and Drug (FDA)—approved library, along with 800 natural products isolated from plant, animal, and microbial sources (total of 2335 unique compounds). Data on growth inhibition against Escheria coli is available for all the molecules. The corresponding test data contains growth inhibition of 162 molecules from the Drug Repurposing Hub. Each molecule in the test data is annotated with a mechanism of action by which it fights the disease it was originally purposed for. The second data set, CO-ADD, contains bio-activity data from 4,803 molecules against seven bacterial and fungal pathogens, which include Staphylococcus aureus, Pseudomonas aeruginosa, Acinetobacter baumannii, Candida albicans, Klebsiella pneumoniae, Cryptococcus neoformans, and Escheria coli . For this dataset, 80% of the molecules are randomly selected for training and the rest are allocated for testing. [00306] FIG.27 illustrates a graph 2700 showing a receiver operating characteristic (ROC) curve compared to the approach from Stokes et al. on predicting activity against E. coli. Here 2335 molecules have been used for training, 162 molecules have been used for testing. These test molecules correspond to the portion of the Drug Repurposing Hub for which screening data against E. coli is available (Stokes et al.). The ROC curve is for neural network model from Stokes et al. and InterPred. For false positive rates greater than 0.3, the models have nearly identical true positive rate. [00307] FIG. 28 shows a chart 2800 representing a distribution of the tanimoto similarity between each test data point and their closest neighbor in the training dataset. InterPred achieves nearly the same accuracy as Stokes et al. The area under the curve (AUC) for InterPred is 0.87 while the AUC for Stokes et al. is 0.88. Unlike Stokes et al., InterPred uses fully interpretable features. The average tanimoto similarity between test data points and their closest neighbors is 0.5035 and the standard deviation is 0.18. Only 1.2% of test data points are more than 90% similar to a training data point. [00308] FIG.29 shows the mechanism of action of molecules including at least one of the five most important simple rings according to the logistic regression model. For the majority of molecules with similar bioactive moiety, the mode of action is the same. For example beta-lactam rings (shown in blue), are present in antibiotics such as penicillin and cephalosporin, and they have been reported to prevent cell wall synthesis. The majority of the molecules with this ring are mapped to the cell wall inhibition (G1) mechanism of action. In cases when molecules with the same moiety are mapped to multiple mechanisms of action, those mechanisms of action are usually similar. For example, for cyclohexane (shown in purple) associated mechanisms of action are bacterial 30S ribosomal subunit inhibitor (G3) and protein synthesis inhibitor (G6), both related to inhibiting protein synthesis. For moiety 4-quinolone (shown in green), the associated modes of action are HDAC inhibitor (G18), DNA gyrase inhibitor (G2), and topoisomerase inhibitor (G7), which are all related to inhibition of bacterial nucleic acid synthesis. Molecules containing 4-quinolone are known to inhibit bacterial nucleic acid synthesis by disrupting the enzymes topoisomerase IV and DNA gyrase27. In cases where two molecules contain distinct bioactive moieties, they usually have distinct mechanisms of action. The only exceptions G13 and G17 can be explained by the fact that MAP kinases are a subset of Serine/Threonine Kinases28. Among all the pairs of molecules with the same mechanism of action, 76% are clustered together by MOACluster, and among all the pairs of molecules clustered together by MOACluster, 67.6% have identical and 71% have similar mechanisms of action Mechanism of action of molecules containing at least one of the five most important simple rings according to the logistic regression model. Each of the five rings are highlighted with a different color. Molecules sharing the same mechanism of action, as reported in the Drug Repurposing Hub, are further circled together. For the majority of molecules with similar bioactive moiety, the mode of action is the same. [00309] FIG.30 shows ROC curves of InterPred for prediction of growth inhibition for 7 different bacteria in the CO-ADD dataset. [00310] FIG.31 shows the top 31 ring/functional group features predicted to govern the mechanism of action of molecules along with pathogens they inhibit. The pathogens that are predicted to be inhibited by each moiety are also shown. It has been reported that guanadine and nitro are the bioactive moiety in various antibacterial molecules. Moreover hydrazone/hydarazine have been reported to be potent against S. aureus, A. baumannii, and C. albicans. [00311] InterPred is an interpretable machine learning algorithm for prediction of bioactivity, functional groups responsible for bioactivity, and mechanism of action by training on data. Below we describe various steps of the InterPred algorithm. [00312] Extracting molecular features. Presence of simple rings are extracted using open source package rdkits by finding symmetrized smallest set of smallest rings. Additionally the presence 32 functional groups are extracted by checking whether each molecule has a graph substructure matching the functional group using the descriptors module in RDKit. These substructures are de-duplicated using kekulized canonical SMILES36. Since small molecules have only a few simple rings, feature vectors for each molecules usually only have a few non-zero entries. [00313] Training. Both the extra trees ensemble classifier and logistic regression model with ℓ1 norm regularization are trained on training data and hyper-parameters are optimized via five-fold cross validation. The number of trees in the extra trees model was cross-validated for the numbers 10, 40, 70, 100, 130, and 160. The lambda parameter for ℓ1-regularized logistic regression was cross-validated for values . [00314] In logistic regression and extra trees, the loss function is of the form [00315] where t is used as an index for each training point, y t represents the true label of each molecule in the training dataset, x t represents the features of each molecule, f is a function with range [0,1], and L refers to a loss function that is low when ƒ (x t ) is close to y t and high otherwise. y t takes on value 1 if molecule t inhibits bacterial growth and 0 otherwise. In logistic regression, c is the coefficient vector of logistic regression and , where is a regularization parameter optimized via cross validation. In extra trees ƒ(x t ) is either 0 or 1 and is determined the by the majority label produced by all the trees in the extra trees ensemble. L( ƒ (x t ), y t ) is 0 if ƒ(x t ) and y t are not the same and 1 otherwise. In the training set introduced by Stokes et al., nearly 95% of the molecules do not have antibacterial activity. Such an imbalance could result in misclassification of bioactive molecules as inactive. To avoid this, a “balanced” approach is used. The objective function in (1) is modified to the following: [00316] Where bt is the number of training points with label y t . This way bioactive and inactive molecules will contribute to the training nearly equally. Identifying bioactive moieties. Substructures corresponding to largest positive coefficients of the logistic regression model are reported as bioactive moieties. [00317] Clustering mechanism of action. Since the logistic regression model is trained with ℓ1 regularization, only a few coefficients are non-zero in the model. InterPred algorithm first reduces the feature vector for each molecule to these non-zero features, and then molecules with identical reduced feature vectors are assigned to the same cluster. [00318] Seq2Saccharide: Discovering novel saccharide natural products by integrating computational mass spectrometry and microbial genome mining. [00319] FIG. 32 shows a process 2300 for predicting the structure of candidate saccharides from their biosynthetic gene clusters. (a) DNA sequences are mined to find saccharide encoding BGCs. (b) A database of enzymes required for the biosynthesis of each monomer (monomer pathway) is created. Each monomer is represented with a different shape (tringle, square, circle, or star). Furthermore, the tailoring modification enzymes are identified. The list of all genes required for biosynthesis of each monomer is formed. Various monomer sets (combinations of monomers) are considered per BGC. To test whether monomers in each set are likely to be present in the molecular product of the BGC, the overlap between the required genes and the genes present in that BGC are calculated. (c) A statistical significance score is assigned to each monomer set and its corresponding BGC, and (d) the most likely monomer sets are reported as candidate backbones for each BGC. (e) These backbones are further modified based on the modification enzymes present in the BGC. [00320] FIG. 33 shows a monomer database for Seq2Saccharide. (a) The bond formation mechanisms between monomers, wherein the first reaction, R3 group represents NDP, GDP, UDP, and dTDP; (b) Various monomers with their reaction groups defined in the database are illustrated. For each monomer the reaction sites are highlighted with different colors based on the bond formation mechanism they follow as shown in part a. [00321] FIG. 34 shows a list of saccharide post-assembly modifications collected from known saccharide BGCs from the MIBiG database. The modification sites and reactions are highlighted in red. In each case, the enzyme responsible for the modification is also shown. Modifications requiring more than one enzyme are noted with a plus sign between the enzymes. [00322] FIG. 35 shows enzymatic modifications are stored in a computer readable format for a) aminotransferase enzyme (sisomicin) and (b) dehydratase enzyme (desosamine). Commands disconnect/connect are used (for bonds) and add/remove (for chemical substructures). For example, in part (a), “Disconnect 1 2” removes the double bond between the carbon atom with index 1 and an oxygen atom with index 2. “Remove 2” removes an oxygen atom with index 2, while “add 3 N: 4 H: 5H", followed by "Connect 34 1" and "Connect 351" adds an amine group NH2. Finally, “Connect 131” connects the carbon atom with index 1 with nitrogen atom with index 3 in a single bond. Similarly, in part (b), command “Connect 142” forms a double bond between the carbon atom with index 1 and oxygen atom with index 4. [00323] FIG. 36 shows predicting neomcyin structure. Seq2Saccharide predicts the correct structure of neomycin by mining the genome of Streptomyces sp. ISP-5003, without any prior knowledge of it. Seq2Saccharide identifies three potential enzymes that are involved in the synthesis of 2-deoxystreptamine. In addition to the enzymatic monomers, Seq2Saccharide also considers two enzyme-independent metabolites, glucosamine and ribose which cannot be predicted from the BGC sequences. All possible combinations of these three monomers are considered as the potential backbones for neomycin. Furthermore, Seq2Saccharide identified the tailoring enzymes involved in glycosylation, ribosylation, dehydrogenation, hexosaminylation and deacetylation of neomycin backbone. [00324] FIG. 37 shows identifying neomcyin by searching the spectral dataset generated from the extracts of Streptomyces sp. ISP- 5003 against all the hypothetical structures predicted from each saccharide BGC. The fragments matching a peak in the neomcyin spectrum are shown at the top and the matched spectrum identified by Seq2Saccharide is illustrated at the bottom. Fragments are generated either by removing one bond (depth 1) or two bonds (depth 2) from the original structure. The red (green) structures and peaks correspond to neomycin fragments generated at depth 1 (2). [00325] FIG. 38 shows a graph 3800 benchmarking Seq2saccharide algorithm against Prism4 on 36 known saccharides from the MiBIG dataset. At Tanimoto threshold of 95%, Seq2saccharide correctly constructs four out of 42 known saccharides, while Prims4 cannot correctly identify any. At Tanimoto threshold of 80%, Seq2saccharide correctly predicts 10 out of 42 known saccharides, while Prims4 predicts five. [00326] Seq2Saccharide algorithm including the following steps detailed below (FIG. 32): (a) identifying saccharide BGCs, (b) annotating the BGCs with monomer pathways genes and finding modification enzymes, (c) predicting the monomer set, (d) predicting the hypothetical structure of saccharides by applying post-assembly modifications to the backbone based on identified enzymes, and (e) identifying novel saccharide by error-tolerant search of mass spectra against the genomic-driven hypothetical structures. Identifying saccharide BGCs. Saccharide BGCs are identified using Minowa and antiSMASH. [00327] Annotating the BGCs with monomer pathways genes. A database of 30 saccharide monomers have been constructed by literature mining. Using this database, three general enzymatic mechanisms for bond formation in Saccharide were identified (FIG.33). Moreover, a database of enzymes required for the biosynthesis of each monomer (monomer pathway) has been constructed, in the Hidden Markov Model profile format. Each BGC is annotated by all the genes present in these monomer pathways (FIG. 32) predicting the monomer sets. To test whether a set of monomers are likely present in a molecular product, Seq2Saccharide uses an approach similar to the gene set enrichment analysis. For each possible set of monomers, list of all genes required for the synthesis of the monomers in that set are formed. Then the hypothesis that these required genes significantly overlap with the genes present in the BGC is tested. A statistical significance score is assigned to each pair of monomer sets and BGCs, and the most likely monomer sets are reported as candidate backbones (FIG.32). [00328] Applying the post-assembly modifications. A database of 51 unique post- assembly modifications was mined from the gene clusters of 36 saccharides reported in the MIBiG database (FIG. 34). For each unique modification, Seq2Saccharide finds all the homologous enzymes that were predicted to drive that modification, to construct a Hidden Markov Model profile. Each modification is stored in a computer-readable format that includes a chemical motif where the modification happens, along with a series of node and bond alterations that tailor the motif to its mature structure (FIG. 35). Using the database of tailoring enzymes/modifications Seq2saccharide predicts all the hypothetical mature molecules through the following steps (FIG.32) : (i) annotating the modification enzymes in the BGC, and (ii) finding all the graph isomorphisms between the motifs corresponding to enzymes present in the BGCs and the backbone structures, using Ullman algorithm. Then, (iii) iteratively selecting each of these motifs and considering whether they are altered or not (a total of 2! hypothetical mature molecules, where & is the number of motifs present in the backbone structure). [00329] FIG. 39 shows an example process 3900 for identifying a natural product drug from microbial samples. Process 3900 shows natural product identification in the context of drug discovery. For example, microbial samples are collected (3902). To identify natural products from these microbial samples, the samples are sequenced with multi-omics (3904) processes. These processes generate genomic data including sequencing information and/or mass spectra of the samples. The sequences and mass spectra of the microbial samples are input into the data processing system 102. The data processing system 102 at step 3906 is configured to determine the structures of the small molecules represented in the mass spectra based on the processes described above. Candidate structures can be purified (3908) and tested (3910) against various infections. A resulting drug can be determined as an output (3912). [00330] FIG. 40 shows a flow diagram representing a process 4000 for identifying the structures of molecular compounds from mass spectrometry data. The process 4000 can be executed by a computing system, such as the computing system described in relation to FIG. 41. The process 4000 includes receiving (4002) data representing gene clusters, the gene clusters including one or more genes configured to encode one or more polypeptides or other small molecules. The process 4000 includes accessing (4004) a machine learning model, the machine learning model being trained with a training dataset that associates the gene clusters to structures of one or more small molecules represented in the data. The process 4000 includes applying (4006) the machine learning model to the data representing the gene clusters. The process 4000 includes identifying (4008), based on applying the machine learning model, one or more monomers associated with at least one gene cluster represented in the data. The process 4000 includes determining (4010) a structure for a natural product including the one or more monomers. [00331] In some implementations, the machine learning model is trained by performing operations comprising: accessing a set of hypothetical structures for natural products including the structure for the natural product; generating a set of random structures of molecules, the random structures including small molecules; testing, using mass spectrometry data representing known structures and the set of random structures, the set of hypothetical structures for the natural products including the structure for the natural product; generating a score for the structure, the score indicating a match between the structure and a known structure represented in the mass spectrometry data; filtering, based on the score, one or more hypothetical structures from the set of hypothetical structures to generate a filtered set of hypothetical structures that includes the structure for the natural product; and generating the training dataset for training the machine learning model, the training dataset including the filtered set of hypothetical structures. [00332] In some implementations, the process 4000 includes determining a class associated with the gene clusters; and accessing, based on the class, a set of training data that is specific to the class associated with the gene clusters. [00333] In some implementations, the process 4000 includes predicting a biological activity of an identified natural product based on the machine learning model that is trained with the training dataset. In some implementations, the process 4000 includes generating, based on predicting the activity, a data library comprising data that associates a gene cluster with a respective biological activity. In some implementations, the process 4000 includes purifying the natural product based on the determined structure. [00334] In some implementations, determining the structure for the natural product including the one or more monomers comprises: predicting, based on the one or more monomers that are identified, a core molecule that is assembled by combining a group of monomers; determining one or more particular gene clusters represented in the data that cause a change of a structure of the core molecule; and identifying an enzyme associated with one or more particular gene clusters that cause the change to the structure of the core molecule. [00335] In some implementations, the core molecule includes a peptide, and wherein the change comprises an addition of an amino acid. [00336] In some implementations, the change comprises an addition of a lipid tail to the core molecule. [00337] In some implementations, the change comprises an addition of a monomer to the core molecule. [00338] In some implementations, the data representing the gene clusters comprises one or more data signatures, wherein data signatures comprise a location of a gene cluster with respect to one or more other gene clusters; and wherein determining the structure for a natural product including the one or more monomers is based on the data signatures. [00339] In some implementations, the core molecule includes a peptide, and wherein the change comprises an addition of an amino acid. [00340] In some implementations, the core molecule includes a non-ribosomal peptide. [00341] In some implementations, the core molecule includes a ribosomally synthesized and post-translationally modified peptide. [00342] In some implementations, the core molecule includes a polyketide. [00343] In some implementations, the core molecule includes a saccharide or aminoglycoside. [00344] In some implementations, the change comprises an addition of a lipid tail to the core molecule. [00345] In some implementations, the core molecule includes a hybrid of non- ribosomal peptide and/or ribosomally synthesized and post-translationally modified peptide and/or a polyketide and/or a saccharide or aminoglycoside, and wherein the change comprises an addition of a monomer. [00346] In some implementations, the data representing the gene clusters comprises one or more data signatures, wherein data signatures comprise a location of a gene cluster with respect to one or more other gene clusters; and wherein determining the structure for a natural product including the one or more monomers is based on the data signatures. [00347] In some implementations, structures of predicted molecules are stored in a computer format that allows for accelerated search against mass spectra. [00348] FIG.41 shows an example computer system 4100 that includes a processor 4100, a memory 4120, a storage device 4130 and an input/output device 4140. Each of the components 4100, 4120, 4130 and 4140 can be interconnected, for example, by a system bus 4150. The processor 4110 is capable of processing instructions for execution within the system 4100. In some implementations, the processor 4100 is a single-threaded processor, a multi-threaded processor, or another type of processor. The processor 4100 is capable of processing instructions stored in the memory 4120 or on the storage device 4130. The memory 4120 and the storage device 4130 can store information within the system 4100. [00349] The input/output device 4140 provides input/output operations for the system 4100. In some implementations, the input/output device 4140 can include one or more of a network interface device, e.g., an Ethernet card, a serial communication device, e.g., an RS-232 port, and/or a wireless interface device, e.g., an 802.11 card, a 3G wireless modem, a 4G wireless modem, a 5G wireless modem, etc. In some implementations, the input/output device can include driver devices configured to receive input data and send output data to other input/output devices, e.g., keyboard, printer and display devices 4160. In some implementations, mobile computing devices, mobile communication devices, and other devices can be used. [00350] While this specification includes many specific implementation details, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular implementations. Certain features that are described in this specification in the context of separate implementations can also be implemented, in combination, in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations, separately, or in any suitable sub-combination. Moreover, although previously described features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can, in some cases, be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination. [00351] Particular implementations of the subject matter have been described. Other implementations, alterations, and permutations of the described implementations are within the scope of the following claims as will be apparent to those skilled in the art. While operations are depicted in the drawings or claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed (some operations may be considered optional), to achieve desirable results. [00352] Moreover, the separation or integration of various system modules and components in the previously described implementations should not be understood as requiring such separation or integration in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products. [00353] Accordingly, the previously described example implementations do not define or constrain the present disclosure. Other changes, substitutions, and alterations are also possible without departing from the scope of the present disclosure.