Biomolecules. 2021 Nov 30;11(12):1793.
doi: 10.3390/biom11121793.

MassGenie: A Transformer-Based Deep Learning Method for Identifying Small Molecules from Their Mass Spectra

Aditya Divyakant Shrivastava et al. Biomolecules.

Abstract

The 'inverse problem' of mass spectrometric molecular identification ('given a mass spectrum, calculate/predict the 2D structure of the molecule whence it came') is largely unsolved, and is especially acute in metabolomics where many small molecules remain unidentified. This is largely because the number of experimentally available electrospray mass spectra of small molecules is quite limited. However, the forward problem ('calculate a small molecule's likely fragmentation and hence at least some of its mass spectrum from its structure alone') is much more tractable, because the strengths of different chemical bonds are roughly known. This kind of molecular identification problem may be cast as a language translation problem in which the source language is a list of high-resolution mass spectral peaks and the 'translation' a representation (for instance in SMILES) of the molecule. It is thus suitable for attack using the deep neural networks known as transformers. We here present MassGenie, a method that uses a transformer-based deep neural network, trained on ~6 million chemical structures with augmented SMILES encoding and their paired molecular fragments as generated in silico, explicitly including the protonated molecular ion. This architecture (containing some 400 million elements) is used to predict the structure of a molecule from the various fragments that may be expected to be observed when some of its bonds are broken. Despite being given essentially no detailed or explicit rules about molecular fragmentation methods, isotope patterns, rearrangements, neutral losses, and the like, MassGenie learns the effective properties of the mass spectral fragment and valency space, and can generate candidate molecular structures that are very close or identical to those of the 'true' molecules. We also use VAE-Sim, a previously published variational autoencoder, to generate candidate molecules that are 'similar' to the top hit.
In addition to using the 'top hits' directly, we can produce a rank order of these by 'round-tripping' candidate molecules and comparing them with the true molecules, where known. As a proof of principle, we confine ourselves to positive electrospray mass spectra from molecules with a molecular mass of 500 Da or lower, including those in the last CASMI challenge (for which the results are known), getting 49/93 (53%) precisely correct. The transformer method, applied here for the first time to mass spectral interpretation, works extremely effectively both for mass spectra generated in silico and on experimentally obtained mass spectra from pure compounds. It seems to act as a Las Vegas algorithm, in that it either gives the correct answer or simply states that it cannot find one. The ability to create and to 'learn' millions of fragmentation patterns in silico, and therefrom to generate candidate structures (that do not have to be in existing libraries) directly, thus fully opens up the field of de novo small molecule structure prediction from experimental mass spectra.
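The translation framing in the abstract can be made concrete with a small sketch. The abstract does not describe how MassGenie actually tokenizes its inputs, so everything below is a hypothetical encoding chosen for illustration: the function names, the two-decimal rounding, and the SMILES token pattern are all assumptions. It shows one way a peak list could become a 'source sentence' and a SMILES string a 'target sentence'.

```python
import re

# Hypothetical encoding, NOT the authors' scheme: each peak becomes one
# source token (m/z rendered to a fixed number of decimals) and the SMILES
# string is split into atom, bond, ring-closure and branch tokens.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+()/\\.@:~*$]")


def peaks_to_source_tokens(peaks, decimals=2):
    """Sort peaks by descending m/z and render each one as a string token."""
    return [f"{mz:.{decimals}f}" for mz in sorted(peaks, reverse=True)]


def smiles_to_target_tokens(smiles):
    """Split a SMILES string into tokens; fail if any character is unmatched."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(peaks_to_source_tokens([91.05423, 363.21631, 121.06477]))
    print(smiles_to_target_tokens("CC(=O)Cl"))
```

A sequence-to-sequence model would then be trained on many such (source, target) pairs; the model itself is outside the scope of this sketch.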

Keywords: artificial intelligence; chemical space; deep learning; electrospray; generative methods; mass spectrometry; metabolomics; transformers.

Conflict of interest statement

The authors declare that they have no conflict of interest.

Figures

Figure 1
The overall strategy behind MassGenie. (A) The overall architecture and structure of the transformer used in MassGenie, our deep learning system for identifying molecules from their mass fragments (spectra). (B) The three basic strategies available to us for relating mass spectral peak lists to the molecules whence they might have come. (1) The transformer outputs only a single molecule. (2) We can take a series of candidate molecules, generate candidate mass spectra in silico using FragGenie, and compare them with the experimental mass spectra, using cosine similarity to rank order the candidates. (3) We can use VAE-Sim to generate further candidate molecules that are ‘close’ in chemical space (and possess the correct molecular formula) and rank those as in (2). This is done for both FragGenie-generated spectra (C) and experimental mass spectra (D).
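Strategy (2) in the caption, ranking candidates by the cosine similarity of their in silico spectra to the experimental spectrum, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: peaks are treated as present or absent (no intensities), they are matched by rounding m/z into fixed-width bins, and the candidate spectra are supplied by the caller rather than generated by FragGenie. The function names and the 0.01 bin width are hypothetical.

```python
import math


def binned(peaks, bin_width=0.01):
    """Map each m/z value to an integer bin index; duplicates collapse."""
    return {round(mz / bin_width) for mz in peaks}


def cosine_similarity(peaks_a, peaks_b, bin_width=0.01):
    """Cosine similarity of two peak lists viewed as binary bin vectors."""
    a, b = binned(peaks_a, bin_width), binned(peaks_b, bin_width)
    if not a or not b:
        return 0.0
    # For 0/1 vectors the dot product is the number of shared bins and
    # each norm is the square root of the number of occupied bins.
    return len(a & b) / math.sqrt(len(a) * len(b))


def rank_candidates(experimental, candidates):
    """Return (name, score) pairs sorted by descending similarity."""
    scored = [
        (name, cosine_similarity(experimental, spectrum))
        for name, spectrum in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    experimental = [100.0, 150.0, 200.0]
    candidates = {"A": [100.0, 150.0, 200.0], "B": [100.0, 175.0]}
    for name, score in rank_candidates(experimental, candidates):
        print(name, round(score, 3))
```

Fixed bins can split two peaks that lie close to a bin edge, so a tolerance-based matcher would be a more careful choice for real high-resolution data.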
Figure 2
Illustration of the means by which MassGenie can predict a series of candidate molecules. In this case, the ‘true’ molecule is shown on the left, and 18 candidate molecules (from 100 runs) shown on the right. It is clear that all are close isomers, containing a methoxybenzoate moiety linked via a secondary amine to a pyrazole ring with a trifluoromethyl substituent.
Figure 3
Analysis of a test set of predictions of the molecules behind 1350 in silico-fragmented mass spectral peaks. The peak lists were passed through the transformer after it had been trained and fine-tuned as described in methods. The analysis was done single-blind and the results fed back. The ‘closeness’ between the molecule estimated and the true molecule is given as the highest Tanimoto similarity based on six encodings. Four estimated molecules with a TS of ~0.9 are illustrated, together (at left) with the ‘true’ molecules. It is again obvious that they are extremely close structurally.
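The 'highest Tanimoto similarity based on six encodings' in this caption can be illustrated with a short sketch. The caption does not name the six encodings, so the fingerprints here are simply sets of 'on' bit indices supplied by the caller, and the encoding names in the usage example are placeholders. The Tanimoto coefficient itself is the standard ratio of shared bits to the union of bits.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define the similarity as zero
    return len(a & b) / len(union)


def best_tanimoto(true_fps, predicted_fps):
    """Highest Tanimoto similarity over the encodings present in both dicts.

    Both arguments map an encoding name to that encoding's on-bit set.
    """
    shared = true_fps.keys() & predicted_fps.keys()
    if not shared:
        raise ValueError("no fingerprint encoding in common")
    return max(tanimoto(true_fps[k], predicted_fps[k]) for k in shared)


if __name__ == "__main__":
    # Placeholder encodings with made-up bits, for illustration only.
    true_fps = {"enc1": {1, 2, 3, 4}, "enc2": {7, 9}}
    pred_fps = {"enc1": {2, 3, 4, 5}, "enc2": {7, 9}}
    print(best_tanimoto(true_fps, pred_fps))
```

Taking the maximum over encodings reports the most favourable view of each prediction, which is worth keeping in mind when reading the similarity values in the figure.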
Figure 4
Local candidate structures generated by VAE-Sim, based on 10 molecules taken from Figure 3 for which the Tanimoto similarity between the best hit and the true molecule was in the range 0.9–0.95. VAE-Sim increases the number of candidates, in many cases improving their closeness to the true molecule.
Figure 5
Local search where mass predictions are inaccurate. (A) All examples from Figure 3 in which the transformer predicted molecules with slightly incorrect masses. VAE-Sim was used to search locally and generate further candidate structures that were round-tripped using FragGenie to produce mass spectra that could again be ranked in terms of TS to the known, true molecule given either the optimal mass difference or the optimal TS. (B) Three examples showing the ground truth, the transformer’s best estimate, and the best (and accurate) prediction after the candidate pool of molecules was enhanced using VAE-Sim.
Figure 6
Illustration of the predictive power of MassGenie when presented with an experimental peak list (363.21631, 121.06477, 105.06985, 97.0648, 91.05423, 327.19522, 119.08553, 309.18469, 109.06478, 145.10121, 93.06989, 131.08549, 123.08039, 79.0542, 143.08563) of the true molecule cortisol, along with its canonicalized best proposals. In this case 2/4 are correct (and to a chemist’s eye far more biologically plausible).
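As a worked check on this peak list, the sketch below computes the theoretical m/z of protonated cortisol and compares it with the first listed peak. The molecular formula C21H30O5 and the rounded monoisotopic masses are standard reference values rather than figures taken from the source, and reading 363.21631 as the protonated molecular ion is an assumption made here. The helper names are hypothetical.

```python
import re

# Monoisotopic masses in unified atomic mass units, rounded reference values.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307400, "O": 15.99491462}
PROTON = 1.007276467  # mass of a proton (hydrogen atom minus one electron)


def monoisotopic_mass(formula):
    """Sum element masses for a simple formula such as 'C21H30O5'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


# Theoretical m/z of singly protonated cortisol, [M+H]+.
cortisol_mh = monoisotopic_mass("C21H30O5") + PROTON

if __name__ == "__main__":
    observed = 363.21631  # first peak in the list quoted in the caption
    print(f"theoretical [M+H]+ = {cortisol_mh:.5f}")
    print(f"mass error = {ppm_error(observed, cortisol_mh):.2f} ppm")
```

Under these assumptions the observed peak lies within about 1 ppm of the theoretical value, consistent with the high-resolution peak lists the method expects.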
