Review

Metabolomics. 2022 Dec 5;18(12):103.

doi: 10.1007/s11306-022-01963-y.

Good practices and recommendations for using and benchmarking computational metabolomics metabolite annotation tools

Niek F de Jonge¹, Kevin Mildau², David Meijer¹, Joris J R Louwen¹, Christoph Bueschl², Florian Huber³, Justin J J van der Hooft⁴ ⁵

Affiliations

¹ Bioinformatics Group, Wageningen University, Wageningen, the Netherlands.
² Department of Analytical Chemistry, Biochemical Network Analysis Lab, University of Vienna, Vienna, Austria.
³ Centre for Digitalization and Digitality (ZDD), University of Applied Sciences Düsseldorf, Düsseldorf, Germany.
⁴ Bioinformatics Group, Wageningen University, Wageningen, the Netherlands. justin.vanderhooft@wur.nl.
⁵ Department of Biochemistry, University of Johannesburg, Johannesburg, South Africa. justin.vanderhooft@wur.nl.

PMID: 36469190
PMCID: PMC9722809
DOI: 10.1007/s11306-022-01963-y

Niek F de Jonge et al. Metabolomics. 2022.

Abstract

Background: Untargeted metabolomics approaches based on mass spectrometry obtain comprehensive profiles of complex biological samples. However, on average only 10% of the molecules can be annotated. This low annotation rate hampers biochemical interpretation and effective comparison of metabolomics studies. Furthermore, de novo structural characterization of mass spectral data remains a complicated and time-intensive process. Recently, the field of computational metabolomics has gained traction and novel methods have started to enable large-scale and reliable metabolite annotation. Molecular networking and machine learning-based in-silico annotation tools have been shown to greatly assist metabolite characterization in diverse fields such as clinical metabolomics and natural product discovery.

Aim of review: We highlight recent advances in computational metabolite annotation workflows with a special focus on their evaluation and comparison with other tools. Whilst the progress is substantial and promising, we also argue that inconsistencies in how different tools are benchmarked hinder users in selecting the most appropriate and promising method for their research. We summarize the benchmarking strategies of the different tools and outline several recommendations for benchmarking and comparing novel tools.

Key scientific concepts of review: This review focuses on recent advances in mass spectral library-based and machine learning-supported metabolite annotation workflows. We discuss large-scale library matching and analogue search, the current bloom of mass spectral similarity scores, and how molecular networking has changed the field. In addition, the potentials and challenges of machine learning-supported metabolite annotation workflows are highlighted. Overall, recent developments in computational metabolomics have started to fundamentally change metabolomics workflows, and we expect that as a community we will be able to overcome current method performance ambiguities and annotation bottlenecks.

Keywords: Benchmarking; Machine learning; Mass fragmentation spectra; Mass spectrometry; Metabolite annotation and identification; Untargeted metabolomics.

Conflict of interest statement

Justin J.J. van der Hooft is a member of the Scientific Advisory Board of NAICONS Srl, Milano, Italy. All other authors declare no conflict of interests.

Figures

**Fig. 1**
Overview of different spectral comparison (**a–c**) and spectral organisation methods (**d**) for two MS/MS spectra A and B. a1) Using mass spectral binning (i.e., to account for small *m/z* value differences), mass fragmentation spectra are transformed into vectors that are subsequently compared using mathematical formulas. a2) Modifications of the binning schema can account for differences other than *m/z* values (e.g., account for neutral losses, use only fragments present in both spectra, etc.). a3) Besides the actual mass fragment signals, neutral losses within or between spectra alone can serve as input for the spectral comparisons. a4) The Entropy score is a recently developed and high-performing metric for spectral comparisons. b1) Spectral comparison can be based on automatically computer-learned representations (i.e., alternatives to fragment spectral binning). b2) Comparison of MS/MS spectra can be achieved automatically with machine/deep learning methods and thus correlate better with structural similarity (NN: Neural Networks, SVM: Support Vector Machines). c Fragment spectra can be “aligned” similarly to sequence alignment, which will report sub-spectra with overlapping fragments (i.e., certain structure parts of the two molecules; SIMILE: Significant Interrelation of MS/MS Ions via Laplacian Embedding). d Many MS/MS spectra can be organised into groups (molecular networking or mass spectral networking) or embedded in a lower-dimensional subspace (a proxy for structural similarity).
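The binning approach of panel a1 can be illustrated with a minimal Python sketch. It is not taken from the review or from any of the tools it discusses: the function names `bin_spectrum` and `cosine_similarity`, the 0.1 *m/z* bin width, and the 1000 *m/z* upper limit are assumptions chosen for illustration, and cosine similarity stands in for the unspecified "mathematical formulas" mentioned in the caption.

```python
import math


def bin_spectrum(peaks, bin_width=0.1, max_mz=1000.0):
    """Convert a list of (m/z, intensity) peaks into a fixed-length vector.

    bin_width and max_mz are illustrative assumptions, not values from the review.
    Peaks that fall into the same bin have their intensities summed.
    """
    n_bins = int(math.ceil(max_mz / bin_width))
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:  # peaks outside the covered range are ignored
            vector[index] += intensity
    return vector


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length intensity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty spectrum is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


# Hypothetical spectra A and B whose fragments differ only by small m/z shifts.
spectrum_a = [(100.02, 1.0), (150.51, 0.5)]
spectrum_b = [(100.04, 1.0), (150.53, 0.5)]
score = cosine_similarity(bin_spectrum(spectrum_a), bin_spectrum(spectrum_b))
```

In this sketch the small *m/z* shifts between A and B fall inside the same bins, so the two spectra map to the same vector. A narrower bin width would separate them, which is one reason such parameters should be reported when tools are compared.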

**Fig. 2**
Illustration of reference library imbalances with respect to chemical classes, instrument types, and annotation rates by precursor mass. These factors may affect machine learning training dataset quality and representativeness. a ClassyFire classes of all 24,101 unique structures from the positive ionisation mode MS/MS spectra in GNPS. Chemical compound classes were determined by using ClassyFire superclasses (Djoumbou Feunang et al., 2016). For simplicity, classes are numbered from most to least occurring. b Instrument types for the 314,318 positive ionisation mode spectra in GNPS. Instrument type names were simplified to the ones shown in the figure. c Parent mass distributions of the 314,318 positive ionisation mode spectra in GNPS, the 13,908 positive ionisation mode spectra in GNPS that had no annotated SMILES, and the 9129 spectra in the dataset used by Crüsemann et al. (2015). Matchms was used to process the mgf files in the same way as in MS2DeepScore; here, MS/MS spectra with at least one fragment peak and a parent mass were considered

**Fig. 3**
Two main machine learning (ML) based strategies applied today to link MS/MS spectra to molecules. Strategy 1 describes embedding-based library searches whereby chemically most related substances in a library are identified through comparisons of abstract embeddings of library molecules (step 1). This library can be expanded by including *in-silico* generated MS/MS spectra (step 2). Strategy 2 describes de novo structure elucidation directly from MS/MS spectra, circumventing any database comparison

**Fig. 4**
Benchmarking of MS2Deepscore with different types of test sets. In all figures the RMSE is determined separately for 10 Tanimoto score bins, followed by taking the average over these 10 bins. a RMSE of MS2Deepscore on test sets with 1500 spectra within a molecular mass range. b RMSE of MS2Deepscore on test sets with 1500 spectra of the most abundant ClassyFire superclasses. c Visualisation of the variance for different test set sizes. This shows there is a substantial difference between smaller test sets of 100 spectra
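The bin-averaged RMSE described in this caption can be sketched as follows. This is an illustrative reading of the caption, not the MS2Deepscore implementation: the function name `binned_rmse`, the use of equal-width bins over the Tanimoto range 0 to 1, and the choice to skip empty bins are assumptions.

```python
import math


def binned_rmse(true_scores, predicted_scores, n_bins=10):
    """Average of per-bin RMSE values, with bins defined on the true Tanimoto score.

    Assumes scores lie in [0, 1] and that bins have equal width; empty bins are skipped.
    """
    squared_errors = [[] for _ in range(n_bins)]
    for true, predicted in zip(true_scores, predicted_scores):
        # A true score of exactly 1.0 is assigned to the last bin.
        index = min(int(true * n_bins), n_bins - 1)
        squared_errors[index].append((predicted - true) ** 2)

    per_bin_rmse = [
        math.sqrt(sum(errors) / len(errors)) for errors in squared_errors if errors
    ]
    if not per_bin_rmse:
        raise ValueError("no score pairs were provided")
    return sum(per_bin_rmse) / len(per_bin_rmse)


# Hypothetical example: two low-similarity pairs predicted poorly, one
# high-similarity pair predicted exactly.
true_tanimoto = [0.05, 0.05, 0.95]
predicted = [0.15, 0.15, 0.95]
result = binned_rmse(true_tanimoto, predicted)
```

Averaging over bins gives each similarity range equal weight, whereas a single RMSE over all pairs is dominated by whichever range contains the most pairs.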
