Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS
- PMID: 40410407
- DOI: 10.1038/s41587-025-02663-3
Abstract
Characterizing biological and environmental samples at a molecular level primarily uses tandem mass spectrometry (MS/MS), yet the interpretation of tandem mass spectra from untargeted metabolomics experiments remains a challenge. Existing computational methods for predictions from mass spectra rely on limited spectral libraries and on hard-coded human expertise. Here we introduce a transformer-based neural network pre-trained in a self-supervised way on millions of unannotated tandem mass spectra from our GNPS Experimental Mass Spectra (GeMS) dataset mined from the MassIVE GNPS repository. We show that pre-training our model to predict masked spectral peaks and chromatographic retention orders leads to the emergence of rich representations of molecular structures, which we named Deep Representations Empowering the Annotation of Mass Spectra (DreaMS). Further fine-tuning the neural network yields state-of-the-art performance across a variety of tasks. We make our new dataset and model available to the community and release the DreaMS Atlas, a molecular network of 201 million MS/MS spectra constructed using DreaMS annotations.
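The two pre-training objectives named in the abstract (masked-peak prediction and retention-order prediction) can be illustrated with a minimal sketch of how training examples might be built. This is not the authors' implementation: the function names `mask_peaks` and `retention_order_label`, the placeholder value `MASK_MZ`, the uniform random choice of peaks, and the default masking fraction are all assumptions made for illustration.

```python
import random

# Hypothetical placeholder written in place of a hidden m/z value.
MASK_MZ = -1.0


def mask_peaks(mzs, intensities, mask_fraction=0.3, seed=0):
    """Hide the m/z of a random subset of peaks in one spectrum.

    Returns (masked_peaks, targets), where masked_peaks is a list of
    (m/z, intensity) pairs with MASK_MZ in the hidden positions and
    targets maps each hidden peak index to its true m/z.
    Peaks are chosen uniformly at random here (an assumption).
    """
    if len(mzs) != len(intensities):
        raise ValueError("mzs and intensities must have the same length")
    rng = random.Random(seed)
    n_peaks = len(mzs)
    # Mask at least one peak when the spectrum is non-empty.
    n_masked = min(n_peaks, max(1, round(mask_fraction * n_peaks))) if n_peaks else 0
    hidden = set(rng.sample(range(n_peaks), n_masked))
    masked_peaks = [
        (MASK_MZ if i in hidden else mz, inten)
        for i, (mz, inten) in enumerate(zip(mzs, intensities))
    ]
    targets = {i: mzs[i] for i in sorted(hidden)}
    return masked_peaks, targets


def retention_order_label(rt_a, rt_b):
    """Binary target for a pair of spectra: 1 if A elutes before B, else 0."""
    return 1 if rt_a < rt_b else 0


if __name__ == "__main__":
    mzs = [50.0, 75.5, 120.1, 200.2, 310.9]
    intensities = [0.1, 0.5, 1.0, 0.3, 0.2]
    masked, targets = mask_peaks(mzs, intensities, mask_fraction=0.4, seed=1)
    print(masked)
    print(targets)
    print(retention_order_label(1.2, 3.4))
```

In this sketch a model would receive `masked_peaks` as input and be trained to recover the values in `targets`, while the retention-order label provides a second supervision signal that needs no structural annotation.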
© 2025. The Author(s).
Conflict of interest statement
Ethics and inclusion statement: All co-authors of this publication meet the authorship criteria outlined by Nature Portfolio journals, as detailed in the ‘Author contributions’. The authors have complied with the inclusion and ethics guidelines of the Nature Portfolio journals. Competing interests: T.P. is a co-founder of mzio GmbH, which develops technologies related to mass spectrometry data processing. The other authors declare no competing interests.
Grants and funding
- 891397/EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Sklodowska-Curie Actions (H2020 Excellent Science - Marie Sklodowska-Curie Actions)
- 101097822/EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 European Research Council (H2020 Excellent Science - European Research Council)
- 101120237/EC | Horizon 2020 Framework Programme (EU Framework Programme for Research and Innovation H2020)