Review

Anal Bioanal Chem. 2025 Jan;417(3):473-493.

doi: 10.1007/s00216-024-05471-x. Epub 2024 Aug 14.

Critical review on in silico methods for structural annotation of chemicals detected with LC/HRMS non-targeted screening

Henrik Hupatz^#¹,², Ida Rahu^#³, Wei-Chieh Wang¹, Pilleriin Peets⁴, Emma H Palm⁵, Anneli Kruve⁶,⁷,⁸

Affiliations

¹ Department of Materials and Environmental Chemistry, Stockholm University, Svante Arrhenius Väg 16, 114 18, Stockholm, Sweden.
² Stockholm University Center for Circular and Sustainable Systems (SUCCeSS), Stockholm University, 106 91, Stockholm, Sweden.
³ Department of Materials and Environmental Chemistry, Stockholm University, Svante Arrhenius Väg 16, 114 18, Stockholm, Sweden. ida.rahu@mmk.su.se.
⁴ Institute of Biodiversity, Faculty of Biological Science, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743, Jena, Germany.
⁵ Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 Avenue du Swing, 4367, Belvaux, Luxembourg.
⁶ Department of Materials and Environmental Chemistry, Stockholm University, Svante Arrhenius Väg 16, 114 18, Stockholm, Sweden. anneli.kruve@su.se.
⁷ Stockholm University Center for Circular and Sustainable Systems (SUCCeSS), Stockholm University, 106 91, Stockholm, Sweden. anneli.kruve@su.se.
⁸ Department of Environmental Science, Stockholm University, Svante Arrhenius Väg 8, 114 18, Stockholm, Sweden. anneli.kruve@su.se.

^# Contributed equally.

PMID: 39138659
PMCID: PMC11700063
DOI: 10.1007/s00216-024-05471-x

Henrik Hupatz et al. Anal Bioanal Chem. 2025 Jan.

Abstract

Non-targeted screening with liquid chromatography coupled to high-resolution mass spectrometry (LC/HRMS) is increasingly leveraging in silico methods, including machine learning, to obtain candidate structures for structural annotation of LC/HRMS features and their further prioritization. Candidate structures are commonly retrieved based on the tandem mass spectral information either from spectral or structural databases; however, the vast majority of the detected LC/HRMS features remain unannotated, constituting what we refer to as a part of the unknown chemical space. Recently, the exploration of this chemical space has become accessible through generative models. Furthermore, the evaluation of the candidate structures benefits from the complementary empirical analytical information such as retention time, collision cross section values, and ionization type. In this critical review, we provide an overview of the current approaches for retrieving and prioritizing candidate structures. These approaches come with their own set of advantages and limitations, as we showcase in the example of structural annotation of ten known and ten unknown LC/HRMS features. We emphasize that these limitations stem from both experimental and computational considerations. Finally, we highlight three key considerations for the future development of in silico methods.

Keywords: Generative modeling; Machine learning; Non-targeted analysis; Non-targeted screening; Suspect screening; Untargeted screening.

Conflict of interest statement

Declarations. Conflict of interest: The authors declare no competing interests.

Figures

**Fig. 1**
Experimental workflow for analyzing an environmental sample using an LC/IM/HRMS experiment with electrospray ionization (ESI). Dark brown indicates the experimental analytical features (RT *t*_I, *CCS* *a*_I, and *m/z* *m*_I) of an unknown structure, while light brown marks its MS² features (p_I1, p_I2, and p_I3). *CCS* values are derived from arrival time distributions (ATD). The schematic table with analytical information appears in subsequent figures to highlight the in silico structural annotation workflow for LC/HRMS features

**Fig. 2**
In silico approaches for retrieving candidate structures, depicted as SMILES (simplified molecular input line entry system) notations, from MS² spectra. The shown MS² data are arbitrarily generated and do not correspond to any specific LC/HRMS feature or structure. Brown arrows indicate that candidate structures for the same LC/HRMS feature can be obtained with all four approaches. Circled icons represent in silico components for structural annotation and prioritization; examples of these are shown in Table 1. Dark green highlights the major step of each approach, and all icons are used consistently in the following figures

**Fig. 3**
Uniform Manifold Approximation and Projection (UMAP) plots illustrating the chemical space coverage of datasets widely used for LC/HRMS feature annotation (MassBank [12] and SIRIUS [26]) and for training ML models (RTI [34] and CCSBase v1.2 [35]) applied for predicting empirical analytical information used to prioritize candidate structures. The latent space of all relevant chemicals in environmental analysis was learned based on the SIRIUS+CSI:FingerID positive mode fingerprint (3878 bits) calculated from the SMILES representation of 370,167 chemicals in the PubChemLite 0.3.0 dataset. The resulting UMAP embedding was applied to all the datasets (4310 chemicals from MassBank, 21,188 chemicals from SIRIUS+CSI:FingerID positive mode training data, 1426 chemicals from RTI training data, and 4771 chemicals from CCSBase training data). For additional details, refer to Supplementary Information 1 (SI1) Section S6

**Fig. 4**
Training strategies employed by various GMs developed for candidate structure generation based on HRMS data, addressing the sparsity of training data. MassGenie [31] and MS2Mol [32] (blue) employ in silico and experimental databases for training. MSNovelist [33] (brown) is trained on the molecular fingerprints of chemicals from structural databases. The decoders of Spec2Mol [30] and JTVAE [29] (violet) are pre-trained on SMILES-to-SMILES translation. Mass2SMILES [28] (red) utilizes only experimental databases. Circled icons represent in silico components for structural annotation and prioritization; examples of these are shown in Table 1. Dark green highlights the major step of each approach, and all icons are used consistently throughout the figures of the manuscript

**Fig. 5**
Computational workflow illustrating the training process of an empirical analytical information (EAI) prediction model using RT as an example. The model is trained by utilizing molecular fingerprints and/or descriptors, followed by empirical analytical information prediction for candidate structures. The brown arrow indicates that retention times can be predicted for each candidate structure. Circled icons represent in silico components for structural annotation and prioritization; examples of these are shown in Table 1. Dark green highlights the major step of each approach, and all icons are used consistently throughout the figures of the manuscript

**Fig. 6**
Three chemicals sharing the same molecular formula (C₁₀H₁₀O₄) can exhibit distinct retention times and be detectable in different ESI modes, depending on their polarity and acid–base properties. A The peak corresponding to dimethyl phthalate (violet) is magnified by a factor of 10 for enhanced visibility of other chromatographic peaks. B Additionally, adduct formation and in-source fragmentation may offer supplementary insights into the localization of functional groups

**Fig. 7**
Visualization of the structural annotation and candidate structure prioritization results for six of the 20 LC/HRMS features studied (the remaining features are provided in SI1 Section S8). A The number of candidate structures obtained from experimental and in silico spectra matching with MassBank and MetFrag, and by employing SIRIUS+CSI:FingerID and Spec2Mol. Each candidate structure is represented by a colored circle, with the order indicating its rank within the annotation approach. Dual-colored circles represent candidate structures suggested by two methods. The middle panel illustrates the number of candidate structures prioritized based on predicted RT obtained with the RTI model and predicted *CCS* obtained with the CCSbase model. For features corresponding to the spiked chemicals, the correct structure is highlighted with a dark blue exclamation mark. B Visualization of the candidate structures in the chemical space using the UMAP embedding of PubChemLite (Fig. 3). All points are transparent, resulting in a darker color where data points overlap

**Fig. 8**
A Heatmap illustrating the structural similarities among candidates suggested by four methods employed for the structural annotation of 20 LC/HRMS features (SI1 Section S7). Each small colored square represents the similarity of candidate structures, calculated as the average of all pairwise Tanimoto similarities between all the suggested candidates within one LC/HRMS feature. The LC/HRMS features are sorted based on their *m/z* values. Brown indicates higher similarity, while green indicates lower similarity among candidate structures (the white midpoint of the colorbar (0.22) denotes the average similarity across all the suggested candidate structures, calculated as the mean using all the pairwise Tanimoto similarities of candidates). Light blue indicates that a specific LC/HRMS feature did not yield any candidate structures from a particular method. Numbers inside the larger squares represent the overall average similarity scores within or between the methods. B Experimentally obtained retention time (RT) values for 20 LC/HRMS features plotted against the predicted RT values from the RTI model for all candidate structures corresponding to each LC/HRMS feature. Candidates prioritized using the cutoff criterion of ± 2 standard residuals are highlighted in light brown. C Experimentally obtained *CCS* values for 20 LC/HRMS features plotted against the predicted *CCS* from the CCSbase model for all candidate structures corresponding to each LC/HRMS feature. Candidates prioritized using the criterion of difference between predicted and experimental values less than 3% are highlighted in light turquoise
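
The two numerical criteria named in the Fig. 8 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fingerprints are hypothetical toy sets of on-bit indices, the function names are invented for illustration, and the 3% *CCS* criterion is assumed to be a relative deviation with respect to the experimental value, which the caption does not specify.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_pairwise_tanimoto(fingerprints):
    """Average Tanimoto similarity over all unordered candidate pairs.

    Returns None when fewer than two candidates are available.
    """
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return None
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def within_ccs_tolerance(predicted: float, experimental: float,
                         tolerance: float = 0.03) -> bool:
    """True if the predicted CCS deviates by less than `tolerance`.

    Assumption: the deviation is taken relative to the experimental value.
    """
    return abs(predicted - experimental) / experimental < tolerance


# Hypothetical toy candidates for one LC/HRMS feature (not real data).
candidates = [
    frozenset({1, 2, 3}),
    frozenset({2, 3, 4}),
    frozenset({1, 2, 3, 4}),
]
print(mean_pairwise_tanimoto(candidates))
print(within_ccs_tolerance(predicted=152.0, experimental=150.0))
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written sets; the averaging and thresholding logic stays the same.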

Cited by

Do experimental projection methods outcompete retention time prediction models in non-target screening? A case study on LC/HRMS interlaboratory comparison data.
Malm L, Kruve A. Analyst. 2025 Aug 4;150(16):3567-3577. doi: 10.1039/d5an00323g. PMID: 40671565. Free PMC article.
Assessing the Impact of Measurement Precision on Metabolite Identification Probability in Multidimensional Mass Spectrometry-Based, Reference-Free Metabolomics.
Chang CH, Schwartz SC, Im AK, Bloodsworth KJ, Webb-Robertson BM, Ewing RG, Metz TO, Ross DH. Anal Chem. 2025 Jul 8;97(26):13861-13871. doi: 10.1021/acs.analchem.5c01067. Epub 2025 Jun 25. PMID: 40556554. Free PMC article.
Large-scale generation of in silico based spectral libraries to annotate dark chemical space features in non-target analysis.
Egede Frøkjær E, Hansen M. Anal Bioanal Chem. 2025 Sep 2. doi: 10.1007/s00216-025-06034-4. Online ahead of print. PMID: 40892243.

References

1. Black G, Lowe C, Anumol T, Bade J, Favela K, Feng Y-L, Knolhoff A, Mceachran A, Nuñez J, Fisher C, Peter K, Quinete NS, Sobus J, Sussman E, Watson W, Wickramasekara S, Williams A, Young T. Exploring chemical space in non-targeted analysis: a proposed ChemSpace tool. Anal Bioanal Chem. 2023;415:35–44. 10.1007/s00216-022-04434-4.
2. Renner G, Reuschenbach M. Critical review on data processing algorithms in non-target screening: challenges and opportunities to improve result comparability. Anal Bioanal Chem. 2023;415:4111–23. 10.1007/s00216-023-04776-7.
3. Hollender J, Schymanski EL, Ahrens L, Alygizakis N, Béen F, Bijlsma L, Brunner AM, Celma A, Fildier A, Fu Q, Gago-Ferrero P, Gil-Solsona R, Haglund P, Hansen M, Kaserzon S, Kruve A, Lamoree M, Margoum C, Meijer J, Merel S, Rauert C, Rostkowski P, Samanipour S, Schulze B, Schulze T, Singh RR, Slobodnik J, Steininger-Mairinger T, Thomaidis NS, Togola A, Vorkamp K, Vulliet E, Zhu L, Krauss M. NORMAN guidance on suspect and non-target screening in environmental monitoring. Environ Sci Eur. 2023;35:75. 10.1186/s12302-023-00779-4.
4. Hulleman T, Turkina V, O’Brien JW, Chojnacka A, Thomas KV, Samanipour S. Critical Assessment of the Chemical Space Covered by LC–HRMS Non-Targeted Analysis. Environ Sci Technol. 2023;57:14101–12. 10.1021/acs.est.3c03606.
5. Manz KE, Feerick A, Braun JM, Feng Y-L, Hall A, Koelmel J, Manzano C, Newton SR, Pennell KD, Place BJ, Godri Pollitt KJ, Prasse C, Young JA. Non-targeted analysis (NTA) and suspect screening analysis (SSA): a review of examining the chemical exposome. J Expo Sci Environ Epidemiol. 2023;33:524–36. 10.1038/s41370-023-00574-6.
