Review
Anal Bioanal Chem. 2025 Jan;417(3):473-493.
doi: 10.1007/s00216-024-05471-x. Epub 2024 Aug 14.

Critical review on in silico methods for structural annotation of chemicals detected with LC/HRMS non-targeted screening

Henrik Hupatz et al. Anal Bioanal Chem. 2025 Jan.

Abstract

Non-targeted screening with liquid chromatography coupled to high-resolution mass spectrometry (LC/HRMS) is increasingly leveraging in silico methods, including machine learning, to obtain candidate structures for structural annotation of LC/HRMS features and their further prioritization. Candidate structures are commonly retrieved based on the tandem mass spectral information either from spectral or structural databases; however, the vast majority of the detected LC/HRMS features remain unannotated, constituting what we refer to as a part of the unknown chemical space. Recently, the exploration of this chemical space has become accessible through generative models. Furthermore, the evaluation of the candidate structures benefits from the complementary empirical analytical information such as retention time, collision cross section values, and ionization type. In this critical review, we provide an overview of the current approaches for retrieving and prioritizing candidate structures. These approaches come with their own set of advantages and limitations, as we showcase in the example of structural annotation of ten known and ten unknown LC/HRMS features. We emphasize that these limitations stem from both experimental and computational considerations. Finally, we highlight three key considerations for the future development of in silico methods.

Keywords: Generative modeling; Machine learning; Non-targeted analysis; Non-targeted screening; Suspect screening; Untargeted screening.

Conflict of interest statement

Declarations. Conflict of interest: The authors declare no competing interests.

Figures

Fig. 1
Experimental workflow for analyzing an environmental sample with an LC/IM/HRMS experiment using electrospray ionization (ESI). Dark brown indicates experimental analytical features (RT tI, CCS aI, and m/z mI) of an unknown structure, while light brown marks its MS2 features (pI1, pI2, and pI3). CCS values are derived from arrival time distributions (ATD). The schematic table with analytical information appears in subsequent figures to highlight the in silico structural annotation workflow for LC/HRMS features
Fig. 2
In silico approaches for retrieving candidate structures, depicted as SMILES (simplified molecular input line entry system) notations, from MS2 spectra. The shown MS2 data are arbitrarily generated and do not correspond to any specific LC/HRMS feature or structure. Brown arrows indicate that candidate structures for the same LC/HRMS feature can be obtained with all four approaches. Circled icons represent in silico components for structural annotation and prioritization; examples of these are shown in Table 1. Dark green highlights the major step of each approach, and all icons are used consistently in the following figures
Fig. 3
Uniform Manifold Approximation and Projection (UMAP) plots illustrating the chemical space coverage of datasets widely used for LC/HRMS feature annotation (MassBank [12] and SIRIUS [26]) and for training ML models (RTI [34] and CCSBase v1.2 [35]) applied for predicting empirical analytical information used to prioritize candidate structures. The latent space of all relevant chemicals in environmental analysis was learned based on the SIRIUS+CSI:FingerID positive mode fingerprint (3878 bits) calculated from the SMILES representation of 370,167 chemicals in the PubChemLite 0.3.0 dataset. The resulting UMAP embedding was applied to all the datasets (4310 chemicals from MassBank, 21,188 chemicals from SIRIUS+CSI:FingerID positive mode training data, 1426 chemicals from RTI training data, and 4771 chemicals from CCSBase training data). For additional details, refer to Supplementary Information 1 (SI1) Section S6
Fig. 4
Training strategies employed by various GMs developed for candidate structure generation based on HRMS data, addressing the sparsity of training data. MassGenie [31] and MS2Mol [32] (blue) employ in silico and experimental databases for training. MSNovelist [33] (brown) is trained on the molecular fingerprints of chemicals from structural databases. The decoders of Spec2Mol [30] and JTVAE [29] (violet) are pre-trained on SMILES-to-SMILES translation. Mass2SMILES [28] (red) utilizes only experimental databases. Circled icons represent in silico components for structural annotation and prioritization; examples of these are shown in Table 1. Dark green highlights the major step of each approach, and all icons are used consistently throughout the figures of the manuscript
Fig. 5
Computational workflow illustrating the training process of an empirical analytical information (EAI) prediction model using RT as an example. The model is trained by utilizing molecular fingerprints and/or descriptors, followed by empirical analytical information prediction for candidate structures. The brown arrow indicates that retention times can be predicted for each candidate structure. Circled icons represent in silico components for structural annotation and prioritization; examples of these are shown in Table 1. Dark green highlights the major step of each approach, and all icons are used consistently throughout the figures of the manuscript
Fig. 6
Three chemicals sharing the same molecular formula (C10H10O4) can exhibit distinct retention times and be detectable with different ESI modes, influenced by their polarity and acid–base properties. (A) The peak corresponding to dimethyl phthalate (violet) is magnified by a factor of 10× for enhanced visibility of other chromatographic peaks. (B) Additionally, adduct formation and in-source fragmentation may offer supplementary insights into the localization of functional groups
Fig. 7
Visualization of the structural annotation and candidate structure prioritization results for six of the 20 LC/HRMS features studied (the remaining features are provided in SI1 Section S8). (A) The number of candidate structures obtained from experimental and in silico spectra matching with MassBank and MetFrag, and by employing SIRIUS+CSI:FingerID and Spec2Mol. Each candidate structure is represented by a colored circle, with the order indicating its rank within the annotation approach. Dual-colored circles represent candidate structures suggested by two methods. The middle panel illustrates the number of candidate structures prioritized based on predicted RT obtained with the RTI model and predicted CCS obtained with the CCSbase model. For features corresponding to the spiked chemicals, the correct structure is highlighted with a dark blue exclamation mark. (B) Visualization of the candidate structures in the chemical space using the UMAP embedding of PubChemLite (Fig. 3). All points are transparent, resulting in a darker color when data points are overlaid
Fig. 8
(A) Heatmap illustrating the structural similarities among candidates suggested by four methods employed for the structural annotation of 20 LC/HRMS features (SI1 Section S7). Each small colored square represents the similarity of candidate structures, calculated as the average of all pairwise Tanimoto similarities between all the suggested candidates within one LC/HRMS feature. The LC/HRMS features are sorted based on their m/z values. Brown indicates higher similarity, while green indicates lower similarity among candidate structures. The white midpoint of the colorbar (0.22) denotes the average similarity across all the suggested candidate structures, calculated as the mean of all the pairwise Tanimoto similarities of candidates. Light blue indicates that a specific LC/HRMS feature did not yield any candidate structures from a particular method. Numbers inside the larger squares represent the overall average similarity scores within or between the methods. (B) Experimentally obtained retention time (RT) values for 20 LC/HRMS features plotted against the predicted RT values from the RTI model for all candidate structures corresponding to each LC/HRMS feature. Candidates prioritized using the cutoff criterion of ±2 standard residuals are highlighted in light brown. (C) Experimentally obtained CCS values for 20 LC/HRMS features plotted against the predicted CCS from the CCSbase model for all candidate structures corresponding to each LC/HRMS feature. Candidates prioritized using the criterion of difference between predicted and experimental values less than 3% are highlighted in light turquoise
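
The two quantities named in the Fig. 8 caption, the average pairwise Tanimoto similarity within one LC/HRMS feature and the 3% CCS criterion, can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes fingerprints are given as sets of on-bit indices and that the 3% difference is taken relative to the experimental CCS value; the caption states neither. All function names are hypothetical.

```python
from __future__ import annotations

from itertools import combinations


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Two empty fingerprints share no bits; treat their similarity as 0 (assumption).
    return len(a & b) / union if union else 0.0


def mean_pairwise_tanimoto(fingerprints: list[set[int]]) -> float | None:
    """Average of all pairwise Tanimoto similarities among the candidates of one feature.

    Returns None when fewer than two candidates are available, since no pair exists.
    """
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return None
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def within_ccs_tolerance(predicted: float, experimental: float,
                         tolerance: float = 0.03) -> bool:
    """True if the predicted CCS deviates from the experimental CCS by less than 3%.

    Assumption: the deviation is expressed relative to the experimental value.
    """
    return abs(predicted - experimental) / experimental < tolerance


if __name__ == "__main__":
    # Toy fingerprints, invented for illustration only.
    candidates = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
    print(mean_pairwise_tanimoto(candidates))
    print(within_ccs_tolerance(predicted=153.0, experimental=150.0))
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written sets, and the choice of fingerprint type changes the similarity values.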

References

    1. Black G, Lowe C, Anumol T, Bade J, Favela K, Feng Y-L, Knolhoff A, Mceachran A, Nuñez J, Fisher C, Peter K, Quinete NS, Sobus J, Sussman E, Watson W, Wickramasekara S, Williams A, Young T. Exploring chemical space in non-targeted analysis: a proposed ChemSpace tool. Anal Bioanal Chem. 2023;415:35–44. 10.1007/s00216-022-04434-4.
    2. Renner G, Reuschenbach M. Critical review on data processing algorithms in non-target screening: challenges and opportunities to improve result comparability. Anal Bioanal Chem. 2023;415:4111–23. 10.1007/s00216-023-04776-7.
    3. Hollender J, Schymanski EL, Ahrens L, Alygizakis N, Béen F, Bijlsma L, Brunner AM, Celma A, Fildier A, Fu Q, Gago-Ferrero P, Gil-Solsona R, Haglund P, Hansen M, Kaserzon S, Kruve A, Lamoree M, Margoum C, Meijer J, Merel S, Rauert C, Rostkowski P, Samanipour S, Schulze B, Schulze T, Singh RR, Slobodnik J, Steininger-Mairinger T, Thomaidis NS, Togola A, Vorkamp K, Vulliet E, Zhu L, Krauss M. NORMAN guidance on suspect and non-target screening in environmental monitoring. Environ Sci Eur. 2023;35:75. 10.1186/s12302-023-00779-4.
    4. Hulleman T, Turkina V, O’Brien JW, Chojnacka A, Thomas KV, Samanipour S. Critical Assessment of the Chemical Space Covered by LC–HRMS Non-Targeted Analysis. Environ Sci Technol. 2023;57:14101–12. 10.1021/acs.est.3c03606.
    5. Manz KE, Feerick A, Braun JM, Feng Y-L, Hall A, Koelmel J, Manzano C, Newton SR, Pennell KD, Place BJ, Godri Pollitt KJ, Prasse C, Young JA. Non-targeted analysis (NTA) and suspect screening analysis (SSA): a review of examining the chemical exposome. J Expo Sci Environ Epidemiol. 2023;33:524–36. 10.1038/s41370-023-00574-6.
