Nat Commun. 2025 Apr 10;16(1):3420.
doi: 10.1038/s41467-025-58804-4.

Pre-trained molecular representations enable antimicrobial discovery

Roberto Olayo-Alarcon et al. Nat Commun. 2025.

Abstract

The rise in antimicrobial resistance poses a worldwide threat, reducing the efficacy of common antibiotics. Determining the antimicrobial activity of new chemical compounds through experimental methods remains time-consuming and costly. While compound-centric deep learning models promise to accelerate this search and prioritization process, current strategies require large amounts of custom training data. Here, we introduce a lightweight computational strategy for antimicrobial discovery that builds on MolE (Molecular representation through redundancy reduced Embedding), a self-supervised deep learning framework that leverages unlabeled chemical structures to learn task-independent molecular representations. By combining MolE representation learning with available, experimentally validated compound-bacteria activity data, we design a general predictive model that enables assessing compounds with respect to their antimicrobial potential. Our model correctly identifies recent growth-inhibitory compounds that are structurally distinct from current antibiotics. Using this approach, we discover de novo, and experimentally confirm, three human-targeted drugs as growth inhibitors of Staphylococcus aureus. This framework offers a viable, cost-effective strategy to accelerate antibiotic discovery.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Two-stage framework for antimicrobial discovery.
a The MolE pre-training framework uses a collection of 100,000 unlabeled structures from PubChem to learn a task-independent molecular representation. Each structure is represented as a molecular graph, from which two augmentations (YA and YB) are created by masking a randomly seeded subgraph. Each augmentation is encoded by a GIN backbone to produce a concatenated vector representation (rA, rB), which is then expanded into embedding vectors (zA, zB) using an MLP head. The cross-correlation matrix between the two embedding vectors is optimized to be as close as possible to the target identity matrix using the Barlow-Twins objective function. After pre-training, any molecular structure can be encoded into a fixed-length vector representation r, which captures relevant chemical information. b Publicly available measurements of growth inhibition against 40 microbial strains are used to train a predictive model. c The pre-trained molecular representation is combined with the compound-microbe activity measurements to train a machine-learning model that produces a probability for each compound-microbe combination, indicating how likely the compound is to inhibit the microbe’s growth. These probabilities are used to estimate a collection of Antimicrobial Potential scores that serve to prioritize compounds for experimental validation.
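
The Barlow-Twins objective named in panel a can be sketched in a few lines of plain Python. This is a minimal illustration of the general objective, not the authors' implementation: the function names, the list-of-lists batch layout, and the off-diagonal weight `lam=0.005` are assumptions made here for the example and are not taken from the record.

```python
import math

def standardize_columns(z):
    """Center each embedding dimension and scale it to unit variance over the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        # Fall back to 1.0 so a constant column does not divide by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (z[i][j] - mean) / std
    return out

def barlow_twins_loss(z_a, z_b, lam=0.005):
    """Penalize deviation of the cross-correlation matrix C from the identity.

    loss = sum_i (1 - C_ii)^2 + lam * sum_{i != j} C_ij^2
    lam is a hypothetical weight chosen for illustration.
    """
    n, d = len(z_a), len(z_a[0])
    a, b = standardize_columns(z_a), standardize_columns(z_b)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Cross-correlation between dimension i of view A and dimension j of view B.
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            loss += (1.0 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss
```

The loss reaches zero when matching dimensions of the two views are perfectly correlated and all other pairs of dimensions are uncorrelated, which is the "identity matrix" target described in the caption.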
Fig. 2. Illustration of MolE’s compound representation.
a UMAP embedding of MolE's representation of 100,000 chemical structures not seen in pre-training. b Comparison of Jaccard distance computed between ECFP4 representations and cosine distance computed between MolE representations with respect to the query molecule Ractopamine (PubChem ID: 56052), shown in panel c. Comparison of the four molecules with smallest distances (ranks) to Ractopamine according to MolE (top row) and ECFP4 (bottom row).
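
The two distances compared in panel b can be sketched as follows. This is a hedged illustration only: it assumes a fingerprint is given as a collection of on-bit indices and a learned representation as a list of floats, and the helper `rank_by_distance` is a hypothetical name introduced here, not part of MolE or any fingerprinting library.

```python
import math

def jaccard_distance(bits_a, bits_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / len(union)

def cosine_distance(u, v):
    """1 - cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

def rank_by_distance(query, candidates, distance):
    """Return candidate names ordered from nearest to farthest from the query."""
    return sorted(candidates, key=lambda name: distance(query, candidates[name]))
```

Calling `rank_by_distance` once with each distance function and its matching representation gives the two neighbor rankings that a comparison like panel c places side by side.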
Fig. 3. Predicting antimicrobial activity in the human gut microbiome.
a Molecular representations (ECFP4, Chemical Descriptors or MolE) are concatenated with a one-hot-encoding of the microbial strains to train an XGBoost model. b The model produces an antimicrobial predictive probability for each compound-microbe pair. The log-geometric mean of all probabilities corresponds to AP score G. c The predictive probabilities are thresholded into binary predictions of growth inhibition; the total number of strains predicted to be inhibited is determined (K). d Precision-recall curves on the test set for models trained with each molecular representation. PR-AUC is shown in legend. e Binary predictions for the growth-inhibitory activity of Diacerein by each model. The experimentally validated activity is shown in the bottom row. Each column is an individual strain. f Predictions for the antimicrobial activity of Halicin and Abaucin made by MolE-XGBoost. g List of 44 compounds in the test set comprising 24 compounds with experimentally validated broad-spectrum activity (i.e., inhibited strains ≥10) grouped by their intended target species. Each column represents a compound. Each entry represents the number of predicted inhibited strains (color-coded). The last row represents the experimentally determined ground truth number of inhibited strains.
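
Panels a to c describe three small computations: building the model input, the AP score G, and the strain count K. The sketch below illustrates them under stated assumptions: the natural logarithm is assumed for G, the decision threshold of 0.5 is a placeholder (the record does not state the value used), and all function names are hypothetical.

```python
import math

def pair_features(compound_vector, strain_index, n_strains):
    """Concatenate a compound representation with a one-hot strain encoding."""
    one_hot = [0.0] * n_strains
    one_hot[strain_index] = 1.0
    return list(compound_vector) + one_hot

def ap_score_g(probabilities):
    """Log of the geometric mean of the probabilities (the mean of their logs).

    The natural logarithm is an assumption; probabilities must be > 0.
    """
    if not probabilities:
        raise ValueError("at least one probability is required")
    if any(p <= 0.0 for p in probabilities):
        raise ValueError("probabilities must be strictly positive")
    return sum(math.log(p) for p in probabilities) / len(probabilities)

def inhibited_strain_count(probabilities, threshold=0.5):
    """K: number of strains whose predicted probability reaches the threshold.

    threshold=0.5 is a hypothetical default, not the value from the paper.
    """
    return sum(1 for p in probabilities if p >= threshold)
```

Because G averages log-probabilities, it is at most 0 and becomes strongly negative when a compound is predicted to be inactive against most strains, which matches the negative G values reported in the captions.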
Fig. 4. Predicting antimicrobial potential in the discovery MCE-based chemical library comprising 2320 compounds.
a UMAP embedding of MolE's representation of the 2320 compounds for which predictions are made. Compounds predicted to inhibit at least 10 strains (K ≥ 10) are highlighted in blue. b Literature-reported categorization of the antimicrobial activity for the 235 compounds with K ≥ 10. This set comprises 77 antibiotics (33%, shown in red) and 158 non-antibiotic drugs (67%, shown in blue). The non-antibiotic drugs are further categorized into five classes (colored in the outer ring). c Antimicrobial Potential score G vs. number of predicted inhibited strains K of all 2320 compounds. The dashed line marks K = 10. All known antibiotics present in the library (n = 93 antibiotics) are shown in red, while non-antibiotic compounds with K ≥ 10 (n = 158 compounds) are shown in blue. Boxplots show the distribution of G and K, with the median value shown as the middle line, first (Q1) and third quartiles (Q3) shown as the box limits, and the whiskers extending to the most extreme data points within 1.5 times the interquartile range. The top boxplots show the distribution of G for antibiotics (n = 93, median = −2.18, Q1 = −3.86, Q3 = −1.05, lower whisker = −7.71, upper whisker = −7.4 × 10⁻⁵), non-antibiotics with K ≥ 10 (n = 158, median = −4.38, Q1 = −5.47, Q3 = −2.91, lower whisker = −7.83, upper whisker = −0.45), and non-antibiotics with K < 10 (n = 2068, median = −13.27, Q1 = −16.09, Q3 = −10.56, lower whisker = −20.43, upper whisker = −5.73). Similarly, boxplots on the right show the distribution of K for antibiotics (median = 33, Q1 = 25, Q3 = 37, lower whisker = 8, upper whisker = 40), non-antibiotics with K ≥ 10 (median = 23, Q1 = 15, Q3 = 32, lower whisker = 10, upper whisker = 32), and non-antibiotics with K < 10 (median = 0, Q1 = 0, Q3 = 0, lower whisker = 0, upper whisker = 0). d Scatter plot of the Antimicrobial Potential for Gram-positive (G+) vs. Gram-negative strains (G−) determined for the 158 non-antibiotic drugs with predicted broad-spectrum activity (K ≥ 10). The coloring of each compound corresponds to the categorization in (b).
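
The boxplot convention described in panel c (box at Q1 and Q3, whiskers at the most extreme data points within 1.5 times the interquartile range) can be reproduced with the standard library. This is a sketch only: the quartile interpolation method is not stated in the caption, so the `"inclusive"` method of `statistics.quantiles` is an assumption, and `boxplot_summary` is a hypothetical helper name.

```python
import statistics

def boxplot_summary(values):
    """Return median, quartiles, and whisker ends for a list of numbers."""
    data = sorted(values)
    # Quartile interpolation method is an assumption; plotting libraries differ.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
    }
```

Points outside the fences are not included in the whiskers, which is why a whisker value can sit well inside the full range of the data.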
Fig. 5. Relationship between AP scores and minimum inhibitory concentration (MIC).
a Regression and correlation analysis between the AP score G and the corresponding log2 literature-reported MICs (μg/mL) for 31 non-antibiotic compounds against Gram-positive (shown in blue) and Gram-negative (shown in yellow) species. In total, 39 compound-species combinations are shown, Spearman’s ρ = −0.5 (two-sided correlation test p value = 0.001). The linear regression fit is shown in pink with standard error shown as gray bands (slope = −0.34, two-sided coefficient test p value = 0.004). b AP scores (AP-G) and MIC (μg/mL) values of the compounds with top-3 and bottom-3 MIC values along with the respective inhibited bacterial species.
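
Spearman's ρ, the rank correlation reported in panel a, is the Pearson correlation of the ranks of the two variables. The sketch below shows that computation with average ranks for ties. The function names are hypothetical, and the snippet does not compute the p value or reproduce the paper's data.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A negative ρ between G and log2 MIC, as in the caption, means that compounds with higher AP scores tend to have lower reported MICs.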
Fig. 6. Experimental validation of antimicrobial activity.
a Average OD600 measurements (± standard deviation) after 10 hours of growth at increasing concentrations of each compound. Three biological replicates were performed for each compound-concentration-species combination. b Growth curves for S. aureus when grown in the presence of Water (33 growth curves gathered from 3 biological replicates), DMSO (3 biological replicates), and Opicapone (3 biological replicates for each concentration). c Growth curves for S. aureus when grown in the presence of Water (33 growth curves gathered from 3 biological replicates), DMSO (3 biological replicates), and Ebastine (3 biological replicates).
