Cell. 2024 Jul 11;187(14):3761-3778.e16.
doi: 10.1016/j.cell.2024.05.013. Epub 2024 Jun 5.

Discovery of antimicrobial peptides in the global microbiome with machine learning

Célio Dias Santos-Júnior et al. Cell.

Abstract

Novel antibiotics are urgently needed to combat the antibiotic-resistance crisis. We present a machine-learning-based approach to predict antimicrobial peptides (AMPs) within the global microbiome and leverage a vast dataset of 63,410 metagenomes and 87,920 prokaryotic genomes from environmental and host-associated habitats to create the AMPSphere, a comprehensive catalog comprising 863,498 non-redundant peptides, few of which match existing databases. AMPSphere provides insights into the evolutionary origins of peptides, including by duplication or gene truncation of longer sequences, and we observed that AMP production varies by habitat. To validate our predictions, we synthesized and tested 100 AMPs against clinically relevant drug-resistant pathogens and human gut commensals both in vitro and in vivo. A total of 79 peptides were active, with 63 targeting pathogens. These active AMPs exhibited antibacterial activity by disrupting bacterial membranes. In conclusion, our approach identified nearly one million prokaryotic AMP sequences, an open-access resource for antibiotic discovery.

Keywords: antibiotic discovery; antibiotic resistance; antimicrobial activity; antimicrobial peptides; global microbiome; machine learning; metagenomics.

Conflict of interest statement

Declaration of interests: C.F.-N. provides consulting services to Invaio Sciences and is a member of the Scientific Advisory Boards of Nowture S.L. and Phare Bio. The de la Fuente Lab has received research funding or in-kind donations from United Therapeutics, Strata Manufacturing PJSC, and Procter & Gamble, none of which were used in support of this work. An invention disclosure associated with this work has been submitted.

Figures

Figure 1. AMPSphere comprises 863,498 non-redundant c_AMPs from thousands of metagenomes and high-quality microbial genomes.
(A) To build the AMPSphere, we first assembled 63,410 publicly available metagenomes from diverse habitats. A modified version of Prodigal, which can also predict smORFs (30–300 bp), was used to predict genes on the resulting metagenomic contigs as well as on 87,920 microbial genomes from ProGenomes2. Macrel was applied to the 4,599,187,424 predicted smORFs to obtain 863,498 non-redundant c_AMPs (see also Fig. SI1). c_AMPs were then hierarchically clustered in a reduced amino acid alphabet using 100%, 85%, and 75% identity cutoffs. At 75% identity, we observed 118,051 non-singleton clusters, of which 8,788 were considered families (≥ 8 c_AMPs). (B) Only 9% of c_AMPs have detectable homologs in other small protein databases (SmProt 2, STsORFs), bioactive peptide databases (DRAMP version 3.0, starPepDB 45k), and general protein datasets (GMGCv1); see also Fig. SI2B. Also shown is the number of AMPSphere homologs in each database as well as the total. The number of homologs passing all of our quality tests, regardless of their experimental evidence of translation or transcription, is also shown along with the percentage it represents of the homologs identified. Note that some peptides have homologs in multiple databases and, thus, the total count is not the sum of the individual databases. (C) Rarefaction curves showing how AMP discovery is impacted by sampling, with most of the habitats presenting steep sampling curves. (D) Sharing of c_AMPs between habitats is limited. The width of ribbons represents the proportion of the shared c_AMPs in the habitat on the left; see also Fig. SI2C–D and Tables SI1 and SI2.
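The cluster bookkeeping described in panel A can be sketched in a few lines. This is a minimal illustration, assuming cluster assignments are already available as a mapping from peptide ID to cluster ID. The function name `summarize_clusters`, the input format, and the toy identifiers are hypothetical; only the family threshold (≥ 8 c_AMPs) comes from the caption, and the clustering itself is not reproduced here.

```python
from collections import Counter

FAMILY_MIN_SIZE = 8  # from the caption: families contain >= 8 c_AMPs


def summarize_clusters(assignments, family_min_size=FAMILY_MIN_SIZE):
    """Count singleton clusters, non-singleton clusters, and families.

    `assignments` maps a peptide ID to a cluster ID (hypothetical format).
    """
    sizes = Counter(assignments.values())  # cluster ID -> number of members
    return {
        "clusters": len(sizes),
        "singletons": sum(1 for n in sizes.values() if n == 1),
        "non_singletons": sum(1 for n in sizes.values() if n >= 2),
        "families": sum(1 for n in sizes.values() if n >= family_min_size),
    }


# Toy data: one cluster of 8 peptides, one of 3, and one singleton.
toy = {f"pep{i}": "c1" for i in range(8)}
toy.update({f"pep{i}": "c2" for i in range(8, 11)})
toy["pep11"] = "c3"
summary = summarize_clusters(toy)
```

Note that families are a subset of the non-singleton clusters, which matches the caption's phrasing of 8,788 families among 118,051 non-singleton clusters.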
Figure 2. Quality control of AMPSphere candidates.
(A) The number of AMPSphere candidates passing each of the proposed quality tests. The high-quality set is composed of 7.3% of candidates without experimental evidence and 2% of candidates with evidence of their translation or transcription. Also shown is the number of homologs found in the high-quality set of AMP candidates. Although the high-quality set displays some overlap with the homologs, most of the homologs are not found in the high-quality set. (B) The number of AMP candidates co-predicted by AMP prediction systems beyond Macrel: AMPScanner v2, ampir (with the model for mature peptides), amPEPpy, APIN (with its proposed model), AI4AMP, and AMPLify. Only a small portion of AMPSphere (<2%) cannot be co-predicted by any system other than Macrel.
Figure 3. Mutations in genes encoding large proteins generate c_AMPs as independent genomic entities.
(A) The distribution of positions (as a percentage of the length of the larger protein) from which the AMP homologs start their alignment is shown. About 7% of c_AMPs are homologous to proteins from GMGCv1, with approximately one-fourth of the hits sharing start positions with the larger protein. (B) As an illustrative example of an AMP homologous to a full-length protein, AMP10.271_016 was recovered from three samples of human saliva from the same donor. AMP10.271_016 is predicted to be produced by Prevotella jejuni, sharing the start codon (bolded) of an NAD(P)-dependent dehydrogenase gene (WP_089365220.1), the transcription of which was stopped by a mutation (in red; TGG > TGA). (C) The distribution of AMPs per OG class (left) and their enrichment in comparison to full-length proteins from GMGCv1 (right). OGs were classified into subgroups according to the number of c_AMPs they were affiliated with. The OGs of unknown function represent the largest (2,041 out of 3,792 OGs) and most enriched (P_Kruskal = 2.66 × 10^−39) class with homologs to c_AMPs in GMGCv1. Interestingly, when considered individually, the number of c_AMP hits to unknown OGs was the lowest (P_Kruskal = 6 × 10^−3). These results do not change when underrepresented OGs are excluded by using different thresholds (e.g., at least 10, 20, or 100 homologs per OG); see also Table SI3.
Figure 4. The genome context of c_AMPs shows a preference for neighborhoods containing ribosome assembly proteins (see Table SI4).
(A) c_AMPs in conserved genomic architectures tend to be closer to ribosomal machinery-related genes than protein families of other sizes (proteins of all lengths and small proteins with ≤ 50 amino acids). (B) The proportion of c_AMPs in a genome context involving antibiotic resistance genes is lower than in other gene families. (C) The proportion of c_AMPs in neighborhoods with antibiotic synthesis-related genes is very small (<0.25%). (D) The conserved genomic context of the gene encoding AMP10.015_426 is shown in different genomes (the tree on the left depicts the phylogenetic relationship of the genes homologous to it). This c_AMP is homologous to the ribosomal protein rpsH and is found in the context of rpsH and other ribosomal protein genes.
Figure 5. AMP variation in the AMPSphere database is taxonomy dependent.
(A) Shown are the fractions of AMPs (or AMP families) that are accessory (present in <50% of genomes from the same species), shell (50–95%), or core (≥95%). (B) Distribution of the lowest taxonomic level at which c_AMPs were annotated. In detail (right), the top 10 genera with the highest numbers of c_AMPs included in AMPSphere. Animal-associated genera (e.g., Prevotella, Faecalibacterium, CAG-110) contribute the most c_AMPs, possibly reflecting data sampling. (C) Using the ρAMP per genus (calculated with c_AMPs in AMPSphere), we observed the distribution of c_AMPs per phylum, with Bacillota A as the densest (the number of samples used to build the graph is shown above each box). (D) Taxonomy of the detected taxa in AMPSphere is shown using the GTDB reference tree. The gray bars show the ρAMP distribution with respect to taxonomy, with black bars representing the 95% confidence interval. Bacillota A, Actinomycetota, and Pseudomonadota are the densest phyla in c_AMPs. As a reference, the median of ρAMP for the presented genera is indicated by a magenta dashed line.
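The accessory, shell, and core categories in panel A follow directly from prevalence thresholds, which a short sketch can make concrete. The function name `classify_prevalence` and its arguments are hypothetical, and the caption's ranges overlap at exactly 95%, so assigning that boundary to "core" (per the "≥95%" wording) is an assumption of this sketch.

```python
def classify_prevalence(n_genomes_with_amp, n_genomes_in_species):
    """Label an AMP (or AMP family) by its prevalence within one species.

    Thresholds follow the caption: accessory (<50%), shell (50-95%),
    core (>=95%). Integer arithmetic avoids floating-point edge cases.
    """
    if n_genomes_in_species <= 0:
        raise ValueError("species must have at least one genome")
    if not 0 <= n_genomes_with_amp <= n_genomes_in_species:
        raise ValueError("count must be between 0 and the species total")
    if 100 * n_genomes_with_amp >= 95 * n_genomes_in_species:
        return "core"
    if 100 * n_genomes_with_amp >= 50 * n_genomes_in_species:
        return "shell"
    return "accessory"


labels = [classify_prevalence(k, 100) for k in (10, 50, 94, 95, 100)]
```

Here `labels` covers one value in each category plus both boundaries, which is where an off-by-one in the thresholds would show up.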
Figure 6. Amino acid composition, structure, antimicrobial activity, and mechanism of action of c_AMPs.
(A) Amino acid frequency in c_AMPs from AMPSphere, AMPs from databases (DRAMP version 3, APD3, and DBAASP), and encrypted peptides (EPs) from the human proteome. (B) Heat map with the percentage of secondary structure found for each peptide in three different solvents: water, 60% trifluoroethanol (TFE) in water, and 50% methanol (MeOH) in water. Secondary structure was calculated using the BeStSel server. (C) Activity of c_AMPs assessed against ESKAPEE pathogens and human gut commensal strains. Briefly, 10^6 CFU·mL−1 was exposed to c_AMPs two-fold serially diluted from 64 to 1 μmol·L−1 in 96-well plates and incubated at 37 °C for one day. After the exposure period, the absorbance of each well was measured at 600 nm. Untreated solutions were used as controls, and minimal concentration values for complete inhibition are presented as a heat map of antimicrobial activities (μmol·L−1) against 11 pathogenic and eight human gut commensal bacterial strains. All the assays were performed in three independent replicates, and the heat map shows the mode obtained within the two-fold dilution concentration range studied. Gram-positive (+) and Gram-negative (−) bacteria are indicated as such at the top of panel C. (D) Fluorescence values relative to polymyxin B (PMB, positive control) of the fluorescent probe 1-(N-phenylamino)naphthalene (NPN), which indicates outer membrane permeabilization of A. baumannii ATCC 19606 cells. (E) Fluorescence values relative to PMB (positive control) of 3,3′-dipropylthiadicarbocyanine iodide [DiSC3-(5)], a hydrophobic fluorescent probe used to indicate cytoplasmic membrane depolarization of A. baumannii ATCC 19606 cells. Depolarization of the cytoplasmic membrane occurred with slower kinetics than permeabilization of the outer membrane and took approximately 20 min to stabilize.
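The dilution scheme and replicate summary in panel C can be illustrated with a small sketch. The function names `twofold_series` and `reported_mic` are hypothetical. The caption states only that the mode of three replicates is shown, so the tie-breaking rule used here (take the highest concentration when no value repeats) is an assumption, not the authors' stated procedure.

```python
from statistics import multimode


def twofold_series(highest=64.0, lowest=1.0):
    """Two-fold serial dilution from `highest` down to `lowest` (umol/L)."""
    concentrations = []
    current = highest
    while current >= lowest:
        concentrations.append(current)
        current /= 2  # each step halves the concentration
    return concentrations


def reported_mic(replicate_mics):
    """Summarize replicate MIC readings by their mode.

    Assumption: if several values tie for the mode, report the highest
    one as the more conservative estimate.
    """
    return max(multimode(replicate_mics))


series = twofold_series()
mic_example = reported_mic([8, 8, 16])
```

With the defaults, the series contains seven concentrations, so each peptide and strain pair yields one of seven possible MIC values or no inhibition within the tested range.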
Figure 7. Anti-infective activity of AMPs in a pre-clinical animal model.
(A) Schematic of the skin abscess mouse model used to assess the anti-infective activity of the peptides against A. baumannii cells. (B) Peptides were tested at their MIC in a single dose one hour after the establishment of the infection. Each group consisted of three mice (n = 3), and the bacterial load used to infect each mouse was derived from a different inoculum. (C) To rule out toxic effects of the peptides, mouse weight was monitored throughout the experiment. Statistical significance in (B) was determined using one-way ANOVA where all groups were compared to the untreated control group; P-values are shown for each of the groups. Features on the violin plots represent median and upper and lower quartiles. Data in (C) are the mean ± the standard deviation. Figure created in BioRender.com.

Comment in

  • Machine learning identifies AMPs.
    Crunkhorn S. Nat Rev Drug Discov. 2024 Aug;23(8):581. doi: 10.1038/d41573-024-00111-6. PMID: 38937616. No abstract available.
