Comput Struct Biotechnol J. 2024 Apr 16;23:1864-1876.
doi: 10.1016/j.csbj.2024.04.035. eCollection 2024 Dec.

Unitig-centered pan-genome machine learning approach for predicting antibiotic resistance and discovering novel resistance genes in bacterial strains

Duyen Thi Do et al. Comput Struct Biotechnol J.

Abstract

In current genomic research, the widely used methods for predicting antimicrobial resistance (AMR) often rely on prior knowledge of known AMR genes or reference genomes. However, these methods have limitations, potentially resulting in imprecise predictions owing to incomplete coverage of AMR mechanisms and genetic variations. To overcome these limitations, we propose a pan-genome-based machine learning approach to advance our understanding of AMR gene repertoires and uncover possible feature sets for precise AMR classification. By building compacted de Bruijn graphs (cDBGs) from thousands of genomes and collecting the presence/absence patterns of unique sequences (unitigs) for Pseudomonas aeruginosa, we determined that using machine learning models on unitig-centered pan-genomes showed significant promise for accurately predicting the antibiotic resistance or susceptibility of microbial strains. Applying a feature-selection-based machine learning algorithm led to satisfactory predictive performance for the training dataset (with an area under the receiver operating characteristic curve (AUC) of > 0.929) and an independent validation dataset (AUC, approximately 0.77). Furthermore, the selected unitigs revealed previously unidentified resistance genes, allowing for the expansion of the resistance gene repertoire to those that have not previously been described in the literature on antibiotic resistance. These results demonstrate that our proposed unitig-based pan-genome feature set was effective in constructing machine learning predictors that could accurately identify AMR pathogens. Gene sets extracted using this approach may offer valuable insights into expanding known AMR genes and forming new hypotheses to uncover the underlying mechanisms of bacterial AMR.
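The presence/absence patterns of unitigs mentioned in the abstract can be pictured as a binary matrix with one row per strain and one column per unitig. The following is a minimal sketch of that idea, not the authors' pipeline: the strain names, sequences, and unitigs are made-up toy values, and plain substring matching (on either strand) stands in for whatever graph-based lookup a real tool would use.

```python
# Toy sketch: binary presence/absence matrix of unitigs across genomes.
# All sequences and names below are hypothetical illustration data.

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def presence_absence(genomes, unitigs):
    """Map each genome name to a 0/1 list, one entry per unitig.

    A unitig counts as present if it, or its reverse complement,
    occurs as an exact substring of the genome.
    """
    matrix = {}
    for name, seq in genomes.items():
        matrix[name] = [
            int(u in seq or reverse_complement(u) in seq) for u in unitigs
        ]
    return matrix

genomes = {
    "strain_A": "ATGGACCTAGTTCA",
    "strain_B": "ATGGACGTAGTTCA",
    "strain_C": "TGAACTACGTCCAT",  # reverse complement of strain_B
}
unitigs = ["ATGGAC", "GGACCTAGT", "GGACGTAGT", "TAGTTCA"]

matrix = presence_absence(genomes, unitigs)
for name, row in matrix.items():
    print(name, row)
```

Each row of such a matrix would serve as the feature vector for one strain, with the resistant or susceptible phenotype as the label for a classifier.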

Keywords: Antimicrobial resistance; De Bruijn graph; Feature selection; Pseudomonas aeruginosa; Unitig.

Conflict of interest statement

The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript and there is no financial interest to report. We certify that the submission is original work and is not under review at any other publication.

Figures

Graphical abstract
Fig. 1
Comprehensive workflow for AMR predictions: from compacted de Bruijn graph (cDBG) construction to feature selection and analysis.
Fig. 2
A simple example illustrating the construction of a cDBG for a collection of single-point-mutated sequences. A. A compilation of all k-mers (k = 5) present in two sequences (Seq 1 and 2) was generated. The last k-1 = 4 nucleotides of the first k-mer must match the first four nucleotides of the second k-mer in order to be connected. The bubble pattern denotes the single-nucleotide polymorphism (SNP) from C to G, with each arm of the bubble representing an allele. B. By compacting the linear paths of the graph, a more-concise representation of the DBG is obtained. Compared to the original DBG, which consisted of 15 nodes (k-mers), the cDBG can effectively capture the same variation with only four nodes (unitigs).
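The construction described in this caption can be reproduced on a small scale. The sketch below is an illustration under stated assumptions rather than the method used in the paper: the two 14-nucleotide sequences are hypothetical (chosen so that the node counts match the 15 k-mers and four unitigs quoted in the caption), k-mers are treated as single-stranded (no reverse complements or canonical k-mers), and isolated cycles are not handled.

```python
# Toy sketch: node-centric de Bruijn graph (k = 5) and its compaction
# into unitigs for two sequences that differ by one C -> G SNP.

def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_dbg(seqs, k):
    """Nodes are distinct k-mers; a -> b if a's last k-1 bases equal b's first k-1."""
    nodes = set()
    for s in seqs:
        nodes.update(kmers(s, k))
    by_prefix = {}
    for n in nodes:
        by_prefix.setdefault(n[:-1], []).append(n)
    succ = {n: sorted(by_prefix.get(n[1:], [])) for n in nodes}
    pred = {n: [] for n in nodes}
    for n, outs in succ.items():
        for m in outs:
            pred[m].append(n)
    return nodes, succ, pred

def compact(nodes, succ, pred):
    """Merge every maximal non-branching path into one unitig string."""
    def mergeable(a, b):
        # a -> b is merged only if it is a's sole exit and b's sole entry.
        return len(succ[a]) == 1 and len(pred[b]) == 1

    unitigs = []
    for n in sorted(nodes):
        # Skip nodes that sit inside a path started elsewhere.
        if len(pred[n]) == 1 and mergeable(pred[n][0], n):
            continue
        path, cur = n, n
        while len(succ[cur]) == 1 and mergeable(cur, succ[cur][0]):
            cur = succ[cur][0]
            path += cur[-1]  # each step adds exactly one new base
        unitigs.append(path)
    return sorted(unitigs)

seq1 = "ATGGACCTAGTTCA"  # hypothetical sequence, allele C at position 7
seq2 = "ATGGACGTAGTTCA"  # same sequence with allele G
nodes, succ, pred = build_dbg([seq1, seq2], k=5)
unitigs = compact(nodes, succ, pred)

print(len(nodes), "k-mers")    # 15 k-mers
print(len(unitigs), "unitigs") # 4 unitigs
print(unitigs)
```

The two middle unitigs ("GGACCTAGT" and "GGACGTAGT") are the arms of the bubble, one per allele, flanked by the shared left and right unitigs.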
Fig. 3
Mechanisms of AMR in P. aeruginosa. These mechanisms are classified into intrinsic, acquired, and adaptive resistance. Intrinsic resistance encompasses factors such as (1) efflux systems, reduced outer membrane permeability, and (2) antibiotic inactivation by various enzymes. Acquired antibiotic resistance involves resistance by mutations and acquisition of resistance genes (7). Mutations and acquisition of resistance genes occur through HGT. Adaptive antibiotic resistance includes (5) biofilm-mediated resistance. Virulence factors encompass (1) efflux systems, (4) two-component signaling systems (TCSSs), and (6) quorum sensing.
Fig. 4
Pan-genome growth curve of P. aeruginosa. The X-axis displays the quantity of genomes (strains), whereas the Y-axis signifies the count of distinct unitigs.
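A growth curve of this kind can be computed by adding genomes one at a time and counting the distinct unitigs seen so far. The sketch below assumes each genome has already been reduced to a set of unitig identifiers; the identifiers and the fixed genome order are hypothetical, and a real analysis might average the curve over several random orderings.

```python
# Toy sketch: cumulative count of distinct unitigs as genomes are added.

def growth_curve(unitig_sets):
    """Return the running number of distinct unitigs after each genome."""
    seen = set()
    curve = []
    for unitigs in unitig_sets:
        seen |= unitigs          # union with the unitigs of the new genome
        curve.append(len(seen))  # Y value for this number of genomes
    return curve

# Hypothetical unitig content of four genomes.
genome_unitigs = [
    {"u1", "u2", "u3"},
    {"u2", "u3", "u4"},
    {"u1", "u4"},
    {"u5"},
]

curve = growth_curve(genome_unitigs)
print(curve)  # [3, 4, 4, 5]
```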
Fig. 5
Performance evaluations of our AMR classifiers for various antibiotics in the training dataset. The left panel depicts the accuracy, and the right panel depicts AUC values.
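For readers less familiar with the two metrics, the sketch below computes accuracy and AUC for a handful of made-up predictions. It is a generic illustration, not the paper's evaluation code: the labels, scores, and the 0.5 decision threshold are assumptions, and AUC is computed through its rank interpretation (the fraction of resistant/susceptible pairs in which the resistant strain receives the higher score, with ties counted as one half).

```python
# Toy sketch: accuracy and AUC from binary labels and predicted scores.

def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    predictions = [int(s >= threshold) for s in scores]
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = resistant, 0 = susceptible.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

print(round(accuracy(labels, scores), 3))  # 0.667
print(round(auc(labels, scores), 3))       # 0.889
```

The two numbers differ because accuracy depends on a chosen threshold, whereas AUC summarizes how well the scores rank resistant above susceptible strains across all thresholds.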
Fig. 6
Comparative AMR classifier performance (AUC). This figure illustrates the performance evaluation of AMR classifiers using box-whisker plots that focus on the AUC. Performance is assessed across three distinct feature sets, with each plot corresponding to specific antibiotics. Outliers are indicated with solid circles. Significance levels: p < 0.001 (***), p < 0.01 (**), and p < 0.05 (*). Abbreviations for antibiotics: AMI, amikacin; CEF, ceftazidime; CIP, ciprofloxacin; GEN, gentamicin; IMI, imipenem; LEV, levofloxacin; MER, meropenem; and TOB, tobramycin.
Fig. 7
Gene content of XGBoost-selected feature subset. A. The percentage of genes identified using our approach and categorized based on common AMR mechanisms and bacterial defense and survival strategies observed in P. aeruginosa. B. Proportion of genes within the XGBoost-selected feature subset classified as virulence factors, accounting for diverse virulence factor subgroups.
Fig. 8
Antibiotic gene set intersections. This figure illustrates the intersection of eight sets of antibiotic genes. The intersection size represents the number of genes shared between multiple sets. For instance, an intersection of meropenem and ceftazidime with a value of 34 implies that there are 34 genes shared between these two antibiotics. The left lower panel represents the set size, which denotes the number of genes in each individual set. For example, the AMR gene set associated with meropenem and ceftazidime comprises 336 and 287 genes, respectively.
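The set sizes and intersection sizes described in this caption are simple set operations. The sketch below shows the pairwise case on invented data: the gene names are placeholders and the resulting counts bear no relation to the 34, 336, or 287 reported in the figure.

```python
# Toy sketch: set sizes and pairwise intersection sizes of gene sets.
from itertools import combinations

# Hypothetical gene sets keyed by antibiotic; gene names are placeholders.
gene_sets = {
    "meropenem":   {"geneA", "geneB", "geneC", "geneD"},
    "ceftazidime": {"geneB", "geneC", "geneE"},
    "tobramycin":  {"geneC", "geneF"},
}

# Set size: number of genes in each individual set.
set_sizes = {drug: len(genes) for drug, genes in gene_sets.items()}

# Intersection size: number of genes shared by each pair of sets.
pair_sizes = {
    (a, b): len(gene_sets[a] & gene_sets[b])
    for a, b in combinations(sorted(gene_sets), 2)
}

print(set_sizes)
for pair, size in pair_sizes.items():
    print(pair, size)
```

Intersections of three or more sets follow the same pattern with `combinations(sorted(gene_sets), r)` for larger `r` and a running `&` over the selected sets.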
Fig. 9
List of co-resistant genes that are shared across multiple antibiotics. Genes occurring across a higher number of antibiotics are positioned at the front of the list.
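The ordering described here, with genes that recur across more antibiotics listed first, amounts to counting how many per-antibiotic sets each gene appears in. The following sketch uses placeholder gene names and three made-up sets; breaking ties alphabetically is an assumption added to make the output deterministic.

```python
# Toy sketch: rank genes by the number of antibiotic gene sets containing them.
from collections import Counter

# Hypothetical gene sets keyed by antibiotic; gene names are placeholders.
gene_sets = {
    "meropenem":   {"geneA", "geneB", "geneC"},
    "ceftazidime": {"geneB", "geneC", "geneE"},
    "tobramycin":  {"geneC", "geneF"},
}

# Each gene is counted once per antibiotic set in which it occurs.
counts = Counter(gene for genes in gene_sets.values() for gene in genes)

# Keep genes shared by at least two antibiotics, most widely shared first.
co_resistant = sorted(
    (gene for gene, n in counts.items() if n >= 2),
    key=lambda gene: (-counts[gene], gene),
)

print([(gene, counts[gene]) for gene in co_resistant])
# [('geneC', 3), ('geneB', 2)]
```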
