Comput Struct Biotechnol J. 2024 Apr 16:23:1864-1876.

doi: 10.1016/j.csbj.2024.04.035. eCollection 2024 Dec.

Unitig-centered pan-genome machine learning approach for predicting antibiotic resistance and discovering novel resistance genes in bacterial strains

Duyen Thi Do¹, Ming-Ren Yang¹,², Tran Nam Son Vo³, Nguyen Quoc Khanh Le⁴, Yu-Wei Wu¹,⁵,⁶

Affiliations

¹ Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan.
² Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
³ Department of Business Administration, College of Management, Lunghwa University of Science and Technology, Taoyuan City, Taiwan.
⁴ Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan.
⁵ Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei, Taiwan.
⁶ TMU Research Center for Digestive Medicine, Taipei Medical University, Taipei, Taiwan.

PMID: 38707536
PMCID: PMC11067008
DOI: 10.1016/j.csbj.2024.04.035

Duyen Thi Do et al. Comput Struct Biotechnol J. 2024.

Abstract

In current genomic research, widely used methods for predicting antimicrobial resistance (AMR) often rely on prior knowledge of known AMR genes or on reference genomes. These methods can yield imprecise predictions because they cover AMR mechanisms and genetic variation incompletely. To overcome these limitations, we propose a pan-genome-based machine learning approach to advance our understanding of AMR gene repertoires and uncover possible feature sets for precise AMR classification. By building compacted de Bruijn graphs (cDBGs) from thousands of genomes and collecting the presence/absence patterns of unique sequences (unitigs) for *Pseudomonas aeruginosa*, we determined that machine learning models trained on unitig-centered pan-genomes showed significant promise for accurately predicting the antibiotic resistance or susceptibility of microbial strains. Applying a feature-selection-based machine learning algorithm led to satisfactory predictive performance on the training dataset (area under the receiver operating characteristic curve (AUC) > 0.929) and on an independent validation dataset (AUC approximately 0.77). Furthermore, the selected unitigs revealed previously unidentified resistance genes, expanding the resistance gene repertoire to genes not previously described in the literature on antibiotic resistance. These results demonstrate that our proposed unitig-based pan-genome feature set was effective in constructing machine learning predictors that could accurately identify AMR pathogens. Gene sets extracted using this approach may offer valuable insights into expanding known AMR genes and forming new hypotheses to uncover the underlying mechanisms of bacterial AMR.

Keywords: Antimicrobial resistance; De Bruijn graph; Feature selection; Pseudomonas aeruginosa; Unitig.
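
The core idea of the abstract, turning genomes into unitig presence/absence patterns and scoring how well those patterns separate resistant from susceptible strains, can be sketched in a few lines. This is only an illustration under stated assumptions: the genomes, labels, and unitigs below are hypothetical toy data, presence is tested by plain substring matching, and each unitig is scored on its own with a rank-based AUC. It is not the authors' feature-selection pipeline.

```python
# Toy sketch: unitig presence/absence features and a per-feature AUC.
# All sequences and labels are hypothetical, not data from the paper.

def presence_absence(genomes, unitigs):
    """Return one 0/1 row per genome: 1 if the unitig occurs in the genome."""
    return [[1 if u in g else 0 for u in unitigs] for g in genomes]

def auc(labels, scores):
    """Rank-based AUC: the probability that a resistant strain (label 1)
    scores higher than a susceptible one (label 0), counting ties as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

genomes = [
    "TACGTTCAGGATCA",      # hypothetical resistant strain
    "TACGTTGAGGATCA",      # hypothetical susceptible strain
    "GGTACGTTCAGGATCATT",  # hypothetical resistant strain
    "CCTACGTTGAGGATCAGG",  # hypothetical susceptible strain
]
labels = [1, 0, 1, 0]
unitigs = ["CGTTCAGGA", "CGTTGAGGA", "TACGTT"]

matrix = presence_absence(genomes, unitigs)

# Score each unitig column on its own against the resistance labels.
per_unitig_auc = {
    u: auc(labels, [row[j] for row in matrix]) for j, u in enumerate(unitigs)
}
```

In this toy data the first unitig tracks resistance perfectly, the second tracks susceptibility, and the third is present everywhere and therefore uninformative. A real analysis would pass the full matrix to a classifier rather than rank unitigs one at a time.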

Conflict of interest statement

The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript and there is no financial interest to report. We certify that the submission is original work and is not under review at any other publication.

Figures

**Fig. 1**
Comprehensive workflow for AMR predictions: from compacted de Bruijn graph (cDBG) construction to feature selection and analysis.

**Fig. 2**
A simple example illustrating the construction of a cDBG for a collection of single-point-mutated sequences. A. A compilation of all k-mers (k = 5) present in two sequences (Seq 1 and 2) was generated. The last k-1 = 4 nucleotides of the first k-mer must match the first four nucleotides of the second k-mer in order to be connected. The bubble pattern denotes the single-nucleotide polymorphism (SNP) from C to G, with each arm of the bubble representing an allele. B. By compacting the linear paths of the graph, a more-concise representation of the DBG is obtained. Compared to the original DBG, which consisted of 15 nodes (k-mers), the cDBG can effectively capture the same variation with only four nodes (unitigs).
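
The construction described in the Fig. 2 caption can be reproduced with a small sketch. The two sequences below are hypothetical (they are not the ones drawn in the figure) but were chosen so that, with k = 5, a single C-to-G SNP yields 15 k-mer nodes that compact into four unitigs. The sketch ignores reverse complements and skips isolated cycles, both of which production tools handle.

```python
# Minimal de Bruijn graph construction and compaction (forward strand only).

def build_dbg(seqs, k):
    """Nodes are distinct k-mers; u -> v when u's last k-1 bases equal v's first k-1."""
    kmers = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    succ = {km: [] for km in kmers}
    pred = {km: [] for km in kmers}
    for km in kmers:
        for base in "ACGT":
            nxt = km[1:] + base
            if nxt in kmers:
                succ[km].append(nxt)
                pred[nxt].append(km)
    return succ, pred

def compact(succ, pred):
    """Merge every maximal non-branching path into one unitig string."""
    def linear(u, v):
        # The link u -> v can be merged only if it is u's sole exit and v's sole entry.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    unitigs = []
    for node in sorted(succ):
        # A node that extends its single predecessor is not a unitig start.
        if len(pred[node]) == 1 and linear(pred[node][0], node):
            continue
        path, cur = [node], node
        while len(succ[cur]) == 1 and linear(cur, succ[cur][0]):
            cur = succ[cur][0]
            if cur == node:  # defensive guard against walking a cycle
                break
            path.append(cur)
        # Consecutive k-mers overlap by k-1 bases, so append only the last base.
        unitigs.append(path[0] + "".join(p[-1] for p in path[1:]))
    return unitigs

seq1 = "TACGTTCAGGATCA"  # hypothetical sequence carrying the C allele
seq2 = "TACGTTGAGGATCA"  # same sequence with a C -> G SNP
succ, pred = build_dbg([seq1, seq2], k=5)
unitigs = compact(succ, pred)
```

Here `unitigs` holds the shared left flank, the two arms of the bubble (one per allele), and the shared right flank.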

**Fig. 3**
Mechanisms of AMR in *P. aeruginosa*. These mechanisms are classified into intrinsic, acquired, and adaptive resistance. Intrinsic resistance encompasses factors such as (1) efflux systems, reduced outer membrane permeability, and (2) antibiotic inactivation by various enzymes. Acquired antibiotic resistance involves resistance by mutations and acquisition of resistance genes (7). Mutations and acquisition of resistance genes occur through HGT. Adaptive antibiotic resistance includes (5) biofilm-mediated resistance. Virulence factors encompass (1) efflux systems, (4) two-component signaling systems (TCSSs), and (6) quorum sensing.

**Fig. 4**
Pan-genome growth curve of *P. aeruginosa*. The X-axis displays the quantity of genomes (strains), whereas the Y-axis signifies the count of distinct unitigs.
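
A growth curve of this kind is a running count of distinct unitigs as genomes are added one at a time. The sketch below assumes each genome has already been reduced to a set of unitig identifiers; the identifiers and the genome order are hypothetical.

```python
# Cumulative number of distinct unitigs as genomes are added in the given order.

def growth_curve(unitig_sets):
    seen = set()
    counts = []
    for unitigs in unitig_sets:
        seen |= unitigs          # add this genome's unitigs to the pan-genome
        counts.append(len(seen))
    return counts

# Hypothetical unitig content of four genomes.
genome_unitigs = [
    {"u1", "u2", "u3"},
    {"u2", "u3", "u4"},
    {"u1", "u4"},
    {"u5"},
]
curve = growth_curve(genome_unitigs)
```

The curve depends on the order in which genomes are added, so a single ordering is only one possible trajectory.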

**Fig. 5**
Performance evaluations of our AMR classifiers for various antibiotics in the training dataset. The left panel depicts the accuracy, and the right panel depicts AUC values.

**Fig. 6**
Comparative AMR classifier performance (AUC). This figure illustrates the performance evaluation of AMR classifiers using box-whisker plots that focus on the AUC. Performance is assessed across three distinct feature sets, with each plot corresponding to specific antibiotics. Outliers are indicated with solid circles. Significance levels: p < 0.001 (***), p < 0.01 (**), and p < 0.05 (*). Abbreviations for antibiotics: AMI, amikacin; CEF, ceftazidime; CIP, ciprofloxacin; GEN, gentamicin; IMI, imipenem; LEV, levofloxacin; MER, meropenem; and TOB, tobramycin.

**Fig. 7**
Gene content of XGBoost-selected feature subset. A. The percentage of genes identified using our approach and categorized based on common AMR mechanisms and bacterial defense and survival strategies observed in *P. aeruginosa*. B. Proportion of genes within the XGBoost-selected feature subset classified as virulence factors, accounting for diverse virulence factor subgroups.

**Fig. 8**
Antibiotic gene set intersections. This figure illustrates the intersections of eight sets of antibiotic genes. The intersection size represents the number of genes shared between multiple sets. For instance, an intersection of meropenem and ceftazidime with a value of 34 implies that 34 genes are shared between these two antibiotics. The lower left panel shows the set size, which denotes the number of genes in each individual set. For example, the AMR gene sets associated with meropenem and ceftazidime comprise 336 and 287 genes, respectively.
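
The two quantities in this caption, set size and intersection size, are plain set operations. The sketch below uses hypothetical placeholder gene names and only three antibiotics, and it computes simple pairwise overlaps as the caption describes them.

```python
from itertools import combinations

# Hypothetical gene sets per antibiotic (placeholder names, not the paper's genes).
gene_sets = {
    "MER": {"g1", "g2", "g3", "g4"},
    "CEF": {"g2", "g3", "g5"},
    "CIP": {"g3", "g6"},
}

# Set size: number of genes selected for each antibiotic.
set_sizes = {name: len(genes) for name, genes in gene_sets.items()}

# Intersection size: number of genes shared by each pair of antibiotics.
pairwise = {
    (a, b): len(gene_sets[a] & gene_sets[b])
    for a, b in combinations(sorted(gene_sets), 2)
}

# Genes shared by every antibiotic in the collection.
shared_by_all = set.intersection(*gene_sets.values())
```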

**Fig. 9**
List of co-resistant genes that are shared across multiple antibiotics. Genes occurring across a higher number of antibiotics are positioned at the front of the list.
