iScience. 2024 Aug 2;27(9):110623.
doi: 10.1016/j.isci.2024.110623. eCollection 2024 Sep 20.

PanKA: Leveraging population pangenome to predict antibiotic resistance

Van Hoan Do et al. iScience.

Abstract

Machine learning has the potential to be a powerful tool in the fight against antimicrobial resistance (AMR), a critical global health issue. Machine learning can identify resistance mechanisms from DNA sequence data without prior knowledge. The first step in building a machine learning model is feature extraction from sequencing data. Traditional methods such as single nucleotide polymorphism (SNP) calling and k-mer counting yield numerous, often redundant features, complicating prediction and analysis. In this paper, we propose PanKA, a method that uses the pangenome to extract a concise set of relevant features for predicting AMR. PanKA not only enables fast model training and prediction but also improves accuracy. Applied to the bacterial species Escherichia coli and Klebsiella pneumoniae, our model is more accurate than conventional and state-of-the-art methods in predicting AMR.

Keywords: bacteriology; genomics; machine learning.

Conflict of interest statement

M.D.C., T.N., and H.A.N. are founders of AMROMICS JSC. H.S.N. is a consultant to AMROMICS JSC.

Figures

Graphical abstract
Figure 1
Overview of the PanKA algorithm. (Top: pangenome construction) A pangenome is constructed from a collection of genomic sequences of strains. Pan-genes are gene clusters that cover at least 65% of all strains, and AMR genes refer to the gene clusters with at least one AMR gene. (Middle: feature engineering) PanKA extracts the presence and absence matrix (PA matrix) from the pangenome, the amino acid variants (AVs) from the pan-genes, and k-mer profiles from the AMR gene clusters. (Bottom: feature selection and prediction) PanKA performs feature selection and prediction using the LightGBM model. The LightGBM model ranks the features according to their relevance in the prediction and extracts meaningful features from the data.
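The pangenome construction and PA matrix steps in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the PanKA implementation: the strain and cluster names are hypothetical toy data, and the 65% rule is read here as "the cluster is present in at least 65% of strains".

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy input: strain name -> gene-cluster IDs found in that strain.
STRAIN_CLUSTERS: Dict[str, Set[str]] = {
    "strain_A": {"c1", "c2", "c3"},
    "strain_B": {"c1", "c2"},
    "strain_C": {"c1", "c3", "c4"},
    "strain_D": {"c1", "c2", "c4"},
}


def presence_absence_matrix(
    strain_clusters: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (strains, clusters, matrix); matrix[i][j] is 1 if strain i has cluster j."""
    strains = sorted(strain_clusters)
    clusters = sorted(set().union(*strain_clusters.values()))
    matrix = [
        [1 if cluster in strain_clusters[strain] else 0 for cluster in clusters]
        for strain in strains
    ]
    return strains, clusters, matrix


def pan_genes(strain_clusters: Dict[str, Set[str]], min_fraction: float = 0.65) -> List[str]:
    """Clusters present in at least `min_fraction` of strains (assumed reading of the 65% rule)."""
    n_strains = len(strain_clusters)
    clusters = sorted(set().union(*strain_clusters.values()))
    return [
        cluster
        for cluster in clusters
        if sum(cluster in genes for genes in strain_clusters.values()) / n_strains >= min_fraction
    ]


strains, clusters, matrix = presence_absence_matrix(STRAIN_CLUSTERS)
for strain, row in zip(strains, matrix):
    print(strain, row)
print("pan-genes:", pan_genes(STRAIN_CLUSTERS))
```

Each row of the matrix is one strain's binary feature vector. In this toy data only c1 (4 of 4 strains) and c2 (3 of 4) pass the 65% threshold.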
Figure 2
Prediction performance on the E. coli dataset. The classification performance of resistance prediction on the E. coli datasets is shown as a bar plot of the F1-score for each method on a test set consisting of 20% of the samples. PanKA combines three feature sets: PanCore, AMR Kmer, and the PA matrix. KmerDNA refers to applying LightGBM to k-mer features extracted from the whole DNA sequence, while KmerProtein refers to applying LightGBM to k-mer features of protein-coding gene sequences. PanPred (default) is trained with gradient boosted decision trees (GBDT), and PanPred (LightGBM) is the PanPred model retrained using LightGBM.
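As a reminder of how the metric in Figure 2 is computed, here is a minimal sketch of a 20% hold-out split and a binary F1-score. The function names, the random seed, and the choice of "resistant = 1" as the positive class are assumptions for illustration; the source does not state how the split or the averaging was done.

```python
import random
from typing import List, Sequence, Tuple


def holdout_indices(n_samples: int, test_fraction: float = 0.2, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and reserve `test_fraction` of them as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


train_idx, test_idx = holdout_indices(10)
print(len(train_idx), len(test_idx))

# Toy labels: 1 = resistant, 0 = susceptible (hypothetical values).
print(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In the toy call there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is the F1-score.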
Figure 3
Prediction performance on the K. pneumoniae dataset. The classification performance of resistance prediction on the K. pneumoniae datasets is shown as a bar plot, as for the E. coli datasets. KmerDNA refers to applying LightGBM to k-mer features extracted from the whole DNA sequence, while KmerProtein refers to applying LightGBM to k-mer features of protein-coding gene sequences. PanPred (default) is trained with gradient boosted decision trees (GBDT), and PanPred (LightGBM) is the PanPred model retrained using LightGBM.
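The k-mer features used by the KmerDNA and KmerProtein baselines can be illustrated with a short counting sketch. The value of k, the toy sequences, and the helper names below are assumptions; the source does not give the k used in the paper.

```python
from collections import Counter
from typing import List, Tuple


def kmer_profile(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def profiles_to_matrix(profiles: List[Counter]) -> Tuple[List[str], List[List[int]]]:
    """Align per-sample k-mer counts on a shared, sorted vocabulary of k-mers."""
    vocabulary = sorted(set().union(*profiles))
    matrix = [[profile[kmer] for kmer in vocabulary] for profile in profiles]
    return vocabulary, matrix


# Hypothetical toy sequences with k = 3.
profiles = [kmer_profile("ATGATG", 3), kmer_profile("ATGCCC", 3)]
vocabulary, matrix = profiles_to_matrix(profiles)
print(vocabulary)
print(matrix)
```

Even two six-letter sequences produce six distinct 3-mers, which hints at why whole-genome k-mer counting yields the large, redundant feature sets the abstract mentions.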
Figure 4
Feature ranking on the E. coli dataset. Feature ranking (top 10) by “gain” score (top) and “split” score (bottom) for predicting antibiotic resistance on the E. coli dataset. The “gain” score represents the improvement in the objective function resulting from adding a split point to a tree node. Higher gain indicates that the feature provides more significant information for making predictions. The “split” score is defined as the number of times a feature is used to split the tree. A higher split score indicates that the feature is frequently used to make decisions in the model.
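The difference between the two scores can be seen in a small sketch that aggregates them from a list of tree splits. The split records and feature names are hypothetical, and the code does not call LightGBM itself; in LightGBM's Python API the two scores correspond to `importance_type="gain"` and `importance_type="split"`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical split records from a boosted model: (feature used, gain of that split).
SPLITS: List[Tuple[str, float]] = [
    ("AV_feature", 12.5),
    ("PA_feature", 4.0),
    ("AV_feature", 3.5),
    ("kmer_feature", 0.5),
    ("PA_feature", 1.0),
    ("PA_feature", 0.5),
]


def importance_scores(splits: List[Tuple[str, float]]) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Return (split score, gain score): split counts and summed gains per feature."""
    split_score: Dict[str, int] = defaultdict(int)
    gain_score: Dict[str, float] = defaultdict(float)
    for feature, gain in splits:
        split_score[feature] += 1
        gain_score[feature] += gain
    return dict(split_score), dict(gain_score)


def top_features(scores: Dict[str, float], n: int = 10) -> List[str]:
    """Rank features by descending score, breaking ties by name."""
    return sorted(scores, key=lambda feature: (-scores[feature], feature))[:n]


split_score, gain_score = importance_scores(SPLITS)
print("by gain: ", top_features(gain_score))
print("by split:", top_features(split_score))
```

In this toy data the two rankings disagree: one feature is used in only two splits but contributes the most gain, while another is used three times with small gains, which is why the figure reports both scores.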
Figure 5
Feature ranking on the K. pneumoniae dataset. Feature ranking (top 10) by “gain” score (top) and “split” score (bottom) for predicting antibiotic resistance on the K. pneumoniae dataset. The “gain” score represents the improvement in the objective function resulting from adding a split point to a tree node. Higher gain indicates that the feature provides more significant information for making predictions. The “split” score is defined as the number of times a feature is used to split the tree. A higher split score indicates that the feature is frequently used to make decisions in the model.

