Bioinformatics. 2018 Jul 1;34(13):i89-i95.
doi: 10.1093/bioinformatics/bty276.

A pan-genome-based machine learning approach for predicting antimicrobial resistance activities of the Escherichia coli strains

Hsuan-Lin Her et al. Bioinformatics. 2018.

Abstract

Motivation: Antimicrobial resistance (AMR) is a growing problem in both developed and developing countries, and identifying strains that are resistant or susceptible to particular antibiotics is essential to combating antibiotic-resistant pathogens. Whole-genome sequences have been collected for many microbial strains to identify the characteristics that allow certain strains to become resistant to antibiotics; however, a global inspection of the gene content responsible for AMR activities remains to be done.

Results: We propose a pan-genome-based approach to characterize antibiotic-resistant microbial strains and test this approach on the bacterial model organism Escherichia coli. By identifying core and accessory gene clusters and predicting AMR genes for the E. coli pan-genome, we showed that certain classes of genes are unevenly distributed between the core and accessory parts of the pan-genome, and that only a portion of the identified AMR genes belong to the accessory genome. When machine learning algorithms were applied to predict whether specific strains were resistant to antibiotic drugs, the set of AMR genes within the accessory part of the pan-genome yielded the best prediction accuracy, suggesting that these gene clusters are the most crucial to AMR activities in E. coli. Subsets of AMR genes selected for each antibiotic drug by a genetic algorithm (GA) achieved better prediction performance than the gene sets established in the literature, suggesting that the GA-selected gene sets warrant further analysis of how E. coli resists antibiotics.
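The core/accessory split and the presence/absence encoding described above can be illustrated with a minimal sketch. It assumes a strict definition of the core genome (clusters present in every strain), which the abstract does not spell out; the strain names, cluster IDs and the function name `split_pan_genome` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Set, Tuple


def split_pan_genome(presence: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split gene clusters into core and accessory sets.

    `presence` maps a strain name to the set of gene cluster IDs found in it.
    Assumption: "core" means present in every strain (strict definition).
    """
    genomes = list(presence.values())
    if not genomes:
        return set(), set()
    pan = set().union(*genomes)          # every cluster seen in any strain
    core = set.intersection(*genomes)    # clusters shared by all strains
    return core, pan - core


# Toy data (invented for illustration only).
toy_presence = {
    "strain_A": {"c1", "c2", "c3"},
    "strain_B": {"c1", "c2", "c4"},
    "strain_C": {"c1", "c3", "c4"},
}

toy_core, toy_accessory = split_pan_genome(toy_presence)

# Binary presence/absence feature vectors over the accessory clusters,
# the kind of input a classifier could be trained on.
feature_order: List[str] = sorted(toy_accessory)
toy_features = {
    strain: [int(cluster in clusters) for cluster in feature_order]
    for strain, clusters in toy_presence.items()
}
```

Core clusters carry no signal in this encoding because they are present in every strain, which is consistent with the abstract's focus on the accessory genome.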

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
Growth rates of the pan-genome sizes, core gene cluster and accessory gene cluster numbers with the increasing number of E. coli genomes. The blue, orange and green lines, respectively, represent core-, accessory- and pan-genome sizes
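A curve of this kind could be produced by adding genomes one at a time and recording the three sizes after each addition. The sketch below is one way to do that, assuming a strict core definition and a single fixed genome order; whether the authors averaged over several orderings is not stated here. The function name `growth_curves` and the toy genomes are hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple


def growth_curves(genomes: Iterable[Set[str]]) -> List[Tuple[int, int, int, int]]:
    """Return (n_genomes, core_size, accessory_size, pan_size) per added genome.

    Assumption: core = clusters present in all genomes seen so far.
    The result depends on the order in which genomes are supplied.
    """
    pan: Set[str] = set()
    core: Optional[Set[str]] = None
    rows = []
    for n, clusters in enumerate(genomes, start=1):
        pan |= clusters                                      # pan-genome only grows
        core = set(clusters) if core is None else core & clusters  # core only shrinks
        rows.append((n, len(core), len(pan) - len(core), len(pan)))
    return rows


# Toy genomes (invented for illustration only).
toy_genomes = [
    {"c1", "c2", "c3"},
    {"c1", "c2", "c4"},
    {"c1", "c3", "c4"},
]
toy_curve = growth_curves(toy_genomes)
```

Each tuple corresponds to one x-axis position in the figure, with the core, accessory and pan-genome sizes as the three plotted series.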
Fig. 2.
Differences in the COGs functional distributions between the core- and accessory-genomes. COG percentages were estimated by dividing COG numbers by the total gene cluster numbers in either the core- or accessory-genome. Only COGs differing by at least 2-fold between the core and accessory parts were included
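The percentage and 2-fold filter described in this caption can be written out as a short sketch. The counts below are invented toy numbers, the function name `cog_fold_filter` is hypothetical, and the handling of a category that is absent from one side (treated here as passing the filter) is an assumption rather than something the caption specifies.

```python
from typing import Dict, Tuple


def cog_fold_filter(
    core_counts: Dict[str, int],
    acc_counts: Dict[str, int],
    n_core_clusters: int,
    n_acc_clusters: int,
    min_fold: float = 2.0,
) -> Dict[str, Tuple[float, float]]:
    """Keep COG categories whose percentages differ by at least `min_fold`.

    Percentage = COG count / total gene clusters in that part of the pan-genome.
    Returns {cog: (core_percent, accessory_percent)}.
    """
    kept = {}
    for cog in sorted(set(core_counts) | set(acc_counts)):
        p_core = 100.0 * core_counts.get(cog, 0) / n_core_clusters
        p_acc = 100.0 * acc_counts.get(cog, 0) / n_acc_clusters
        high, low = max(p_core, p_acc), min(p_core, p_acc)
        if high == 0:
            continue  # category absent from both parts
        # Assumption: a category missing from one part counts as >= min_fold.
        if low == 0 or high / low >= min_fold:
            kept[cog] = (p_core, p_acc)
    return kept


# Toy counts (invented): COG category letter -> number of gene clusters.
toy_core_counts = {"J": 40, "E": 30, "L": 10}
toy_acc_counts = {"J": 10, "E": 25, "L": 45}
toy_cogs = cog_fold_filter(toy_core_counts, toy_acc_counts, 200, 200)
```

In this toy input, category E (15% versus 12.5%) falls below the 2-fold threshold and is dropped, while J and L are kept.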
Fig. 3.
Prediction accuracies of the AMR activities [in terms of the area under the receiver operating characteristic (ROC) curve (AUC)] based on the presence/absence patterns of (i) all core and accessory gene clusters (core + acc); (ii) all accessory gene clusters (acc); (iii) accessory gene clusters with CARD annotations (acc/card) and (iv) all CARD gene clusters. The boxplots indicate the distribution of the predictive accuracy for the 12 selected drugs (Section 2 and Section 3). The four blocks of boxplots represent the four machine learning algorithms used in the prediction process: AdaBoost, naive Bayes (NB), random forest (RF) and support vector machine (SVM). The dashed red line indicates an AUC of 0.9.
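For readers unfamiliar with the metric, the AUC equals the probability that a randomly chosen resistant strain receives a higher score than a randomly chosen susceptible one, with ties counted as one half. The sketch below computes it directly from that definition; it is not the authors' evaluation code, and the labels, scores and function name `roc_auc` are invented for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive and negative scores.

    labels: 1 = resistant (positive), 0 = susceptible (negative).
    scores: classifier scores, higher meaning "more likely resistant".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy predictions (invented for illustration only).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
toy_auc = roc_auc(toy_labels, toy_scores)
meets_threshold = toy_auc >= 0.9  # compare against the 0.9 reference line
```

The pairwise loop is quadratic in the number of strains, which is fine for illustration; a rank-based formulation gives the same value more efficiently on large datasets.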
Fig. 4.
SVM prediction accuracies of the antimicrobial resistance (AMR) activities [in terms of the area under the receiver operating characteristic curve (AUC)] based on (i) 68 accessory genes with CARD annotations (68 acc/card genes); (ii) gene clusters selected for each drug by the genetic algorithm (GA-selected clusters); (iii) gene clusters identified by Scoary and (iv) gene clusters with CARD annotations identified by Scoary (Scoary with CARD). The boxplots indicate the distribution of the prediction accuracies for the 12 selected drugs. The dashed red line indicates an AUC of 0.9.
