Microbiol Spectr. 2023 Jun 15;11(3):e0464522.
doi: 10.1128/spectrum.04645-22. Epub 2023 May 16.

Plasmer: an Accurate and Sensitive Bacterial Plasmid Prediction Tool Based on Machine Learning of Shared k-mers and Genomic Features

Qianhui Zhu et al. Microbiol Spectr.

Abstract

Identifying plasmids in bacterial genomes is critical for studying horizontal gene transfer, antibiotic resistance genes, host-microbe interactions, cloning vectors, and industrial production. Several in silico methods exist to predict plasmid sequences in assembled genomes. However, existing methods have evident shortcomings, such as an imbalance between sensitivity and specificity, dependency on species-specific models, and reduced performance on sequences shorter than 10 kb, which limit their scope of applicability. In this work, we proposed Plasmer, a novel plasmid predictor based on machine learning of shared k-mers and genomic features. Unlike existing k-mer-based or genomic-feature-based methods, Plasmer employs the random forest algorithm to make predictions using the percentage of k-mers shared with plasmid and chromosome databases, combined with other genomic features, including alignment E value and replicon distribution scores (RDS). Plasmer can predict on multiple species and achieved an average area under the curve (AUC) of 0.996 with an accuracy of 98.4%. Compared to existing methods, tests on sliding sequences, simulated assemblies, and de novo assemblies consistently showed that Plasmer has higher accuracy and stable performance across long and short contigs above 500 bp, demonstrating its applicability to fragmented assemblies. Plasmer also has excellent and balanced sensitivity and specificity (both >0.95 above 500 bp) with the highest F1-score, which eliminates the bias toward sensitivity or specificity that is common in existing methods. Plasmer also provides taxonomy classification to help identify the origin of plasmids.

IMPORTANCE In this study, we proposed a novel plasmid prediction tool named Plasmer. Technically, unlike existing k-mer-based or genomic-feature-based methods, Plasmer is the first tool to combine the advantages of the percentage of shared k-mers and the alignment score of genomic features. This has given Plasmer (i) an evident improvement in performance compared to other methods, with the best F1-score and accuracy on sliding sequences, simulated contigs, and de novo assemblies; (ii) applicability to contigs above 500 bp with the highest accuracy, enabling plasmid prediction in fragmented short-read assemblies; (iii) excellent and balanced sensitivity and specificity (both >0.95 above 500 bp) with the highest F1-score, which eliminates the bias toward sensitivity or specificity that commonly exists in other methods; and (iv) no dependency on species-specific training models. We believe that Plasmer provides a more reliable alternative for plasmid prediction in bacterial genome assemblies.
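
The shared k-mer feature described in the abstract can be illustrated with a minimal sketch. It computes, for one contig, the percentage of its distinct k-mers that also occur in a plasmid k-mer set and in a chromosome k-mer set. Everything here is an assumption made for illustration: the function names, the toy sequences, the value of k, and the use of forward-strand k-mers only are not taken from Plasmer, whose actual k-mer size and counting tools are not stated in the abstract. In the tool described, such percentages would be combined with other features (alignment E value, RDS) and passed to a random forest classifier, which is not shown.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq (forward strand only; an assumption)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def percent_shared(query, database_kmers, k):
    """Percentage of the query's distinct k-mers that also occur in database_kmers."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        # Query shorter than k: no k-mers, so nothing can be shared.
        return 0.0
    return 100.0 * len(query_kmers & database_kmers) / len(query_kmers)


# Toy data, hypothetical and far smaller than any real database.
k = 4
plasmid_db = kmers("ATGCGTACGTTAGC", k)
chromosome_db = kmers("GGGCCCAAATTTGGGCCC", k)
contig = "ATGCGTACGA"

# Two-element feature vector: (% shared with plasmids, % shared with chromosomes).
features = (
    percent_shared(contig, plasmid_db, k),
    percent_shared(contig, chromosome_db, k),
)
print(features)
```

For this toy contig, six of its seven distinct 4-mers occur in the plasmid set and none occur in the chromosome set, so the first feature is high and the second is zero.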

Keywords: bacteria; benchmark; chromosome; genomic features; k-mer; machine learning; plasmid; prediction tool; random forest; shared k-mers.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

FIG 1
The performance of the classification model based on random forest analysis. (A) The estimation of the number of trees in the model. The x axis represents the number of trees in the model; the y axis represents the error rate of the model. OOB, out of bag. (B) The importance of all features. The x axis represents all the features; the y axis represents the importance. Green means the feature is important for classifying. (C) The performance indicators of the model during 100 random time points. Different colors represent different indicators. (D) The ROC curve of the model. The x axis represents 1 − specificity; the y axis represents the sensitivity. The area under the curve (AUC) is 0.996.
FIG 2
The performance of Plasmer and other methods on the remaining sliding sequences of different lengths from 31,897 complete bacterial genomes in NCBI. Each subpanel represents one indicator, such as sensitivity. The x axis represents the lengths of the sequences; the y axis represents the values of the indicators. Different colors represent different indicators. (A) The performance on all remaining sliding sequences. (B) The performance on resampled sliding sequences with balanced numbers of chromosomes and plasmids.
FIG 3
The performance of Plasmer and other methods for contigs assembled by simulated reads of Klebsiella pneumoniae, Escherichia coli, and Enterococcus faecium. Each subpanel represents one indicator, such as sensitivity. The x axis represents the lengths of the sequences; the y axis represents the values of indicators. Different colors represent different indicators. (A) The 552 complete genomes of Klebsiella pneumoniae. (B) The 1,718 complete genomes of Escherichia coli. (C) The 123 complete genomes of Enterococcus faecium.
FIG 4
The performance of Plasmer and other methods for contigs assembled by real sequencing reads. Each subpanel represents one indicator, such as sensitivity. The x axis represents the lengths of the sequences; the y axis represents the values of the indicators. Different colors represent different indicators. (A) The sequencing reads of 41 isolates used in the mlplasmids (21) publication. (B) The sequencing reads of 40 isolates used in the PlasFlow (20) publication. (C) The sequencing reads of 21 new isolates used in the Platon (18) publication. (D) The performance of Plasmer and other methods on the assembled contigs of 535 recently published genomes of multiple species.
FIG 5
The workflow of Plasmer. The nc and np boxes represent the chromosomes and plasmids, respectively, of 31,897 complete bacterial genomes in NCBI. The r, p, pmr, and rmp boxes represent the k-mer database from chromosomes of representative genomes, the k-mer database from plasmids of PLSDB, the plasmid-unique k-mer database (plasmid database minus k-mers from chromosomes of representative genomes), and the chromosome-unique k-mer database (representative chromosome database minus k-mers from PLSDB), respectively.
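
The four k-mer databases named in the FIG 5 caption are related by set difference, which can be sketched as follows. This is an illustrative sketch only: the function names, the toy sequences, and the value of k are hypothetical, and in-memory Python sets stand in for whatever k-mer database format Plasmer actually uses.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq (forward strand only; an assumption)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_databases(chromosome_seqs, plasmid_seqs, k):
    """Build the r, p, pmr, and rmp k-mer sets named in the FIG 5 caption."""
    r = set().union(*(kmers(s, k) for s in chromosome_seqs))  # chromosome k-mers
    p = set().union(*(kmers(s, k) for s in plasmid_seqs))     # plasmid k-mers
    return {
        "r": r,
        "p": p,
        "pmr": p - r,  # plasmid-unique: plasmid k-mers minus chromosome k-mers
        "rmp": r - p,  # chromosome-unique: chromosome k-mers minus plasmid k-mers
    }


# Toy sequences, hypothetical stand-ins for representative genomes and PLSDB.
dbs = build_databases(["AAACCCGGG"], ["CCCGGGTTT"], k=3)
print(sorted(dbs["pmr"]))
print(sorted(dbs["rmp"]))
```

In this toy case the two inputs share the k-mers CCC, CCG, CGG, and GGG, so those are excluded from both unique sets.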

References

    1. Clewell DB, Weaver KE, Dunny GM, Coque TM, Francia MV, Hayes F. 2014. Extrachromosomal and mobile elements in enterococci: transmission, maintenance, and epidemiology. In Gilmore MS, Clewell DB, Ike Y, Shankar N (ed), Enterococci: from commensals to leading causes of drug resistant infection. Massachusetts Eye and Ear Infirmary, Boston, MA.
    2. Tazzyman SJ, Bonhoeffer S. 2015. Why there are no essential genes on plasmids. Mol Biol Evol 32:3079–3088. doi:10.1093/molbev/msu293.
    3. Thomas CM, Nielsen KM. 2005. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat Rev Microbiol 3:711–721. doi:10.1038/nrmicro1234.
    4. Smillie C, Garcillan-Barcia MP, Francia MV, Rocha EP, de la Cruz F. 2010. Mobility of plasmids. Microbiol Mol Biol Rev 74:434–452. doi:10.1128/MMBR.00020-10.
    5. Smalla K, Jechalke S, Top EM. 2015. Plasmid detection, characterization, and ecology. Microbiol Spectr 3:PLAS-0038-2014. doi:10.1128/microbiolspec.PLAS-0038-2014.
