2021 Jun 26;22(1):349.
doi: 10.1186/s12859-021-04270-w.

PlasForest: a homology-based random forest classifier for plasmid detection in genomic datasets

Léa Pradier et al. BMC Bioinformatics.

Abstract

Background: Plasmids are mobile genetic elements that often carry accessory genes and act as vectors for horizontal transfer between bacterial genomes. Detecting plasmids in large genomic datasets is crucial for analyzing their spread and quantifying their role in bacterial adaptation, particularly in the propagation of antibiotic resistance. Bioinformatics methods have been developed to detect plasmids. However, they suffer from low sensitivity (i.e., most plasmids remain undetected) or low precision (i.e., these methods identify chromosomes as plasmids), and are generally not suited to identifying plasmids in whole genomes that are not fully assembled (contigs and scaffolds).

Results: We developed PlasForest, a homology-based random forest classifier that identifies bacterial plasmid sequences in partially assembled genomes. Without knowing the taxonomic origin of the samples, PlasForest classifies contigs as plasmids or chromosomes with an F1 score of 0.950. Notably, it detects 77.4% of plasmid contigs below 1 kb with 2.8% false positives, and 99.9% of plasmid contigs over 50 kb with 2.2% false positives.

Conclusions: PlasForest outperforms other currently available tools on genomic datasets by being both sensitive and precise. The performance of PlasForest on metagenomic assemblies is currently well below that of k-mer-based methods, and we discuss how homology-based approaches could improve plasmid detection in such datasets.

Keywords: Genomic datasets; Homology; Plasmid identification; Random forest classifier.
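
The Results paragraph summarizes classification quality with an F1 score, which combines the precision and sensitivity (recall) discussed in the Background. As a minimal sketch of how these quantities relate, the following computes all three from confusion-matrix counts, treating "plasmid" as the positive class. The function name and the counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class (here: plasmid).

    tp: plasmid contigs predicted as plasmid
    fp: chromosome contigs predicted as plasmid
    fn: plasmid contigs predicted as chromosome
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# prints: precision=0.900 recall=0.750 F1=0.818
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are high, which is why it suits a comparison where some tools are sensitive but imprecise and others the reverse.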

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
General method of classification implemented in PlasForest
Fig. 2
Chosen features and their importance in the classification process. A Schematic representation of the features extracted from contigs, including homology-based features (number of hits, maximum overlap, average overlap, median overlap, variance of overlaps, contig size) and a sequence-based feature (G + C content). B Impurity-based feature importance computed with the scikit-learn library for the seven features kept in the classifier
Fig. 3
Sensitivity of PlasForest to resampling. A Performance after 50 resamplings of the balanced training set. B Performance after 50 resamplings of the plasmid database. The initial performance of PlasForest on the testing set is displayed with red dots. The distribution of performance values obtained when resampling either the plasmid database or the balanced training set 50 times is displayed in grey boxes
Fig. 4
Performance of PlasForest compared with 4 other plasmid identification methods on the testing set
Fig. 5
Agreement of plasmid identification for PlasForest and 4 other plasmid identification methods on the CONTIG and METAGENOME datasets. A Number of contigs identified as plasmids in the CONTIG dataset. B Number of contigs identified as plasmids in the METAGENOME dataset. The CONTIG dataset gathers 151,634 contigs collected from 1328 partially assembled genomes. The METAGENOME dataset gathers 143,663 contigs collected from 1000 partially assembled genomes drawn from metagenomic datasets
Fig. 6
Datasets and application of a hold-out method for supervised learning. Schematic representation of the processes used to generate the datasets for building PlasForest and benchmarking its performance. A 10,152 bacterial genomes from the NCBI Refseq Genomes FTP server were randomly cut into contigs and distributed into the following datasets: the (balanced) training set contains 70% of the initial 10,152 genome assemblies and is used to train the random forest classifier; the testing set contains the remaining 30% of the genomes. B Other genome assemblies were drawn from more recent releases of NCBI Refseq Genomes FTP or from other sources to build the COMGENOME, CONTIG, and METAGENOME datasets. Together with the testing set, they are used to benchmark the performance of PlasForest against other plasmid identification methods
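
The captions of Fig. 2 and Fig. 6 describe two concrete steps: turning each contig into seven numeric features, and splitting genomes 70%/30% into training and testing sets. The sketch below illustrates both using only the Python standard library. It is not the PlasForest implementation: the function names, the dictionary keys, the use of population variance, and the representation of homology hits as a plain list of overlap values are all assumptions made for illustration.

```python
import random
import statistics
from typing import Dict, Iterable, List, Sequence, Tuple


def gc_content(sequence: str) -> float:
    """Fraction of G and C among A/C/G/T bases; other symbols are ignored."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def contig_features(sequence: str, overlaps: Sequence[float]) -> Dict[str, float]:
    """Build the seven features named in the Fig. 2 caption for one contig.

    `overlaps` holds one value per homology hit. How a hit's overlap is
    measured is left to the caller (assumption, not taken from the paper).
    """
    has_hits = len(overlaps) > 0
    return {
        "n_hits": float(len(overlaps)),
        "max_overlap": float(max(overlaps)) if has_hits else 0.0,
        "mean_overlap": statistics.fmean(overlaps) if has_hits else 0.0,
        "median_overlap": float(statistics.median(overlaps)) if has_hits else 0.0,
        # Population variance is an assumption; it is defined for a single hit.
        "var_overlap": float(statistics.pvariance(overlaps)) if has_hits else 0.0,
        "contig_size": float(len(sequence)),
        "gc_content": gc_content(sequence),
    }


def hold_out_split(
    genome_ids: Iterable, train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List, List]:
    """Shuffle genome identifiers and split them into (train, test).

    Splitting by genome keeps all contigs of one genome on the same side.
    """
    ids = list(genome_ids)
    random.Random(seed).shuffle(ids)
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


# Toy inputs for illustration only.
features = contig_features("ATGCGC", [0.5, 1.0])
train_ids, test_ids = hold_out_split(range(10))
print(features["mean_overlap"], features["var_overlap"], len(train_ids), len(test_ids))
# prints: 0.75 0.0625 7 3
```

Feature rows built this way could be passed to scikit-learn's `RandomForestClassifier`, whose `feature_importances_` attribute reports the impurity-based importance mentioned in the Fig. 2 caption. Balancing the training set, as the Fig. 6 caption indicates, is not shown here.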
