Front Microbiol. 2023 Oct 6:14:1267695.
doi: 10.3389/fmicb.2023.1267695. eCollection 2023.

plASgraph2: using graph neural networks to detect plasmid contigs from an assembly graph

Janik Sielemann et al. Front Microbiol. 2023.

Abstract

Identification of plasmids from sequencing data is an important and challenging problem related to the spread of antimicrobial resistance and other One Health issues. We provide a new architecture for identifying plasmid contigs in fragmented genome assemblies built from short-read data. We employ graph neural networks (GNNs) and the assembly graph to propagate information from neighboring nodes, which leads to more accurate classification, especially for short contigs that are difficult to classify based on sequence features or database searches alone. We trained plASgraph2 on a data set of samples from the ESKAPEE group of pathogens. plASgraph2 either outperforms or performs on par with a wide range of state-of-the-art methods on testing sets of independent ESKAPEE samples and samples from related pathogens. On one hand, our study provides a new, accurate, and easy-to-use tool for contig classification in bacterial isolates; on the other hand, it serves as a proof of concept for the use of GNNs in genomics. Our software is available at https://github.com/cchauve/plasgraph2 and the training and testing data sets are available at https://github.com/fmfi-compbio/plasgraph2-datasets.

Keywords: assembly graph; bioinformatics; classification; machine learning (ML); plasmids.
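
The idea of propagating information between neighboring contigs in the assembly graph can be illustrated with a small sketch. The code below is not the plASgraph2 implementation: it assumes a generic graph convolution with symmetric normalization and self-loops, uses random untrained weights, and borrows only the sizes stated in the Figure 1 caption (six input features per contig, six graph convolutional layers, two outputs per contig). All function names, the hidden width of 8, and the toy graph are hypothetical.

```python
import numpy as np

def normalized_adjacency(edges, n_nodes):
    """Symmetrically normalized adjacency matrix with self-loops (assumed scheme)."""
    adj = np.eye(n_nodes)                      # self-loops keep each contig's own features
    for u, v in edges:
        adj[u, v] = adj[v, u] = 1.0            # assembly graph treated as undirected
    inv_sqrt_deg = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(features, edges, layer_weights, output_weights):
    """Run the graph convolutional layers, then two independent sigmoid outputs per node."""
    a_hat = normalized_adjacency(edges, features.shape[0])
    hidden = features
    for w in layer_weights:
        # Each layer mixes a node's representation with those of its neighbors.
        hidden = np.maximum(a_hat @ hidden @ w, 0.0)   # ReLU activation (assumption)
    # Column 0: plasmid score, column 1: chromosome score (two separate binary tasks).
    return sigmoid(hidden @ output_weights)

# Toy assembly graph: five contigs joined in a chain (hypothetical data).
rng = np.random.default_rng(0)
n_contigs, n_features, hidden_dim, n_layers = 5, 6, 8, 6
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
features = rng.normal(size=(n_contigs, n_features))

layer_weights = [
    rng.normal(scale=0.5, size=(n_features if i == 0 else hidden_dim, hidden_dim))
    for i in range(n_layers)
]
output_weights = rng.normal(scale=0.5, size=(hidden_dim, 2))

scores = forward(features, edges, layer_weights, output_weights)
print(scores.shape)  # one (plasmid, chromosome) score pair per contig
```

Because the two outputs are independent sigmoids rather than a softmax, a contig can in principle score high on both, which is one plausible way to represent ambiguous contigs; whether plASgraph2 handles them this way should be checked against the software itself.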

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Model architecture of plASgraph2. The model takes as input the assembly graph structure and six features per node (contig). The core of the network is composed of six graph convolutional layers. The model generates two outputs per node, which facilitate the classification of plasmids and chromosomes as two separate classification tasks.
Figure 2
Receiver operating characteristic (ROC) curves for all contigs in the ESKAPEE test set, considering isolates with at most 100, 200, 300, or 10,000 contigs. ROC curves are not calculated for the Platon and PlasForest tools, as those tools do not provide confidence scores as output. In total, the ESKAPEE test set consists of 224 samples; almost half of those short-read assemblies contain 100 or fewer contigs.
Figure 3
Comparison of F1-scores on samples of evolutionarily close non-ESKAPEE species, considering all contigs longer than 100 bp. Each data point represents the F1-score of a single isolate. The median is shown as a horizontal line.
Figure 4
Contig classification in the context of the assembly graph of C. freundii isolate SAMN15148288. Chromosomal contigs are colored in blue and ambiguous contigs are colored in black. (Left) The ground truth, including two different plasmids (green and red). (Middle) plASgraph2 predictions. (Right) PlasForest predictions. Note that the classification tasks do not include binning of plasmid contigs; thus all predicted plasmid contigs are colored in green. The assembly graph extends to the upper left as a loop of chromosomal contigs alternating with unlabeled SNPs, which is not shown.
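
The per-isolate evaluation described in the Figure 3 caption (one F1-score per isolate over contigs longer than 100 bp, summarized by the median) can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation script: the isolate names, contig lengths, and labels are made up, and the code treats plasmid as the positive class.

```python
from statistics import median

MIN_CONTIG_LENGTH = 100  # contigs must be longer than this many bp to be evaluated

def f1_score(true_labels, predicted_labels):
    """F1-score for the positive (plasmid) class; returns 0.0 when there are no true positives."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def isolate_f1(contigs):
    """contigs: list of (length_bp, is_plasmid_truth, is_plasmid_prediction) tuples."""
    kept = [(t, p) for length, t, p in contigs if length > MIN_CONTIG_LENGTH]
    return f1_score([t for t, _ in kept], [p for _, p in kept])

# Hypothetical isolates; each tuple is (length in bp, ground truth, prediction).
isolates = {
    "isolate_a": [(5000, 1, 1), (250, 1, 0), (80, 0, 1), (12000, 0, 0), (900, 1, 1)],
    "isolate_b": [(300, 0, 0), (1500, 1, 1), (450, 0, 1), (2000, 1, 1)],
    "isolate_c": [(700, 1, 1), (60, 1, 0), (40000, 0, 0)],
}

per_isolate = {name: isolate_f1(contigs) for name, contigs in isolates.items()}
median_f1 = median(per_isolate.values())
print(per_isolate, median_f1)
```

Reporting one score per isolate, rather than pooling all contigs, keeps large assemblies from dominating the comparison; the median then gives the horizontal line of a plot like Figure 3.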
