Front Microbiol. 2021 Feb 24:12:638561.
doi: 10.3389/fmicb.2021.638561. eCollection 2021.

Binnacle: Using Scaffolds to Improve the Contiguity and Quality of Metagenomic Bins

Harihara Subrahmaniam Muralidharan et al. Front Microbiol.

Abstract

High-throughput sequencing has revolutionized the field of microbiology; however, reconstructing complete genomes of organisms from whole metagenomic shotgun sequencing data remains a challenge. Recovered genomes are often highly fragmented due to uneven abundances of organisms, repeats within and across genomes, sequencing errors, and strain-level variation. To address the fragmented nature of metagenomic assemblies, scientists rely on a process called binning, which clusters together contigs inferred to originate from the same organism. Existing binning algorithms use oligonucleotide frequencies and contig abundance (coverage) within and across samples to group together contigs from the same organism. However, these algorithms often miss short contigs and contigs from regions with unusual coverage or DNA composition characteristics, such as mobile elements. Here, we propose that information from assembly graphs can assist current strategies for metagenomic binning. We use MetaCarvel, a metagenomic scaffolding tool, to construct assembly graphs in which contigs are nodes and edges are inferred from paired-end reads. We developed a tool, Binnacle, that extracts information from the assembly graphs and clusters scaffolds into comprehensive bins. Binnacle also provides wrapper scripts to integrate with existing binning methods. The Binnacle pipeline can be found on GitHub (https://github.com/marbl/binnacle). We show that binning graph-based scaffolds, rather than contigs, improves the contiguity and quality of the resulting bins and captures a broader set of the genes of the organisms being reconstructed.

Keywords: binning approach; genome scaffolding; metagenome assembly; metagenomics; strain variation.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Schematic diagram of the Binnacle pipeline. Short reads are assembled into contigs with a metagenome assembly tool. These contigs are oriented and ordered to generate graph scaffolds. For each scaffold, based on the length, orientation, and gap estimates, each contig in a scaffold is assigned global start and end coordinates; and the span of the scaffold is computed. Scaffold coverage is the per-base depth of coverage across the scaffold span. In the mis-assembly detection and correction routine, scaffolds are broken up if there are discontinuities in coverage signals. The final set of scaffolds and corresponding coverage information are used as input to binning methods to generate metagenomic bins.
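The scaffold coverage step described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes each contig already has global start and end coordinates (half-open, in scaffold space) and a per-base depth list. The function name `scaffold_coverage`, its arguments, and the choice to sum depths where contigs overlap are assumptions made for illustration, not Binnacle's actual interface.

```python
def scaffold_coverage(coords, contig_depth, reversed_contigs=()):
    """Project per-contig depth onto the scaffold span.

    coords: {contig_name: (start, end)} half-open global coordinates.
    contig_depth: {contig_name: [depth per base]} in contig orientation.
    reversed_contigs: names of contigs placed in reverse orientation.
    Returns (span_start, span_end, per-base depth across the span).
    """
    span_start = min(start for start, _ in coords.values())
    span_end = max(end for _, end in coords.values())
    coverage = [0.0] * (span_end - span_start)

    for name, (start, end) in coords.items():
        depth = contig_depth[name]
        if len(depth) != end - start:
            raise ValueError(f"depth length mismatch for contig {name}")
        if name in reversed_contigs:
            depth = depth[::-1]  # flip so depth follows scaffold direction
        for offset, value in enumerate(depth):
            # Assumption: overlapping contigs contribute additively;
            # positions inside gaps keep a depth of zero.
            coverage[start - span_start + offset] += value

    return span_start, span_end, coverage


coords = {"A": (0, 3), "B": (2, 5)}
depths = {"A": [1, 1, 1], "B": [1, 2, 3]}
print(scaffold_coverage(coords, depths, reversed_contigs={"B"}))
```

In this toy input the two contigs overlap at position 2, so their depths are added there.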
FIGURE 2
Assigning start and end coordinates to contigs in a scaffold. The lowest start coordinate and the highest end coordinate determine the scaffold span.
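The coordinate assignment can be sketched for the simplest case, a linear chain of contigs. This is a simplified illustration under stated assumptions: graph scaffolds can contain branches and bubbles, which this sketch does not handle, and the `Contig` class, the `gap_before` field, and `assign_coordinates` are hypothetical names rather than part of Binnacle.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int
    gap_before: int = 0  # estimated gap to the previous contig; negative = overlap


def assign_coordinates(contigs):
    """Assign half-open (start, end) coordinates along a linear chain.

    Returns the coordinate map and the scaffold span, where the span runs
    from the lowest start coordinate to the highest end coordinate.
    """
    coords = {}
    cursor = 0  # end coordinate of the previously placed contig
    for index, contig in enumerate(contigs):
        start = cursor + (contig.gap_before if index > 0 else 0)
        end = start + contig.length
        coords[contig.name] = (start, end)
        cursor = end

    span_start = min(start for start, _ in coords.values())
    span_end = max(end for _, end in coords.values())
    return coords, (span_start, span_end)


chain = [Contig("A", 100), Contig("B", 50, gap_before=10), Contig("C", 20, gap_before=-5)]
coords, span = assign_coordinates(chain)
print(coords, span)
```

A negative gap estimate places a contig so that it overlaps its predecessor, which is why the span is taken from the minimum start and maximum end rather than from the first and last contig.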
FIGURE 3
Pseudocode describing the change-point detection algorithm. The algorithm takes in two parameters α and β denoting the threshold for identifying outliers and the cutoff parameter to delink contigs, respectively.
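The pseudocode itself is not reproduced on this page, so the following is only one plausible reading of a two-parameter change-point routine, not the authors' algorithm. It assumes α is a percentile cutoff on a sliding-window jump statistic and β is the maximum distance (in bp) between a change point and a contig boundary for the scaffold to be broken there. All function names, the window size, and both interpretations are assumptions made for illustration.

```python
import math
from statistics import mean


def jump_scores(cov, window):
    """Absolute difference between mean depth right and left of each position."""
    scores = [0.0] * len(cov)
    for i in range(window, len(cov) - window + 1):
        scores[i] = abs(mean(cov[i:i + window]) - mean(cov[i - window:i]))
    return scores


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def detect_change_points(cov, window=5, alpha=90.0):
    """Return positions whose jump score is an outlier above the alpha percentile.

    Consecutive outlier positions are merged into one change point,
    placed at the position with the largest score in that run.
    """
    if not cov:
        return []
    scores = jump_scores(cov, window)
    cutoff = percentile(scores, alpha)
    points, run = [], []
    for i, score in enumerate(scores):
        if score > cutoff:
            run.append(i)
        elif run:
            points.append(max(run, key=scores.__getitem__))
            run = []
    if run:
        points.append(max(run, key=scores.__getitem__))
    return points


def boundaries_to_delink(points, contig_ends, beta=10):
    """Contig boundaries lying within beta bp of any detected change point."""
    return sorted({end for end in contig_ends
                   for point in points if abs(end - point) <= beta})


coverage = [10] * 20 + [50] * 20
points = detect_change_points(coverage, window=5, alpha=90.0)
print(points, boundaries_to_delink(points, [12, 21, 33], beta=2))
```

Restricting breaks to contig boundaries near a change point reflects the caption's description of β as a cutoff for delinking contigs; a real implementation may define both parameters differently.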
FIGURE 4
The mis-assembly detection algorithm in Binnacle. This is a scaffold from HMP sample SRS012902. The plot on the top shows the position of contigs along the scaffold span. The plot at the bottom shows the per-base depth of coverage across the scaffold span. The locations detected by the change point detection algorithm are highlighted by vertical red lines.
FIGURE 5
An example scaffold with coverage estimated with Binnacle. The plot at the top shows the position of contigs along the scaffold span. Contigs within the red dotted box are part of a bubble (a signature of strain variation) detected by MetaCarvel. Only three contigs (highlighted in blue) were binned by MetaBAT2 when contigs rather than scaffolds were provided as input. The plot at the bottom shows the cumulative per-base depth of coverage across the scaffold span as estimated by Binnacle.
FIGURE 6
Binning with graph scaffolds improves contiguity, completeness, and contamination in genome bins from the simulated dataset. Comparing bins generated by MetaBAT2 (solid lines) (1), COCACOLA (dotted lines) (2), and MaxBin 2.0 (dashed-dotted lines) (3) using contigs (yellow), linear scaffolds (black), and graph scaffolds (blue) for the simulated dataset. COCACOLA contigs were binned both with and without paired end information. (A) Cumulative base pairs binned with contigs, linear scaffolds, and graph scaffolds. Bins are ordered in decreasing order of their size. The upper curve corresponds to higher contiguity for the same number of bins. (B) Completeness is defined as the percentage of the assigned genome represented in the bin. Bins are ordered in decreasing order of their completeness value. The upper curve indicates that more base pairs are binned by graph scaffolds at the same or higher level of completeness. (C) Contamination of a bin is defined as the percentage of base pairs that did not align to the assigned genome. Bins are ordered in the increasing order of their contamination value. The higher curve indicates that more base pairs are binned by graph scaffolds at the same or lower level of contamination.
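The completeness and contamination definitions in panels (B) and (C) can be written out as a short sketch. It assumes that the bin's aligned base pairs per reference genome are already known, that alignments do not overlap, and that the assigned genome is the one receiving the most aligned base pairs. The function `evaluate_bin` and its inputs are hypothetical and are not taken from the paper's evaluation scripts.

```python
def evaluate_bin(aligned_bp, genome_lengths, bin_size):
    """Score one bin against reference genomes of a simulated dataset.

    aligned_bp: {genome: base pairs of the bin aligned to that genome}.
    genome_lengths: {genome: genome length in base pairs}.
    bin_size: total base pairs in the bin.
    Returns (assigned genome, completeness %, contamination %).
    """
    # Assumption: the bin is assigned to the genome with the most aligned bp.
    assigned = max(aligned_bp, key=aligned_bp.get)
    on_target = aligned_bp[assigned]
    # Completeness: share of the assigned genome represented in the bin.
    completeness = 100.0 * on_target / genome_lengths[assigned]
    # Contamination: share of the bin that did not align to the assigned genome.
    contamination = 100.0 * (bin_size - on_target) / bin_size
    return assigned, completeness, contamination


print(evaluate_bin({"gA": 800, "gB": 150}, {"gA": 1000, "gB": 2000}, bin_size=1000))
```

Note that base pairs aligning to no reference at all also count toward contamination under this definition, because they did not align to the assigned genome.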
FIGURE 7
Graph scaffolds bin more contigs and reduce bin contamination in the HMP gut dataset. Comparing bins generated by MetaBAT2 using contigs, linear scaffolds, and graph scaffolds for the HMP gut dataset. The completeness and contamination of bins were evaluated with CheckM. (A) Cumulative base pairs binned with contigs, linear scaffolds, and graph scaffolds. Bins are ordered in decreasing order of their size. The upper curve corresponds to higher contiguity for the same number of bins. (B) Bins are ordered in decreasing order of their completeness value from CheckM evaluation. The upper curve indicates that more bins are at the same or higher level of completeness. (C) Bins are ordered in the increasing order of their contamination value from CheckM evaluation. The higher curve indicates that more bins are at the same or lower level of contamination.
FIGURE 8
Cutibacterium bins generated by graph scaffolds capture more auxiliary genome elements. Genes predicted from C. acnes bins were mapped to genes from the C. acnes pangenome and characterized as core, accessory, or putative-accessory. The x-axis denotes the number of genes in all of the C. acnes bins and the y-axis denotes the method by which each gene was binned. The label denotes the total number of genes in each bar. In (A) all genes binned by each method are included in the bars, while in (B) they are separated by how they are shared across binning methods.
FIGURE 9
Cutibacterium bins in sample MET0773. (A) Ordered lengths of graph scaffolds (top), linear scaffolds (middle), and contigs (bottom) included in C. acnes bins, highlighting the greater fragmentation in the bin generated using contigs. Red boxes highlight graph scaffolds depicted in parts (B–D). In (B–D), the large arrows represent contigs in a single graph scaffold. Lines connecting contigs denote paired-end read support. Contigs are colored to indicate the methods that include them in the C. acnes bins. Scaffold plots were generated by MetagenomeScope (Fedarko et al., 2017) and then modified in Illustrator to improve visualization. Genes in contigs uniquely binned by graph scaffolds are depicted below the scaffold as thin arrows. Genes were predicted and annotated by Prokka (Seemann, 2014) and visualized with the R package genoPlotR (Guy et al., 2010).
