Sci Rep. 2014 Oct 1;4:6480.
doi: 10.1038/srep06480.

Improved assemblies using a source-agnostic pipeline for MetaGenomic Assembly by Merging (MeGAMerge) of contigs

Matthew Scholz et al. Sci Rep. 2014.

Abstract

Assembly of metagenomic samples is a complex process: algorithms are designed to address sequencing platform-specific issues (read length, data volume, and/or community complexity) while also contending with genomes that differ greatly in nucleotide compositional bias and in abundance. To address these issues, we have developed a post-assembly process, MetaGenomic Assembly by Merging (MeGAMerge). We compare this process with the performance of several assemblers, using both real and in silico-generated samples of differing community composition and complexity. MeGAMerge consistently outperforms individual assembly methods, producing larger contigs with an increased number of predicted genes, without replication of data. MeGAMerge contigs are supported by read-mapping and contig-alignment data for both synthetically derived and real metagenomic data, as well as by gene prediction analyses and similarity searches. MeGAMerge is a flexible method that generates improved metagenome assemblies and can accommodate upcoming sequencing platforms as well as present and future assembly algorithms.

Figures

Figure 1. MeGAMerge pipeline for metagenomes.
This diagram provides an overview of the MeGAMerge process, including optional steps for trimming sequencing data and for including additional assemblers for Illumina reads. Long-read or contig sets may include Sanger libraries, error-corrected PacBio reads (raw reads are likely too error-prone to be merged), and any other source of contigs. Input sequences shorter than 200 bp are removed by default, but this threshold can be changed. The MeGAMerge pipeline currently uses Newbler to assemble short contigs and Minimus2 for the final assembly stage.
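The length filter mentioned in this caption can be illustrated with a minimal Python sketch. This is not the MeGAMerge code itself; the function names `read_fasta` and `filter_by_length` and the parameter `min_len` are hypothetical, and the only detail taken from the caption is the 200 bp default.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence text may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_len: int = 200) -> Iterator[Record]:
    """Keep only sequences of at least min_len bases (assumed default: 200)."""
    for name, seq in records:
        if len(seq) >= min_len:
            yield name, seq
```

Passing a different `min_len` corresponds to changing the default threshold described above.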
Figure 2. MeGAMerge contigs encompass many smaller contigs from individual assemblies.
The largest MeGAMerge contigs produced using input assemblies from SOAPdenovo with k-mer ranges of 21–31 (A) or 85–99 (B) are shown for sample SRS022071. The underlying contig coverage is indicated by all contigs from the individual assemblies that align to the MeGAMerge contigs (green and red lines indicate the 5′-3′ orientation of the original contigs). The read coverage is also shown, computed with a sliding window of 100 bp. Both contig and read coverage support the MeGAMerge-produced contigs.
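A sliding-window read coverage of the kind described in this caption could be computed roughly as follows. This is a sketch under stated assumptions: the input is a per-base read depth list for one contig, the window is summarized by its mean, and the step size of 1 bp is a guess, since the caption gives only the 100 bp window. The function name `sliding_window_mean` is hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def sliding_window_mean(depth: Sequence[int], window: int = 100, step: int = 1) -> List[float]:
    """Mean read depth in each window of `window` bases, advancing by `step` bases."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(depth) < window:
        return []  # contig shorter than one window
    # prefix[i] holds the summed depth of the first i bases.
    prefix = [0] + list(accumulate(depth))
    last_start = len(depth) - window
    return [
        (prefix[start + window] - prefix[start]) / window
        for start in range(0, last_start + 1, step)
    ]
```

The prefix sums keep the cost linear in contig length regardless of window size.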
Figure 3. Comparison of Statistical Metrics of Assembly for HMP and Oil Spill data.
Panel A compares various assemblers with MeGAMerge on the HMP data, plotting average contig size (x-axis) against total assembled bases (y-axis). MeGAMerge performs better than all other assemblers. Panel B shows the same graph for the assembly of the oil spill sample. Results are less uniform for this sample, but MeGAMerge continues to produce more bases at a large average contig size.
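The two metrics plotted in this figure, average contig size and total assembled bases, can be derived from a list of contig lengths. The following is a minimal sketch; the function name `assembly_stats` and the dictionary keys are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence, Union


def assembly_stats(contig_lengths: Sequence[int]) -> Dict[str, Union[int, float]]:
    """Summarize an assembly by contig count, total bases, and mean contig size."""
    count = len(contig_lengths)
    total = sum(contig_lengths)
    mean = total / count if count else 0.0  # avoid dividing by zero for an empty assembly
    return {"contigs": count, "total_bases": total, "mean_contig_size": mean}
```

For example, `assembly_stats([200, 400, 600])` reports 3 contigs, 1200 total bases, and a mean contig size of 400.0.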
Figure 4. Read-mapping validation of HMP and Oil Spill produced contigs.
Percent coverage (y-axis) versus contig size (x-axis) is displayed for MeGAMerge (black) and a single Ray assembly (red) for HMP sample SRS022071 (A) and the oil spill sample (B). MeGAMerge contigs follow a pattern similar to that of Ray contigs, while including larger contigs that are validated by reads.
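One plausible reading of "percent coverage" is the share of a contig's bases covered by at least one mapped read. The sketch below assumes that definition, which the caption does not spell out; the function name `percent_covered` and the `min_depth` parameter are hypothetical.

```python
from typing import Sequence


def percent_covered(depth: Sequence[int], min_depth: int = 1) -> float:
    """Percentage of positions whose read depth is at least min_depth (assumed definition)."""
    if not depth:
        return 0.0
    covered = sum(1 for d in depth if d >= min_depth)
    return 100.0 * covered / len(depth)
```

Plotting this value against `len(depth)` for each contig would give one point per contig, in the spirit of the axes described above.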
