Improved assemblies using a source-agnostic pipeline for MetaGenomic Assembly by Merging (MeGAMerge) of contigs

Matthew Scholz¹, Chien-Chi Lo¹, Patrick S G Chain¹

Affiliation

¹ [1] Genome Science Group, Los Alamos National Laboratory, Los Alamos, NM 87545 [2] Microbial and Metagenome Program, Joint Genome Institute, Walnut Creek, CA 94598.

PMID: 25270300
PMCID: PMC4180827
DOI: 10.1038/srep06480

Matthew Scholz et al. Sci Rep. 2014 Oct 1;4:6480. doi: 10.1038/srep06480.

Abstract

Assembly of metagenomic samples is a complex process: algorithms must address sequencing platform-specific issues (read length, data volume, and/or community complexity) while also handling genomes that differ greatly in nucleotide compositional bias and in abundance. To address these issues, we have developed a post-assembly process: MetaGenomic Assembly by Merging (MeGAMerge). We compare the performance of this process to that of several assemblers, using both real and in silico-generated samples of different community composition and complexity. MeGAMerge consistently outperforms individual assembly methods, producing larger contigs with an increased number of predicted genes, without replication of data. MeGAMerge contigs are supported by read mapping and contig alignment data, for both synthetically derived and real metagenomic data, as well as by gene prediction analyses and similarity searches. MeGAMerge is a flexible method that generates improved metagenome assemblies and can accommodate upcoming sequencing platforms as well as present and future assembly algorithms.

Figures

**Figure 1. MeGAMerge pipeline for metagenomes.**
This diagram provides an overview of the MeGAMerge process, including optional steps for trimming sequencing data and for adding further assemblers for Illumina reads. Long read or contig sets may include Sanger libraries, error-corrected PacBio reads (raw reads are likely to be too error-prone to be merged), and any other source of contigs. Input sequences shorter than 200 bp are excluded by default, but this threshold can be changed. The MeGAMerge pipeline currently uses Newbler to assemble short contigs, and Minimus2 as the final assembly stage.
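The length cutoff described in this caption can be illustrated with a minimal Python sketch. It is not the MeGAMerge implementation: the function names `parse_fasta` and `filter_by_length` and the parameter `min_len` are hypothetical, and the only detail taken from the caption is the 200 bp default.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Tuple[str, str]],
                     min_len: int = 200) -> List[Tuple[str, str]]:
    """Keep sequences of at least min_len bp (assumed default: 200)."""
    return [(h, s) for h, s in records if len(s) >= min_len]
```

For a file on disk, something like `filter_by_length(parse_fasta(open("contigs.fa")))` would apply the default cutoff; the file name here is only a placeholder.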

**Figure 2. MeGAMerge contigs encompass many smaller contigs from individual assemblies.**
The largest MeGAMerge contigs produced using input assemblies from SOAPdenovo with k-mer ranges of 21–31 (A) or 85–99 (B) are shown for sample SRS022071. The underlying contig coverage is indicated by all contigs from the individual assemblies that align to the MeGAMerge contigs (green and red lines indicate the 5′-3′ orientation of the original contigs). The read coverage is also shown using a sliding window of 100 bp. Both contig and read coverage support the MeGAMerge-produced contigs.
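A sliding-window read coverage track of the kind mentioned in this caption could be computed roughly as follows. This is a sketch under stated assumptions: per-base depth values are already available as a list, the window advances one base at a time (the caption gives only the 100 bp window size), and the function name `sliding_window_mean` is hypothetical.

```python
from typing import List, Sequence


def sliding_window_mean(depth: Sequence[float], window: int = 100) -> List[float]:
    """Mean depth for every full window, moving one base at a time."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(depth) < window:
        return []  # contig shorter than one window
    total = sum(depth[:window])
    means = [total / window]
    for start in range(1, len(depth) - window + 1):
        # Slide right: add the base entering the window, drop the one leaving.
        total += depth[start + window - 1] - depth[start - 1]
        means.append(total / window)
    return means
```

Keeping a running sum avoids re-summing each window, so the cost stays linear in contig length.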

**Figure 3. Comparison of Statistical Metrics of Assembly for HMP and Oil Spill data.**
Panel 3A shows the results of various assemblers compared to MeGAMerge for the average contig size (x-axis) and the total assembled bases (y-axis). MeGAMerge performs better than all other assemblers. Panel 3B shows the same graph for assembly of the oil spill sample. There is less uniformity for this sample, but MeGAMerge continues to produce more bases at a large average contig size.
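The two quantities plotted in this figure, average contig size and total assembled bases, are simple to derive from a list of contig lengths. The sketch below is illustrative only; the function name `assembly_stats` and the returned dictionary keys are assumptions, not part of MeGAMerge.

```python
from typing import Dict, Sequence, Union


def assembly_stats(lengths: Sequence[int]) -> Dict[str, Union[int, float]]:
    """Summarize an assembly from its contig lengths (in bp)."""
    if not lengths:
        return {"contigs": 0, "total_bases": 0, "mean_length": 0.0, "largest": 0}
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bases": total,                 # y-axis quantity in the figure
        "mean_length": total / len(lengths),  # x-axis quantity in the figure
        "largest": max(lengths),
    }
```

Computing these values per assembler on the same sample gives one point per assembler on a plot of this kind.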

**Figure 4. Read-mapping validation of HMP and Oil Spill produced contigs.**
Percent coverage (y-axis) versus contig size (x-axis) is displayed for MeGAMerge (black) and a single Ray assembly (red) for HMP sample SRS022071 (A) and the oil spill sample (B). MeGAMerge contigs follow a pattern similar to that of Ray contigs, with larger contigs that are validated by reads.
