BMC Genomics. 2011;12 Suppl 2(Suppl 2):S8.
doi: 10.1186/1471-2164-12-S2-S8. Epub 2011 Jul 27.

Evaluation of short read metagenomic assembly

Anveshi Charuvaka et al. BMC Genomics. 2011.

Abstract

Background: Metagenomic assembly is a challenging problem because a sample contains genetic material from multiple organisms. The problem becomes even more difficult when short reads produced by next-generation sequencing technologies are used. Although whole-genome assemblers are not designed to assemble metagenomic samples, they are being used for metagenomics because assemblers capable of dealing with metagenomic samples are lacking. We present an evaluation of the assembly of simulated short-read metagenomic samples using a state-of-the-art de Bruijn graph-based assembler.

Results: We assembled simulated metagenomic reads from datasets of various complexities using a state-of-the-art de Bruijn graph-based parallel assembler. We also studied the effect of the k-mer size used in the de Bruijn graph on metagenomic assembly and developed a clustering solution to pool the contigs obtained from different assembly runs, which allowed us to obtain longer contigs. Finally, we assessed the degree of chimericity of the assembled contigs using an entropy/impurity metric and compared the metagenomic assemblies to assemblies of the isolated individual source genomes.
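
The entropy/impurity idea can be illustrated with a minimal sketch. It assumes that each read in a contig carries a known source label (for example, the genome or taxon it was simulated from) and that chimericity is scored as the Shannon entropy of those labels. The abstract does not give the exact definition or normalization used in the paper, so the function name `contig_entropy` and the choice of base-2 logarithm are illustrative assumptions only.

```python
import math
from collections import Counter


def contig_entropy(read_labels):
    """Shannon entropy (in bits) of the source labels of a contig's reads.

    Assumption: `read_labels` holds one label per read, e.g. the source
    genome or its taxon at some taxonomic level. This is a generic entropy
    score, not necessarily the exact metric used in the paper.
    """
    counts = Counter(read_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("a contig must contain at least one read")
    # p * log2(1/p) summed over labels; 0.0 for a contig with a single source
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A contig built from one source is pure; an even two-source mix scores 1 bit.
pure = contig_entropy(["genomeA"] * 10)
mixed = contig_entropy(["genomeA", "genomeA", "genomeB", "genomeB"])
```

Under this scoring, a contig whose reads all come from one source has entropy 0, and the value grows as reads from more sources are mixed in more evenly. Evaluating the same contig with labels taken at different taxonomic levels (strain, species, genus) would show at which level the mixing occurs.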

Conclusions: Our results show that the accuracy of the assembled contigs was better than expected for metagenomic samples with a few dominant organisms and was especially poor for samples containing many closely related strains. Clustering contigs obtained with different k-mer parameters of the de Bruijn graph allowed us to obtain longer contigs; however, the clustering also accumulated erroneous contigs, which increased the error rate of the clustered contigs.


Figures

Figure 1. Contig length distribution. The total number of contigs (in log scale) shorter than a given cutoff length, for each of the datasets. (A) simLC. (B) simMC. (C) simHC. (D) Comparison of length distribution of the clustered contigs for metagenomic and isolate assemblies.
Figure 2. Total bases recovered at different contig length cutoffs. The total number of bases contained in all the contigs shorter than a certain cutoff length. (A) simLC. (B) simMC. (C) simHC.
Figure 3. Contig entropy at different taxonomic levels. The entropy of contigs versus the contig length (in log scale) for the datasets. (A) simLC. (B) simMC. (C) simHC.
Figure 4. Coverage of source sequences from the clustered contigs.
Figure 5. Contig length distribution of the EcoliStrains dataset and isolate assembly. The total number of contigs (in log scale) shorter than a given cutoff length.
Figure 6. Coverage of source sequences for the EcoliStrains dataset.
Figure 7. Read coverage distribution. Distribution of the sampling depth of each genome in the datasets simLC, simMC and simHC.
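
The cumulative quantities plotted in Figures 1 and 2 (the number of contigs shorter than a cutoff, and the total bases they contain) can be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `cumulative_length_stats` and the example lengths are hypothetical, and "shorter than" is interpreted as strictly less than the cutoff.

```python
from bisect import bisect_left
from itertools import accumulate


def cumulative_length_stats(contig_lengths, cutoffs):
    """For each cutoff, return (cutoff, contig_count, total_bases) over the
    contigs strictly shorter than that cutoff."""
    lengths = sorted(contig_lengths)
    # prefix[i] = total bases in the i shortest contigs
    prefix = [0] + list(accumulate(lengths))
    stats = []
    for cutoff in cutoffs:
        n = bisect_left(lengths, cutoff)  # number of contigs with length < cutoff
        stats.append((cutoff, n, prefix[n]))
    return stats


# Hypothetical contig lengths in base pairs, for illustration only.
example = cumulative_length_stats([100, 250, 250, 1000], [200, 251, 5000])
```

Plotting the contig count on a log scale against the cutoff gives a curve of the kind described for Figure 1, and plotting the total bases gives the curve described for Figure 2.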

