Proc Natl Acad Sci U S A. 2012 Aug 14;109(33):13272-7.
doi: 10.1073/pnas.1121464109. Epub 2012 Jul 30.

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

Jason Pell et al. Proc Natl Acad Sci U S A.

Abstract

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
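The idea of storing k-mers in a Bloom filter and reading the graph out of it can be illustrated with a minimal sketch. This is not the authors' implementation. The class name `KmerBloomFilter`, the helper `neighbors`, the BLAKE2b hashing scheme, and the sizes used are all assumptions made for illustration. The sketch also assumes that edges are left implicit, so a k-mer's neighbors are found by querying the eight k-mers that overlap it by k-1 bases, and it ignores reverse complements.

```python
import hashlib


class KmerBloomFilter:
    """Hypothetical Bloom filter over k-mers (illustrative, not the paper's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive one bit position per hash function by salting BLAKE2b.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode("ascii"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, kmer):
        # May return True for a k-mer never added (false positive),
        # but never returns False for a k-mer that was added.
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(kmer)
        )


def kmers(seq, k):
    """Yield every k-mer of a linear sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(bloom, kmer):
    """Query the up-to-8 k-mers that overlap `kmer` by k-1 bases."""
    found = set()
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate != kmer and candidate in bloom:
                found.add(candidate)
    return sorted(found)


# Example: load the 4-mers of a short made-up read, then walk one step.
read = "ACGTACGGTCA"
bloom = KmerBloomFilter(num_bits=1 << 16, num_hashes=2)
for kmer in kmers(read, 4):
    bloom.add(kmer)
first_neighbors = neighbors(bloom, "ACGT")
```

Because the filter can report k-mers that were never inserted, `neighbors` may occasionally return an erroneous k-mer. This is the inexactness the abstract refers to, traded for a memory footprint that does not depend on storing the k-mers themselves.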

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1. Graph visualizations demonstrating the decreasing fidelity of graph structure with increasing false positive rate. Erroneous k-mers are colored red and k-mers corresponding to the original generated sequence (1,000 31-mers generated by a 1,031 bp circular chromosome) are black. From top left to bottom right, the false positive rates are 0.01, 0.05, 0.10, and 0.15. Shortcuts “across” the graph are not created.
Fig. 2. Average component size versus false positive rate. The average component size sharply increases as the false positive rate approaches the percolation threshold.
Fig. 3. The diameter of randomly generated 58 bp long circular chromosomes in 8-mer (i.e., a cycle of 50 8-mers) space remains constant for false positive rates up through 18.3%. Only real (nonerror) k-mers are considered for starting and ending points.
Fig. 4. Comparison between Bloom filters at different false positive rates with the information-theoretic lossless lower bound at different k values. Bloom filter memory usage is independent of k. At higher k, Bloom filters are more efficient than any lossless data structure, because the inserted k-mers become increasingly sparse relative to all possible k-mers.
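The comparison in Fig. 4 can be approximated numerically. The sketch below rests on two assumptions that are not confirmed by the text above. First, the Bloom filter cost uses the textbook formula for an optimally tuned filter, -ln(p) / (ln 2)^2 bits per element at false positive rate p, and the authors' implementation may be configured differently. Second, the "lossless lower bound" is taken to be log2 of the number of ways to choose n k-mers out of all 4^k possible ones, divided by n. The function names and the value n = 1,000 are hypothetical.

```python
import math


def bloom_bits_per_kmer(fp_rate):
    """Bits per element for an optimally tuned Bloom filter (textbook formula)."""
    return -math.log(fp_rate) / (math.log(2) ** 2)


def lossless_bits_per_kmer(k, n):
    """Assumed lossless bound: log2(C(4^k, n)) / n bits per stored k-mer."""
    # math.comb returns an exact integer; math.log2 accepts large ints.
    return math.log2(math.comb(4 ** k, n)) / n


n = 1000  # hypothetical number of distinct k-mers stored
bloom_cost = bloom_bits_per_kmer(0.15)  # does not depend on k
lossless_costs = {k: lossless_bits_per_kmer(k, n) for k in (8, 16, 31)}
```

Under these assumptions, a false positive rate of 0.15 costs just under 4 bits per k-mer, in line with the figure quoted in the abstract, while the lossless bound grows with k as the stored k-mers occupy a shrinking fraction of the 4^k possibilities.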

