Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 Sep 16;8(1):22.
doi: 10.1186/1748-7188-8-22.

Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Affiliations

Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Rayan Chikhi et al. Algorithms Mol Biol. .

Abstract

Background: The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB).

Results: We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives.

Conclusions: An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.

PubMed Disclaimer

Figures

Figure 1
Figure 1
A complete example of removing false positives in the probabilistic de Bruijn graph. (a) showsS, an example de Bruijn graph (the 7 non-dashed nodes), and, its probabilistic representation from a Bloom filter (taking the union of all nodes). Dashed rectangular nodes (in red in the electronic version) are immediate neighbors ofS in. These nodes are the critical false positives. Dashed circular nodes (in green) are all the other nodes of; (b) shows a sample of the hash values associates to the nodes ofS (a toy hash function is used); (c) shows the complete Bloom filter associated toS; incidentally, the nodes of are exactly those to which the Bloom filter answers positively; (d) describes the lower bound for exactly encoding the nodes ofS (self-information) and the space required to encode our structure (Bloom filter, 10 bits, and 3 critical false positives, 6 bits per 3-mer).
Figure 2
Figure 2
Optimal data structure size for the parameterr, then for the parameterk.(a) Structure size (Bloom filter, critical false positives) in function of the number of bits per k-mer allocated to the Bloom filter (also called ratio r) for k = 27. The trade-off that optimizes the total size is shown in dashed lines. (b) Optimal size of the structure for different values of k.
Figure 3
Figure 3
Data structure sizes for the probabilistic de Bruijn graph. Data structure sizes (Bloom filter, marking structure, and cFP if applicable) for the probabilistic de Bruijn graph with (top right) and without the cFP structure (top left), for an actual dataset (E. coli, k = 23). All plots are in function of the number of bits per k-mer allocated to the Bloom filter. Additionally, the difference is shown (bottom left and bottom right) between a reference assembly made using an exact de Bruijn graph, and an assembly made with each structure.

References

    1. Idury RM, Waterman MS. A new algorithm for DNA sequence assembly. J Comput Biol. 1995;8(2):291–306. doi: 10.1089/cmb.1995.2.291. - DOI - PubMed
    1. Grabherr MG. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotech. 2011;8(7):644–652. doi: 10.1038/nbt.1883. [ http://dx.doi.org/10.1038/nbt.1883] - DOI - DOI - PMC - PubMed
    1. Peng Y, Leung HCM, Yiu SM, Chin FYL. Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics. 2011;8(13):i94–i101. doi: 10.1093/bioinformatics/btr216. - DOI - PMC - PubMed
    1. Peterlongo P, Schnel N, Pisanti N, Sagot MF, Lacroix V. String Processing and Information Retrieval. Berlin, Heidelberg: Springer; 2010. Identifying SNPs without a reference genome by comparing raw reads; pp. 147–158.
    1. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012;8:226–232. doi: 10.1038/ng.1028. - DOI - PMC - PubMed