Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Nov 8;29(11):987-91.
doi: 10.1038/nbt.2023.

How to apply de Bruijn graphs to genome assembly

Affiliations

How to apply de Bruijn graphs to genome assembly

Phillip E C Compeau et al. Nat Biotechnol. .
No abstract available

PubMed Disclaimer

Figures

Figure 1
Figure 1. Bridges of Konigsberg problem
(a) A map of old Königsberg, in which each area of the city is labeled with a differently colored point. (b) The Königsberg Bridge Graph, formed by representing each of four land areas as a node and each of the city’s seven bridges as an edge.
Figure 2
Figure 2. Two strategies for genome assembly: from Hamiltonian cycles to Eulerian cycles
(a) A simplified example of a small circular genome. (b) In traditional Sanger sequencing algorithms, reads were represented as nodes in a graph, and edges represented alignments between reads. Walking along a Hamiltonian cycle by following the edges in numerical order allows one to reconstruct the circular genome by combining alignments between successive reads. At the end of the cycle, the sequence wraps around to the start of the genome; the repeated part of the sequence is grayed out in the alignment diagram. (c) An alternative assembly technique first splits reads into all possible k-mers: with k = 3, “ATGGCGT” comprises ATG, TGG, GGC, GCG and CGT. Following a Hamiltonian cycle (indicated by red edges) allows one to reconstruct the genome by forming an alignment in which each successive k-mer (from successive nodes) is shifted by one position. This procedure recovers the genome but does not scale well to large graphs. (d) Modern short-read assembly algorithms construct a de Bruijn graph by representing all k-mer prefixes and suffixes as nodes, then drawing edges that represent k-mers having a particular prefix and suffix. For example, k-mer edge ATG has prefix AT and suffix TG. Finding an Eulerian cycle allows one to reconstruct the genome by forming an alignment in which each successive k-mer (from successive edges) is shifted by one position. This generates the same cyclic genome sequence without the computational strain of finding a Hamiltonian cycle.
Figure 3
Figure 3. De Bruijn graph
The de Bruijn graph B for k = 4 and a 2-character alphabet composed of the digits 0 and 1. This graph has an Eulerian cycle since each node has indegree and outdegree equal to 2. Following the blue numbered edges in order 1, 2, …, 16 gives an Eulerian cycle 0000, 0001, 0011, 0110, 1100, 1001, 0010, 0101, 1011, 0111, 1111, 1110, 1101, 1010, 0100, 1000, which spells the cyclic superstring 0000110010111101.

References

    1. Euler L. Solutio problematis ad geometriam situs pertinentis. Commentarii Academiae Scientiarum Petropolitanae. 1741;8:128–140.
    1. Skiena S. The Algorithm Design Manual. Berlin: Springer; 2008.
    1. Lander E, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. - PubMed
    1. Venter J, et al. The sequence of the human genome. Science. 2001;291:1304–1351. - PubMed
    1. Kececioglu J, Myers E. Combinatorial algorithms for DNA sequence assembly. Algorithmica. 1995;13:7–51.

Publication types