Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2014 Dec 15;30(24):3476-83.
doi: 10.1093/bioinformatics/btu756. Epub 2014 Nov 13.

SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips

Affiliations

SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips

Shoshana Marcus et al. Bioinformatics. .

Abstract

Motivation: Genomics is expanding from a single reference per species paradigm into a more comprehensive pan-genome approach that analyzes multiple individuals together. A compressed de Bruijn graph is a sophisticated data structure for representing the genomes of entire populations. It robustly encodes shared segments, simple single-nucleotide polymorphisms and complex structural variations far beyond what can be represented in a collection of linear sequences alone.

Results: We explore deep topological relationships between suffix trees and compressed de Bruijn graphs and introduce an algorithm, splitMEM, that directly constructs the compressed de Bruijn graph in time and space linear to the total number of genomes for a given maximum genome size. We introduce suffix skips to traverse several suffix links simultaneously and use them to efficiently decompose maximal exact matches into graph nodes. We demonstrate the utility of splitMEM by analyzing the nine-strain pan-genome of Bacillus anthracis and up to 62 strains of Escherichia coli, revealing their core-genome properties.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Overview of a graphical representation of a pan-genome. The four input genomes (A–D) are decomposed into segments shared or specific to the individuals in the population with edges maintaining the adjacencies of the segments
Fig. 2.
Fig. 2.
Different overlapping configurations of MEMs in a sequence. The colored blocks represent MEMs in a genomic sequence. Different colors are used for distinct MEMs
Fig. 3.
Fig. 3.
Part of the suffix tree for a genome (left) with the corresponding part of the compressed de Bruijn graph (right). Two MEMs in the suffix tree and the suffix links that are followed to decompose the larger MEM to at least three repeat nodes, the purple nodes in the graph on the right. x, y and z are characters. α, β and γ are strings. Suffix links are displayed in red
Fig. 4.
Fig. 4.
Levels of genome sharing in the nodes of the pan-genome graphs of 9 strains of B.anthracis (top) and E.coli (bottom). The plots show the fraction of nodes that have each level of sharing
Fig. 5.
Fig. 5.
The running time and peak memory of splitMEM on the pan-genome graphs of increasing numbers of E.coli strains with k-mer length of 25. Each point represents the minimum value recorded over five trials to reduce measurement noise introduced by competing activity of the server. The line represents the linear regression of the points

References

    1. Bowe A, et al. Proceedings of the 12th International Conference on Algorithms in Bioinformatics, Ljubljana, Slovenia. Berlin, Heidelberg: Springer-Verlag; 2012. Succinct de bruijn graphs; pp. 225–235.
    1. Cazaux B, et al. From indexing data structures to de bruijn graphs. 2014. http://hal-lirmm.ccsd.cnrs.fr/lirmm-00950983.
    1. Chikhi R, Rizk G. Space-efficient and exact de bruijn graph representation based on a bloom filter. Algorithm Mol. Biol. 2013;8:22. - PMC - PubMed
    1. Chikhi R, et al. On the representation of de bruijn graphs. RECOMB. 2014;Vol. 8394:35–55.
    1. Gusfield D. Algorithms on Strings, Trees, and Sequences—Computer Science and Computational Biology. New York, NY: Cambridge University Press; 1997.

Publication types