Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 May 24:14:13.
doi: 10.1186/s13015-019-0148-5. eCollection 2019.

Prefix-free parsing for building big BWTs

Affiliations

Prefix-free parsing for building big BWTs

Christina Boucher et al. Algorithms Mol Biol. .

Abstract

High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive-a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build an 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.

Keywords: Burrows-Wheeler Transform; Compression-aware algorithms; Genomic databases; Prefix-free parsing.

PubMed Disclaimer

Conflict of interest statement

Competing interestsThe authors declare that they have no competing interests.

Figures

Fig. 1
Fig. 1
The suffix trie for our example with the three strings GATTACAT, GATACAT and GATTAGATA. The input is shown at the bottom, in red because we do not need to store it
Fig. 2
Fig. 2
The suffix tree for our example. We now also need to store the input
Fig. 3
Fig. 3
The suffix array for our example is the sequence of values stored in the leaves of the tree (which we need not store explicitly). The LF mapping is shown as the arrows between two copies of the suffix array; the arrows to values i such that T[SA[i]]=A are in red, to illustrate that they point to consecutive positions in the suffix array and do not cross. Since Ψ is the inverse of the LF mapping, it can be obtained by simply reversing the direction of the arrows
Fig. 4
Fig. 4
The BWT and the sorted list of characters for our example. Drawing arrows between corresponding occurrences of characters in the two strings gives us the diagram for the LF-mapping
Fig. 5
Fig. 5
Results versus various choices of parameters (wp) on a collection of 1000 Salmonella genomes (2.7 GB)
Fig. 6
Fig. 6
RLFM indexing efficiency for successively larger collections of genetically distinct human chr19s. Results for the prefix-free parsing step (“pfbwt”) are shown alongside the overall RLFM index-building (“rlfm_total”) and Bowtie (“bowtie”) results
Fig. 7
Fig. 7
RLFM average “count” query time for successively larger collections of genetically distinct human chr19s
Fig. 8
Fig. 8
Computational efficiency of the three stages of index building when SA sampling is included. Results are shown for the prefix-free parsing (“pfbwt”), back-stepping (“bwtscan”) and run-length encoding (“rle”) steps. “total” is the sum of the three steps

References

    1. The 1000 Genomes Project Consortium A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. - DOI - PMC - PubMed
    1. Turnbull C, et al. The 100,000 genomes project: bringing whole genome sequencing to the nhs. Br Med J. 2018;361:1687. doi: 10.1136/bmj.k1687. - DOI - PubMed
    1. Carleton HA, Gerner-Smidt P. Whole-genome sequencing is taking over foodborne disease surveillance. Microbe. 2016;11:311–317.
    1. Stevens EL, Timme R, Brown EW, Allard MW, Strain E, Bunning K, Musser S. The public health impact of a publically available, environmental database of microbial genomes. Front Microbiol. 2017;8:808. doi: 10.3389/fmicb.2017.00808. - DOI - PMC - PubMed
    1. Burrows M, Wheeler DJ. A block-sorting lossless compression algorithm, Technical report. : Digital Equipment Corporation; 1994.

LinkOut - more resources