Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2010 Mar 1;26(5):589-95.
doi: 10.1093/bioinformatics/btp698. Epub 2010 Jan 15.

Fast and accurate long-read alignment with Burrows-Wheeler transform

Affiliations

Fast and accurate long-read alignment with Burrows-Wheeler transform

Heng Li et al. Bioinformatics. .

Abstract

Motivation: Many programs for aligning short sequencing reads to a reference genome have been developed in the last 2 years. Most of them are very efficient for short reads but inefficient or not applicable for reads >200 bp because the algorithms are heavily and specifically tuned for short queries with low sequencing error rate. However, some sequencing platforms already produce longer reads and others are expected to become available soon. For longer reads, hashing-based software such as BLAT and SSAHA2 remain the only choices. Nonetheless, these methods are substantially slower than short-read aligners in terms of aligned bases per unit time.

Results: We designed and implemented a new algorithm, Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), to align long sequences up to 1 Mb against a large sequence database (e.g. the human genome) with a few gigabytes of memory. The algorithm is as accurate as SSAHA2, more accurate than BLAT, and is several to tens of times faster than both.

Availability: http://bio-bwa.sourceforge.net

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Prefix trie and prefix DAWG of string ‘GOOGOL’. (A) Prefix trie. Symbol ‘∧’ marks the start of a string. The two numbers in a node gives the SA interval of the node. (B) Prefix DAWG constructed by collapsing nodes with the identical SA interval. For example, in the prefix trie, three nodes has SA interval [4, 4]. Their parents have interval [1, 2], [1, 2] and [1, 1], respectively. In the prefix DAWG, the [4, 4] node thus has parents [1, 2] and [1, 1]. Node [4, 4] represents three strings ‘OG’, ‘OGO’ and ‘OGOL’ with the first two strings being the prefix of ‘OGOL’. (A) is modified from Figure 1 in Li and Durbin (2009).

Similar articles

Cited by

References

    1. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. - PMC - PubMed
    1. Blumer A, et al. The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 1985;40:31–55.
    1. Burrows M, Wheeler DJ. Technical report 124. Palo Alto, CA: Digital Equipment Corporation; 1994. A block-sorting lossless data compression algorithm.
    1. Eid J, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323:133–138. - PubMed
    1. Ferragina P, Manzini G. Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS 2000) Redondo Beach, CA, USA: 2000. Opportunistic data structures with applications; pp. 390–398.

Publication types