Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 Jul 1;29(13):i361-70.
doi: 10.1093/bioinformatics/btt215.

Short read alignment with populations of genomes

Affiliations

Short read alignment with populations of genomes

Lin Huang et al. Bioinformatics. .

Abstract

Summary: The increasing availability of high-throughput sequencing technologies has led to thousands of human genomes having been sequenced in the past years. Efforts such as the 1000 Genomes Project further add to the availability of human genome variation data. However, to date, there is no method that can map reads of a newly sequenced human genome to a large collection of genomes. Instead, methods rely on aligning reads to a single reference genome. This leads to inherent biases and lower accuracy. To tackle this problem, a new alignment tool BWBBLE is introduced in this article. We (i) introduce a new compressed representation of a collection of genomes, which explicitly tackles the genomic variation observed at every position, and (ii) design a new alignment algorithm based on the Burrows-Wheeler transform that maps short reads from a newly sequenced genome to an arbitrary collection of two or more (up to millions of) genomes with high accuracy and no inherent bias to one specific genome.

Availability: http://viq854.github.com/bwbble.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
BWT and SA construction for S = mamaliga$. All the rotations of S are first listed (1) and then sorted in lexicographic order (2). The BWT string is assembled from the last character of each sorted rotation (i.e. the right-most column of the sorted rotations matrix) and the suffix array SA is given by the original position in S of the suffix at the start of each sorted rotation (the substring preceding $ in each rotation)
Fig. 2.
Fig. 2.
Lexicographic order of the IUPAC code
Fig. 3.
Fig. 3.
Gray code order of the IUPAC code using the formula image nucleotide-to-bit assignment
Fig. 4.
Fig. 4.
BWT and SA construction of a toy reference multi-genome S = RWYAYA$ (i.e. (Aformula imageG)(Aformula imageT)(Cformula imageT)A(Cformula imageT)A). All the rotations of S are first listed (1) and then sorted in Gray code order (2)
Fig. 5.
Fig. 5.
Bubble formed by superimposing three sample genomes
Fig. 6.
Fig. 6.
Reference multi-genome with various variations. Branch padding corresponds to reads of length formula image
Fig. 7.
Fig. 7.
The average and standard deviation of the number of SA intervals generated per iteration during simulated read alignment on the reference multi-genome (averaged over 10K reads)
Fig. 8.
Fig. 8.
Inexact matching algorithm. Returns the set of SA intervals of substrings of G matching R with up to n differences. All the auxiliary BWT structures are assumed to have been already pre-computed for G
Fig. 9.
Fig. 9.
(1) Read R from the simNA dataset unaligned to the single-genome reference S-G by BWA (with default parameters BWA can handle indels up to length 6) and correctly mapped to a bubble branch representing an insertion of 16 bp in the multi-genome reference M-G using BWBBLE. (2) Read R from the simNA dataset unaligned to the single-genome reference S-G by BWA and correctly aligned using BWBBLE. The mismatches between the read and the references are shown in red (for BWA this number exceeds the allowed six differences); the SNPs in the M-G that match the read bases are shown in blue (these SNPs are treated as matches). (Note: only the relevant portions of the read and references are shown.)
Fig. 10.
Fig. 10.
BWBBLE running time vs. genome collection size. The performance is shown for each simulated and real read dataset. The results for the following number of incorporated genomes are shown: 1109 (12 646 424 SNPs), 545 (23 271 932 SNPs) and 1090 (28 929 053 SNPs)

References

    1. Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. Technical Report SRC-RR-124. 1994 HP Labs.
    1. Cherf GM, et al. Automated forward and reverse ratcheting of DNA in a nanopore at 5-å precision. Nat. Biotechnol. 2012;30:344–348. - PMC - PubMed
    1. Cornish-Bowden A. IUPAC-IUB symbols for nucleotide nomenclature. Nucleic Acids Res. 1985;13:3021–3030. - PMC - PubMed
    1. DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498. - PMC - PubMed
    1. Durbin M. So long, data depression. 2009. http://www.genomeweb.com/informatics/so-long-data-depression (31 May 2013, date last accessed)

Publication types