. 2013 Jul 1;29(13):i361-70.

doi: 10.1093/bioinformatics/btt215.

Short read alignment with populations of genomes

Lin Huang¹, Victoria Popic, Serafim Batzoglou

Affiliations

PMID: 23813006
PMCID: PMC3694645
DOI: 10.1093/bioinformatics/btt215

Short read alignment with populations of genomes

Lin Huang et al. Bioinformatics. 2013.

. 2013 Jul 1;29(13):i361-70.

doi: 10.1093/bioinformatics/btt215.

Authors

Lin Huang¹, Victoria Popic, Serafim Batzoglou

Affiliation

¹ Department of Computer Science, Stanford University, Stanford, CA 94305, USA.

PMID: 23813006
PMCID: PMC3694645
DOI: 10.1093/bioinformatics/btt215

Abstract

Summary: The increasing availability of high-throughput sequencing technologies has led to thousands of human genomes having been sequenced in the past years. Efforts such as the 1000 Genomes Project further add to the availability of human genome variation data. However, to date, there is no method that can map reads of a newly sequenced human genome to a large collection of genomes. Instead, methods rely on aligning reads to a single reference genome. This leads to inherent biases and lower accuracy. To tackle this problem, a new alignment tool BWBBLE is introduced in this article. We (i) introduce a new compressed representation of a collection of genomes, which explicitly tackles the genomic variation observed at every position, and (ii) design a new alignment algorithm based on the Burrows-Wheeler transform that maps short reads from a newly sequenced genome to an arbitrary collection of two or more (up to millions of) genomes with high accuracy and no inherent bias to one specific genome.

Availability: http://viq854.github.com/bwbble.

PubMed Disclaimer

Figures

**Fig. 1.**
BWT and SA construction for *S = mamaliga$*. All the rotations of S are first listed (1) and then sorted in lexicographic order (2). The BWT string is assembled from the last character of each sorted rotation (i.e. the right-most column of the sorted rotations matrix) and the suffix array SA is given by the original position in S of the suffix at the start of each sorted rotation (the substring preceding $ in each rotation)

**Fig. 2.**
Lexicographic order of the IUPAC code

**Fig. 3.**
Gray code order of the IUPAC code using the nucleotide-to-bit assignment

formula image — **Fig. 3.**
Gray code order of the IUPAC code using the nucleotide-to-bit assignment

**Fig. 4.**
BWT and SA construction of a toy reference multi-genome *S = RWYAYA$* (i.e. (AG)(AT)(CT)A(CT)A). All the rotations of S are first listed (1) and then sorted in Gray code order (2)

**Fig. 5.**
Bubble formed by superimposing three sample genomes

**Fig. 6.**
Reference multi-genome with various variations. Branch padding corresponds to reads of length

**Fig. 7.**
The average and standard deviation of the number of SA intervals generated per iteration during simulated read alignment on the reference multi-genome (averaged over 10K reads)

**Fig. 8.**
Inexact matching algorithm. Returns the set of SA intervals of substrings of G matching R with up to n differences. All the auxiliary BWT structures are assumed to have been already pre-computed for G

**Fig. 9.**
(1) Read R from the simNA dataset unaligned to the single-genome reference *S-G* by BWA (with default parameters BWA can handle indels up to length 6) and correctly mapped to a bubble branch representing an insertion of 16 bp in the multi-genome reference *M-G* using BWBBLE. (2) Read R from the simNA dataset unaligned to the single-genome reference *S-G* by BWA and correctly aligned using BWBBLE. The mismatches between the read and the references are shown in red (for BWA this number exceeds the allowed six differences); the SNPs in the *M-G* that match the read bases are shown in blue (these SNPs are treated as matches). (Note: only the relevant portions of the read and references are shown.)

**Fig. 10.**
BWBBLE running time vs. genome collection size. The performance is shown for each simulated and real read dataset. The results for the following number of incorporated genomes are shown: 1109 (12 646 424 SNPs), 545 (23 271 932 SNPs) and 1090 (28 929 053 SNPs)

See this image and copyright information in PMC

References

1. Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. Technical Report SRC-RR-124. 1994 HP Labs.
1. Cherf GM, et al. Automated forward and reverse ratcheting of DNA in a nanopore at 5-å precision. Nat. Biotechnol. 2012;30:344–348. - PMC - PubMed
1. Cornish-Bowden A. IUPAC-IUB symbols for nucleotide nomenclature. Nucleic Acids Res. 1985;13:3021–3030. - PMC - PubMed
1. DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498. - PMC - PubMed
1. Durbin M. So long, data depression. 2009. http://www.genomeweb.com/informatics/so-long-data-depression (31 May 2013, date last accessed)

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
- scite Smart Citations
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Short read alignment with populations of genomes

Affiliation

Short read alignment with populations of genomes

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources

Miscellaneous