An MCMC algorithm for haplotype assembly from whole-genome sequence data

Vikas Bansal¹, Aaron L Halpern, Nelson Axelrod, Vineet Bafna

Affiliations

PMID: 18676820
PMCID: PMC2493424
DOI: 10.1101/gr.077065.108

An MCMC algorithm for haplotype assembly from whole-genome sequence data

Vikas Bansal et al. Genome Res. 2008 Aug.

. 2008 Aug;18(8):1336-46.

doi: 10.1101/gr.077065.108.

Authors

Vikas Bansal¹, Aaron L Halpern, Nelson Axelrod, Vineet Bafna

Affiliation

¹ Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA. vibansal@cs.ucsd.edu <vibansal@cs.ucsd.edu>

PMID: 18676820
PMCID: PMC2493424
DOI: 10.1101/gr.077065.108

Abstract

In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and presence of mate pairs result in fairly long haplotypes (N50 length ~ 350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ~1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from (http://www.cse.ucsd.edu/users/vibansal/HASH/).

PubMed Disclaimer

Figures

**Figure 1.**
Illustration of how haplotypes can be assembled from sequenced reads. Each read is a fragment of one of the two chromosomes. Reads that share an allele at a common variant can be inferred to come from the same chromosome and joined together. Reads that differ at a particular variant can be inferred to come from different chromosomes and similarly extend the two haplotypes.

**Figure 2.**
Illustration of how *H_S* can be derived from a haplotype pair H and a subset S of the columns of the fragment matrix.

**Figure 3.**
Illustration of the recursive graph-partitioning algorithm for computing Γ. The weighted graph *G(X)* derived from the fragment matrix X and a haplotype pair H is shown on the *top right*. The tree structure below demonstrates the recursive partitioning of the columns of X using min-cut computations in the graph *G(X)*. The first cut (C₁), partitions the columns of X into two subsets: S = {1, 2, 3, 4, 5} and S = {6, 7, 8, 9, 10, 11, 12, 13}. The second cut (labeled C₂), further partitions the subset S into two smaller subsets: {1, 2, 3, 4} and {5}. Γ is obtained from the subsets labeling the nodes of the tree (except the root node).

**Figure 4.**
Comparison of the switch error rate for the algorithm HASH and the MCMC algorithm with Γ₁. The Y-axis is the average switch distance of the reconstructed haplotypes from the true haplotypes. The X-axis (simulated error rate) is the fraction of variant calls in the fragment matrix that were flipped.

**Figure 5.**
Fraction of variant calls with a posterior error probability of ≥0.5 using the HASH algorithm for different values of ε. (A) False-positive rate, given by the fraction of “correct” variant calls with high posterior error probabilities. (B) True-positive rate, given as the fraction of “flipped” variant calls with high posterior error-probability.

**Figure 6.**
Results of running the MCMC algorithm with different Γ on a fragment matrix with n = 200 columns (from chromosome 22 of HuRef genome). (A) A comparison of the HASH algorithm against two other MCMC algorithms: (1) ℳ(Γ₁) and (2) ℳ(Γ) where Γ was computed using the recursive graph-partitioning algorithm *G(X)*. All algorithms were initialized with a random haplotype pair. (B) Comparison of HASH algorithm initialized with a random haplotype against ℳ(Γ) (graph-partitioning) initialized with a good haplotype. Note that we are zooming in on the first 10,000 steps in the iteration.

**Figure 7.**
The percentage of variant calls that are inconsistent with the best haplotype assembly for three different methods: Greedy heuristic (Levy et al. 2007), MCMC algorithm with Γ₁ and the HASH algorithm for the 22 autosomes of HuRef.

**Figure 8.**
Mismatch rate and the “adjusted mismatch rate” (error rate) of the HuRef haplotypes estimated by comparison with the CEU HapMap haplotypes. The error rate is plotted as a function of r², i.e., computed for all pairs of adjacent SNPs with r² greater than a certain value.

**Figure 9.**
Comparison of haplotypes assembled using sequence data with the preferred HapMap phasing for each pair of adjacent SNPs inferred from the HapMap haplotypes. For three pairs of adjacent SNPs, the phase of the sequence-based haplotypes mismatches the preferred HapMap phasing (indicated by crosses). The first pair shows strong linkage disequilibrium (r² = 0.95), and therefore, the mismatch is more likely to represent a switch error in the sequence-based haplotypes. For the second pair of SNPs, the sequence-based haplotypes are correct and the mismatch is due to low LD between the SNP pair. For the third pair, LD is again low and the mismatch is due to a switch error in the sequence-based haplotypes.

See this image and copyright information in PMC

References

1. Bafna V., Istrail S., Lancia G., Rizzi R., Istrail S., Lancia G., Rizzi R., Lancia G., Rizzi R., Rizzi R. Polynomial and APX-hard cases of individual haplotyping problems. Theor. Comput. Sci. 2005;335:109–125.
1. Bansal V., Bashir A., Bafna V., Bashir A., Bafna V., Bafna V. Evidence for large inversion polymorphisms in the human genome from HapMap data. Genome Res. 2007;17:219–230. - PMC - PubMed
1. Bentley D. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. 2006;16:545–552. - PubMed
1. Churchill G.A., Waterman M.S., Waterman M.S. The accuracy of dna sequences: Estimating sequence quality. Genomics. 1992;14:89–98. - PubMed
1. Clark A.G. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 1990;7:111–122. - PubMed

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

An MCMC algorithm for haplotype assembly from whole-genome sequence data

Affiliation

An MCMC algorithm for haplotype assembly from whole-genome sequence data

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources