Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2008 Aug;18(8):1336-46.
doi: 10.1101/gr.077065.108.

An MCMC algorithm for haplotype assembly from whole-genome sequence data

Affiliations

An MCMC algorithm for haplotype assembly from whole-genome sequence data

Vikas Bansal et al. Genome Res. 2008 Aug.

Abstract

In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and presence of mate pairs result in fairly long haplotypes (N50 length ~ 350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ~1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from (http://www.cse.ucsd.edu/users/vibansal/HASH/).

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
Illustration of how haplotypes can be assembled from sequenced reads. Each read is a fragment of one of the two chromosomes. Reads that share an allele at a common variant can be inferred to come from the same chromosome and joined together. Reads that differ at a particular variant can be inferred to come from different chromosomes and similarly extend the two haplotypes.
Figure 2.
Figure 2.
Illustration of how HS can be derived from a haplotype pair H and a subset S of the columns of the fragment matrix.
Figure 3.
Figure 3.
Illustration of the recursive graph-partitioning algorithm for computing Γ. The weighted graph G(X) derived from the fragment matrix X and a haplotype pair H is shown on the top right. The tree structure below demonstrates the recursive partitioning of the columns of X using min-cut computations in the graph G(X). The first cut (C1), partitions the columns of X into two subsets: S = {1, 2, 3, 4, 5} and S = {6, 7, 8, 9, 10, 11, 12, 13}. The second cut (labeled C2), further partitions the subset S into two smaller subsets: {1, 2, 3, 4} and {5}. Γ is obtained from the subsets labeling the nodes of the tree (except the root node).
Figure 4.
Figure 4.
Comparison of the switch error rate for the algorithm HASH and the MCMC algorithm with Γ1. The Y-axis is the average switch distance of the reconstructed haplotypes from the true haplotypes. The X-axis (simulated error rate) is the fraction of variant calls in the fragment matrix that were flipped.
Figure 5.
Figure 5.
Fraction of variant calls with a posterior error probability of ≥0.5 using the HASH algorithm for different values of ε. (A) False-positive rate, given by the fraction of “correct” variant calls with high posterior error probabilities. (B) True-positive rate, given as the fraction of “flipped” variant calls with high posterior error-probability.
Figure 6.
Figure 6.
Results of running the MCMC algorithm with different Γ on a fragment matrix with n = 200 columns (from chromosome 22 of HuRef genome). (A) A comparison of the HASH algorithm against two other MCMC algorithms: (1) ℳ(Γ1) and (2) ℳ(Γ) where Γ was computed using the recursive graph-partitioning algorithm G(X). All algorithms were initialized with a random haplotype pair. (B) Comparison of HASH algorithm initialized with a random haplotype against ℳ(Γ) (graph-partitioning) initialized with a good haplotype. Note that we are zooming in on the first 10,000 steps in the iteration.
Figure 7.
Figure 7.
The percentage of variant calls that are inconsistent with the best haplotype assembly for three different methods: Greedy heuristic (Levy et al. 2007), MCMC algorithm with Γ1 and the HASH algorithm for the 22 autosomes of HuRef.
Figure 8.
Figure 8.
Mismatch rate and the “adjusted mismatch rate” (error rate) of the HuRef haplotypes estimated by comparison with the CEU HapMap haplotypes. The error rate is plotted as a function of r2, i.e., computed for all pairs of adjacent SNPs with r2 greater than a certain value.
Figure 9.
Figure 9.
Comparison of haplotypes assembled using sequence data with the preferred HapMap phasing for each pair of adjacent SNPs inferred from the HapMap haplotypes. For three pairs of adjacent SNPs, the phase of the sequence-based haplotypes mismatches the preferred HapMap phasing (indicated by crosses). The first pair shows strong linkage disequilibrium (r2 = 0.95), and therefore, the mismatch is more likely to represent a switch error in the sequence-based haplotypes. For the second pair of SNPs, the sequence-based haplotypes are correct and the mismatch is due to low LD between the SNP pair. For the third pair, LD is again low and the mismatch is due to a switch error in the sequence-based haplotypes.

Similar articles

Cited by

References

    1. Bafna V., Istrail S., Lancia G., Rizzi R., Istrail S., Lancia G., Rizzi R., Lancia G., Rizzi R., Rizzi R. Polynomial and APX-hard cases of individual haplotyping problems. Theor. Comput. Sci. 2005;335:109–125.
    1. Bansal V., Bashir A., Bafna V., Bashir A., Bafna V., Bafna V. Evidence for large inversion polymorphisms in the human genome from HapMap data. Genome Res. 2007;17:219–230. - PMC - PubMed
    1. Bentley D. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. 2006;16:545–552. - PubMed
    1. Churchill G.A., Waterman M.S., Waterman M.S. The accuracy of dna sequences: Estimating sequence quality. Genomics. 1992;14:89–98. - PubMed
    1. Clark A.G. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 1990;7:111–122. - PubMed

Publication types

LinkOut - more resources