An MCMC algorithm for haplotype assembly from whole-genome sequence data
- PMID: 18676820
- PMCID: PMC2493424
- DOI: 10.1101/gr.077065.108
Abstract
In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and presence of mate pairs result in fairly long haplotypes (N50 length ~350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ~1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from http://www.cse.ucsd.edu/users/vibansal/HASH/.
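The abstract reports two summary metrics: the N50 length of the assembled haplotype blocks and the switch error rate against reference haplotypes. The Python sketch below shows how these quantities are commonly defined. It is not taken from the HASH code, and the function names (`n50`, `switch_errors`, `switch_error_rate`) and the 0/1 string encoding of alleles at heterozygous sites are assumptions made for illustration. The exact evaluation procedure used in the paper may differ.

```python
from typing import Sequence


def n50(block_lengths: Sequence[int]) -> int:
    """Return the N50 of a set of haplotype block lengths.

    N50 is the largest length L such that blocks of length >= L
    together account for at least half of the total length.
    """
    if not block_lengths:
        raise ValueError("need at least one block length")
    total = sum(block_lengths)
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def switch_errors(inferred: str, truth: str) -> int:
    """Count switch errors within one haplotype block.

    Both arguments are hypothetical 0/1 strings giving the allele on one
    chromosome at each heterozygous site. A switch error is counted between
    adjacent sites where the inferred haplotype changes from agreeing with
    the reference to agreeing with its complement, or the reverse.
    """
    if len(inferred) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    agrees = [a == b for a, b in zip(inferred, truth)]
    return sum(1 for prev, cur in zip(agrees, agrees[1:]) if prev != cur)


def switch_error_rate(inferred: str, truth: str) -> float:
    """Switch errors divided by the number of adjacent site pairs."""
    pairs = len(inferred) - 1
    if pairs < 1:
        raise ValueError("need at least two sites")
    return switch_errors(inferred, truth) / pairs
```

With this definition, a haplotype that is the exact complement of the reference has no switch errors, because the two chromosomes of a diploid individual carry no inherent labels. Only changes of phase relative to the reference are counted.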
Similar articles
- Joint haplotype assembly and genotype calling via sequential Monte Carlo algorithm. BMC Bioinformatics. 2015 Jul 16;16:223. doi: 10.1186/s12859-015-0651-8. PMID: 26178880. Free PMC article.
- HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics. 2008 Aug 15;24(16):i153-9. doi: 10.1093/bioinformatics/btn298. PMID: 18689818.
- Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics. 2010 Jun 15;26(12):i183-90. doi: 10.1093/bioinformatics/btq215. PMID: 20529904. Free PMC article.
- A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem. Bioinformatics. 2010 Sep 15;26(18):2217-25. doi: 10.1093/bioinformatics/btq411. Epub 2010 Jul 11. PMID: 20624781. Free PMC article. Review.
- The Need for a Human Pangenome Reference Sequence. Annu Rev Genomics Hum Genet. 2021 Aug 31;22:81-102. doi: 10.1146/annurev-genom-120120-081921. Epub 2021 Apr 30. PMID: 33929893. Free PMC article. Review.
Cited by
- MixSIH: a mixture model for single individual haplotyping. BMC Genomics. 2013;14 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-14-S2-S5. Epub 2013 Feb 15. PMID: 23445519. Free PMC article.
- Global DNA hypomethylation coupled to repressive chromatin domain formation and gene silencing in breast cancer. Genome Res. 2012 Feb;22(2):246-58. doi: 10.1101/gr.125872.111. Epub 2011 Dec 7. PMID: 22156296. Free PMC article.
- SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics. 2015 Apr 3;16(1):260. doi: 10.1186/s12864-015-1408-5. PMID: 25885901. Free PMC article.
- Joint haplotype assembly and genotype calling via sequential Monte Carlo algorithm. BMC Bioinformatics. 2015 Jul 16;16:223. doi: 10.1186/s12859-015-0651-8. PMID: 26178880. Free PMC article.
- Sparse Tensor Decomposition for Haplotype Assembly of Diploids and Polyploids. BMC Genomics. 2018 Mar 21;19(Suppl 4):191. doi: 10.1186/s12864-018-4551-y. PMID: 29589554. Free PMC article.