Bioinformatics. 2010 Jun 15;26(12):i183-90.

doi: 10.1093/bioinformatics/btq215.

Optimal algorithms for haplotype assembly from whole-genome sequence data

Dan He¹, Arthur Choi, Knot Pipatsrisawat, Adnan Darwiche, Eleazar Eskin

PMID: 20529904
PMCID: PMC2881399
DOI: 10.1093/bioinformatics/btq215

Dan He et al. Bioinformatics. 2010.

Affiliation

¹ Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA. danhe@cs.ucla.edu

Abstract

Motivation: Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy: obtaining haplotypes by combining sequence fragments. 'Haplotype assembly' is the problem of assembling the two haplotypes of a chromosome from a collection of such fragments, or reads, and their locations in the haplotypes, which are determined in advance by mapping the reads to a reference genome. Errors in reads significantly increase the difficulty of the problem, and it has been shown that the problem is NP-hard even for reads of length 2. Existing greedy and stochastic algorithms are not guaranteed to find optimal solutions to the haplotype assembly problem.

Results: In this article, we propose a dynamic programming algorithm that assembles the haplotypes optimally with time complexity O(m × 2^k × n), where m is the number of reads, k is the length of the longest read and n is the total number of SNPs in the haplotypes. We also reduce the haplotype assembly problem to the maximum satisfiability problem, which can often be solved optimally even when k is large. Taking advantage of the efficiency of our algorithm, we perform simulation experiments demonstrating that the assembly of haplotypes using reads of length typical of the current sequencing technologies is not practical. However, we demonstrate that combining this approach with traditional haplotype phasing approaches allows us to practically construct haplotypes containing both common and rare variants.
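To make the exponential dependence on the read length k concrete, here is a minimal Python sketch of a window-based dynamic program for haplotype assembly. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the objective, so the sketch assumes the cost of a haplotype is the number of read alleles that must be flipped for every read to agree with the haplotype or its complement; it assumes gapless reads over heterozygous sites only; and the names `read_cost`, `total_cost`, `brute_force` and `dp_assemble` are hypothetical. Because it stores whole prefixes for simplicity, its running time is not exactly the stated O(m × 2^k × n) bound.

```python
from collections import defaultdict
from itertools import product

# A read is (start, alleles): it covers SNP positions start .. start+len(alleles)-1
# and `alleles` is a string of '0'/'1'. Gapless reads are an assumption of this sketch.

def read_cost(read, hap):
    """Fewest allele flips needed for the read to match hap or its complement."""
    start, alleles = read
    mismatches = sum(1 for j, a in enumerate(alleles) if int(a) != hap[start + j])
    return min(mismatches, len(alleles) - mismatches)

def total_cost(reads, hap):
    """Assumed objective: sum of per-read flip counts."""
    return sum(read_cost(r, hap) for r in reads)

def brute_force(reads, n):
    """Reference solver: try all 2^n haplotypes (only viable for tiny n)."""
    best = min(product((0, 1), repeat=n), key=lambda h: total_cost(reads, h))
    return best, total_cost(reads, best)

def dp_assemble(reads, n):
    """Scan SNPs left to right, keeping the cheapest prefix per k-bit window."""
    k = max(len(alleles) for _, alleles in reads)
    ends_at = defaultdict(list)
    for start, alleles in reads:
        ends_at[start + len(alleles) - 1].append((start, alleles))

    # window (last <= k haplotype bits) -> (cost so far, full prefix)
    table = {(): (0, ())}
    for i in range(n):
        new_table = {}
        for window, (cost, prefix) in table.items():
            for bit in (0, 1):
                new_prefix = prefix + (bit,)
                new_window = (window + (bit,))[-k:]
                # A read ending at i has length <= k, so the window determines its cost.
                new_cost = cost + sum(read_cost(r, new_prefix) for r in ends_at[i])
                if new_window not in new_table or new_cost < new_table[new_window][0]:
                    new_table[new_window] = (new_cost, new_prefix)
        table = new_table

    cost, hap = min(table.values())
    return hap, cost

# Toy instance: three reads consistent with haplotype 010 / 101, plus one read with an error.
reads = [(0, "010"), (1, "10"), (0, "101"), (0, "011")]
hap, cost = dp_assemble(reads, 3)
```

The table never holds more than 2^k windows per SNP position, which is where the 2^k factor comes from; on small inputs the result can be checked against `brute_force`, which enumerates all 2^n haplotypes.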

Figures

**Fig. 1.**
(a) The number of short reads, all reads and (b) the length of haplotypes for each chromosome. The threshold for short reads is 15. The length of haplotypes is the number of heterozygous sites in each chromosome.

**Fig. 2.**
Graphical representation of the read matrix for the first block of Chromosome 22, where the reads are sorted by their starting positions. The rows are the reads and the columns are the haplotype positions. The black dots are the non-‘−’ cells for the short reads and the red dots are the non-‘−’ cells for the long reads. The red lines are the gap cells of the paired-end reads.

Cited by

Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing.
Amini S, Pushkarev D, Christiansen L, Kostem E, Royce T, Turk C, Pignatelli N, Adey A, Kitzman JO, Vijayan K, Ronaghi M, Shendure J, Gunderson KL, Steemers FJ. Nat Genet. 2014 Dec;46(12):1343-9. doi: 10.1038/ng.3119. Epub 2014 Oct 19. PMID: 25326703.

An accurate algorithm for the detection of DNA fragments from dilution pool sequencing experiments.
Bansal V. Bioinformatics. 2018 Jan 1;34(1):155-162. doi: 10.1093/bioinformatics/btx436. PMID: 29036419.

PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data.
Malikic S, Mehrabadi FR, Ciccolella S, Rahman MK, Ricketts C, Haghshenas E, Seidman D, Hach F, Hajirasouliha I, Sahinalp SC. Genome Res. 2019 Nov;29(11):1860-1877. doi: 10.1101/gr.234435.118. Epub 2019 Oct 18. PMID: 31628256.

Allele Phasing Greatly Improves the Phylogenetic Utility of Ultraconserved Elements.
Andermann T, Fernandes AM, Olsson U, Töpel M, Pfeil B, Oxelman B, Aleixo A, Faircloth BC, Antonelli A. Syst Biol. 2019 Jan 1;68(1):32-46. doi: 10.1093/sysbio/syy039. PMID: 29771371.

The next phase in human genetics.
Bansal V, Tewhey R, Topol EJ, Schork NJ. Nat Biotechnol. 2011 Jan;29(1):38-9. doi: 10.1038/nbt.1757. PMID: 21221098.
