Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2010 Jun 15;26(12):i183-90.
doi: 10.1093/bioinformatics/btq215.

Optimal algorithms for haplotype assembly from whole-genome sequence data

Affiliations

Optimal algorithms for haplotype assembly from whole-genome sequence data

Dan He et al. Bioinformatics. .

Abstract

Motivation: Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy to obtain haplotypes by combining sequence fragments. The problem of 'haplotype assembly' is the problem of assembling the two haplotypes for a chromosome given the collection of such fragments, or reads, and their locations in the haplotypes, which are pre-determined by mapping the reads to a reference genome. Errors in reads significantly increase the difficulty of the problem and it has been shown that the problem is NP-hard even for reads of length 2. Existing greedy and stochastic algorithms are not guaranteed to find the optimal solutions for the haplotype assembly problem.

Results: In this article, we proposed a dynamic programming algorithm that is able to assemble the haplotypes optimally with time complexity O(m x 2(k) x n), where m is the number of reads, k is the length of the longest read and n is the total number of SNPs in the haplotypes. We also reduce the haplotype assembly problem into the maximum satisfiability problem that can often be solved optimally even when k is large. Taking advantage of the efficiency of our algorithm, we perform simulation experiments demonstrating that the assembly of haplotypes using reads of length typical of the current sequencing technologies is not practical. However, we demonstrate that the combination of this approach and the traditional haplotype phasing approaches allow us to practically construct haplotypes containing both common and rare variants.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
(a) The number of short reads, all reads and (b) the length of haplotypes for each chromosome. The threshold for short reads is 15. The length of haplotypes is the number of heterozygous sites in each chromosome.
Fig. 2.
Fig. 2.
Graphical representation of the read matrix for the first block of Chromosome 22, where the reads are sorted by their starting positions. The rows are the reads and the columns are the haplotype positions. The black dots are the non-‘−’ cells for the short reads and the red dots are the non-‘−’ cells for the long reads. The red lines are the gap cells of the paired-end reads.

Similar articles

Cited by

References

    1. 1000 Genomes Project. A deep catalog of human genetic variation. 2010 Available at http://www.1000genomes.org/ (last accessed date April 23, 2010)
    1. Bansal V, Bafna V. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics. 2008;24:i153. - PubMed
    1. Bansal V, et al. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res. 2008;18:1336. - PMC - PubMed
    1. Biere A, et al. Frontiers in Artificial Intelligence and Applications. Vol. 185. Nieume Hemweg, Amsterdam: IOS Press; 2009. Handbook of Satisfiability.
    1. Browning B, Browning S. Haplotypic analysis of Wellcome Trust Case Control Consortium data. Hum. genet. 2008;123:273–280. - PMC - PubMed

Publication types