Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2008 Dec 16:9:540.
doi: 10.1186/1471-2105-9-540.

Shape-IT: new rapid and accurate algorithm for haplotype inference

Affiliations

Shape-IT: new rapid and accurate algorithm for haplotype inference

Olivier Delaneau et al. BMC Bioinformatics. .

Abstract

Background: We have developed a new computational algorithm, Shape-IT, to infer haplotypes under the genetic model of coalescence with recombination developed by Stephens et al in Phase v2.1. It runs much faster than Phase v2.1 while exhibiting the same accuracy. The major algorithmic improvements rely on the use of binary trees to represent the sets of candidate haplotypes for each individual. These binary tree representations: (1) speed up the computations of posterior probabilities of the haplotypes by avoiding the redundant operations made in Phase v2.1, and (2) overcome the exponential aspect of the haplotypes inference problem by the smart exploration of the most plausible pathways (ie. haplotypes) in the binary trees.

Results: Our results show that Shape-IT is several orders of magnitude faster than Phase v2.1 while being as accurate. For instance, Shape-IT runs 50 times faster than Phase v2.1 to compute the haplotypes of 200 subjects on 6,000 segments of 50 SNPs extracted from a standard Illumina 300 K chip (13 days instead of 630 days). We also compared Shape-IT with other widely used software, Gerbil, PL-EM, Fastphase, 2SNP, and Ishape in various tests: Shape-IT and Phase v2.1 were the most accurate in all cases, followed by Ishape and Fastphase. As a matter of speed, Shape-IT was faster than Ishape and Fastphase for datasets smaller than 100 SNPs, but Fastphase became faster -but still less accurate- to infer haplotypes on larger SNP datasets.

Conclusion: Shape-IT deserves to be extensively used for regular haplotype inference but also in the context of the new high-throughput genotyping chips since it permits to fit the genetic model of Phase v2.1 on large datasets. This new algorithm based on tree representations could be used in other HMM-based haplotype inference software and may apply more largely to other fields using HMM.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Schematic representation of a sample of n genotypes. In this example, the space of possible haplotypes Si for individual i contains 4 haplotype pairs with 8 distinct haplotypes. The possible phases between heterozygous markers are shown in bold.
Figure 2
Figure 2
Representation of the execution trellis of the hidden Markov model used to compute the probability of a haplotype. The haplotypes h1,..., h2n-2 denote the previously sampled haplotypes which are used to compute the probability of the observed haplotype h. The sets {o1,..., os} and {q1(k), ..., qs(k)} correspond respectively to the observed state sequence of haplotype h and to the hidden state sequence of haplotype hk. The transition probability aj(k,l) corresponds to the probability of jumping from hidden state qj(k) of haplotype hk to hidden state qj+1(l) of haplotype hl, and the emission probability bj(k) corresponds to the probability of observing oj given the hidden state qj(k). To compute the probability of observing the sequence h = {o1, ..., os} in this HMM, one must sum up the probabilities of observing h over all (2n - 2)s possible sequences of s hidden states which is done efficiently by the forward algorithm.
Figure 3
Figure 3
Different representations of the space of possible haplotypes pairs Si. The left panel (A) shows the list representation commonly used by haplotype software such as Phase v2.1. The lower right panel (C) shows the representation used by Shape-IT. White and black circles indicate the phases between the heterozygous SNPs. On this example we use the same genotype Gi described in Figure 1. For iterations as performed by Phase v2.1 (A), the list requires the exploration of 20 nodes (4 haplotype pairs × 5 SNPs). With the complete tree representation (B) 10 nodes need to be explored, and with the incomplete tree representation as performed by Shape-IT (C), only 7 nodes need to be explored. The difference observed between (B) and (C) results from the pruning strategy which avoids the exploration of the nodes with probability ≤ 0.01.
Figure 4
Figure 4
Algorithm 1 to compute the FDSL distribution on the complete haplotype tree.
Figure 5
Figure 5
Algorithm 2 to compute the FDSL distribution on the incomplete haplotype tree.
Figure 6
Figure 6
Accuracy of the different values tested for the threshold f in Shape-IT (grey boxes) compared to Phase v2.1 (black line). This comparison was done on 300 datasets of 50 Tag SNPs called CEU Illumina 50.

References

    1. Vasilescu A, Terashima Y, Enomoto M, Heath S, Poonpiriya V, Gatanaga H, Do H, Diop G, Hirtzig T, Auewarakul P, et al. A haplotype of the human CXCR1 gene protective against rapid disease progression in HIV-1+ patients. Proceedings of the National Academy of Sciences of the United States of America. 2007;104:3354–3359. doi: 10.1073/pnas.0611670104. - DOI - PMC - PubMed
    1. Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES. High-resolution haplotype structure in the human genome. Nature genetics. 2001;29:229–232. doi: 10.1038/ng1001-229. - DOI - PubMed
    1. Dawson E, Abecasis GR, Bumpstead S, Chen Y, Hunt S, Beare DM, Pabial J, Dibling T, Tinsley E, Kirby S, et al. A first-generation linkage disequilibrium map of human chromosome 22. Nature. 2002;418:544–548. doi: 10.1038/nature00864. - DOI - PubMed
    1. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, et al. The structure of haplotype blocks in the human genome. Science. 2002;296:2225–2229. doi: 10.1126/science.1069424. - DOI - PubMed
    1. Johnson GC, Esposito L, Barratt BJ, Smith AN, Heward J, Di Genova G, Ueda H, Cordell HJ, Eaves IA, Dudbridge F, et al. Haplotype tagging for the identification of common disease genes. Nature genetics. 2001;29:233–237. doi: 10.1038/ng1001-233. - DOI - PubMed

Publication types