Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 Jul 1;29(13):i352-60.
doi: 10.1093/bioinformatics/btt213.

Haplotype assembly in polyploid genomes and identical by descent shared tracts

Affiliations

Haplotype assembly in polyploid genomes and identical by descent shared tracts

Derek Aguiar et al. Bioinformatics. .

Abstract

Motivation: Genome-wide haplotype reconstruction from sequence data, or haplotype assembly, is at the center of major challenges in molecular biology and life sciences. For complex eukaryotic organisms like humans, the genome is vast and the population samples are growing so rapidly that algorithms processing high-throughput sequencing data must scale favorably in terms of both accuracy and computational efficiency. Furthermore, current models and methodologies for haplotype assembly (i) do not consider individuals sharing haplotypes jointly, which reduces the size and accuracy of assembled haplotypes, and (ii) are unable to model genomes having more than two sets of homologous chromosomes (polyploidy). Polyploid organisms are increasingly becoming the target of many research groups interested in the genomics of disease, phylogenetics, botany and evolution but there is an absence of theory and methods for polyploid haplotype reconstruction.

Results: In this work, we present a number of results, extensions and generalizations of compass graphs and our HapCompass framework. We prove the theoretical complexity of two haplotype assembly optimizations, thereby motivating the use of heuristics. Furthermore, we present graph theory-based algorithms for the problem of haplotype assembly using our previously developed HapCompass framework for (i) novel implementations of haplotype assembly optimizations (minimum error correction), (ii) assembly of a pair of individuals sharing a haplotype tract identical by descent and (iii) assembly of polyploid genomes. We evaluate our methods on 1000 Genomes Project, Pacific Biosciences and simulated sequence data.

Availability and implementation: HapCompass is available for download at http://www.brown.edu/Research/Istrail_Lab/.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Construction of the input to the haplotype assembly problem
Fig. 2.
Fig. 2.
Construction of the compass graph from SNP-fragment matrix M. The SNP-fragment matrix M (left) contains four fragments and four SNPs. Each SNP’s pairwise phasing relationship defined by the fragments is represented on the edges of the compass graph (right). The majority rule phasing for one of the haplotypes is shown in red on the compass graph edges
Fig. 3.
Fig. 3.
A graph of the haplotype transitions defined by the majority rule phasings of a compass graph. SNPs 1, 2, 3, 4 and 5 (left to right) are shown with both alleles (vertices), and edge transitions are encoded by a specific type of line depending on whether the haplotype is shared IBD or unique to the child or parent. The genotype of the parent and child are 22 222 and 22 122, respectively (where the two corresponds to the heterozygote and 0 and 1 correspond to homozygous for the major and minor alleles, respectively)
Fig. 4.
Fig. 4.
Compass graphs formula image, a non-conflicting polyploid cycle (left), and formula image, a conflicting polyploid cycle (right). The vector on the edge corresponds to the haplotype counts for an edge in the format [00,01,10,11]. In both compass graphs, the haplotypes are 000, 000 and 111, while the reads in formula image are 000, 000 and 111, and the reads in formula image are 00–, 01–, 10–, –00, –00, –11, 0–0, 0–0 and 1–1
Fig. 5.
Fig. 5.
The chain graphs (top) formula image and (bottom) formula image corresponding to formula image and formula image, respectively
Fig. 6.
Fig. 6.
The auxiliary flow graphs (top) formula image and (bottom) formula image. For a k-ploid organism (in this case k = 3), a flow of k with 1 capacity on each edge corresponds to a valid assignment of haplotype paths to haplotypes of the phasing a level 1
Fig. 7.
Fig. 7.
Comparison between haplotype assembling the child individually versus with a parent. The haplotype size is number of SNPs in the component of GC, which represents the maximum number of SNPs that may be phased together
Fig. 8.
Fig. 8.
Comparison of the percentage of correctly phased polyploid SNP pairs for the greedy and probabilistic binning algorithms for varying number of input reads

References

    1. Aguiar D, Istrail S. Hapcompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data. J. Comput. Biol. 2012;19:577–590. - PMC - PubMed
    1. Bansal V, Bafna V. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics. 2008;24:i153–i159. - PubMed
    1. Bansal V, et al. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res. 2008;18:1336–1346. - PMC - PubMed
    1. Browning SR, Browning BL. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 2011;12:703–714. - PMC - PubMed
    1. Chen ZJ, Ni Z. Mechanisms of genomic rearrangements and gene expression changes in plant polyploids. BioEssays. 2006;28:240–252. - PMC - PubMed

Publication types