Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2012 Jun;19(6):577-90.
doi: 10.1089/cmb.2012.0084.

HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data

Affiliations

HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data

Derek Aguiar et al. J Comput Biol. 2012 Jun.

Abstract

Genome assembly methods produce haplotype phase ambiguous assemblies due to limitations in current sequencing technologies. Determining the haplotype phase of an individual is computationally challenging and experimentally expensive. However, haplotype phase information is crucial in many bioinformatics workflows such as genetic association studies and genomic imputation. Current computational methods of determining haplotype phase from sequence data--known as haplotype assembly--have difficulties producing accurate results for large (1000 genomes-type) data or operate on restricted optimizations that are unrealistic considering modern high-throughput sequencing technologies. We present a novel algorithm, HapCompass, for haplotype assembly of densely sequenced human genome data. The HapCompass algorithm operates on a graph where single nucleotide polymorphisms (SNPs) are nodes and edges are defined by sequence reads and viewed as supporting evidence of co-occurring SNP alleles in a haplotype. In our graph model, haplotype phasings correspond to spanning trees. We define the minimum weighted edge removal optimization on this graph and develop an algorithm based on cycle basis local optimizations for resolving conflicting evidence. We then estimate the amount of sequencing required to produce a complete haplotype assembly of a chromosome. Using these estimates together with metrics borrowed from genome assembly and haplotype phasing, we compare the accuracy of HapCompass, the Genome Analysis ToolKit, and HapCut for 1000 Genomes Project and simulated data. We show that HapCompass performs significantly better for a variety of data and metrics. HapCompass is freely available for download (www.brown.edu/Research/Istrail_Lab/).

PubMed Disclaimer

Figures

FIG 1.
FIG 1.
The SNP-fragment matrix M is shown on the left containing four fragments and four SNPs. Each pairwise phasing relationship defined by the fragments is represented on the edges of the compass graph on the right. A positive edge indicates there is more evidence in the fragments for the formula image phasing while a negative edge indicates evidence for the opposite formula image phasing.
FIG 2.
FIG 2.
A compass graph GC is shown on the left with two conflicting cycles. One edge removal (s2, s3) makes GC happy by removing two conflicting cycles in one step. All spanning trees (ST) of the happy GC correspond to the same phasing, but only two are shown in the lower right corner.
FIG 3.
FIG 3.
In this simulation study, reads of size 100 bp were simulated on chromosome 22 of the 1000 genomes CEU individual NA12878. Mate pair lengths were sampled at random from one of four normal distribution with means [10 kb, 50 kb, 100 kb, 250 kb] and standard deviations [1 kb, 5 kb, 10 kb, 25 kb]. With these parameters for sequencing, we require about 10 million reads to connect most of the SNPs of GC.
FIG 4.
FIG 4.
We fit a linear least squares regression line to the FMPR measurement for algorithms Genome Analysis ToolKit (GATK), HapCut, and HapCompass on chromosome 22 of the 1000 genomes data for the NA12878 individual (top) for the 1000 genomes data only (Table 1) and (bottom) for the 1000 genomes data supplemented with 10 million simulated reads of length 100 and sequence base error rate of 0.05 (Table 2).

References

    1. Bafna Vi. Istrail S. Lancia G., et al. Polynomial and APX-hard cases of the individual haplotyping problem. Theor. Comput. Sci. 2005;335:109–125.
    1. Bansal V. Bafna V. HAPCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics. 2008;24:153–159. - PubMed
    1. Bansal V. Halpern A. L. Axelrod N., et al. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res. 2008;18:1336–1346. - PMC - PubMed
    1. Browning S.R. Browning B.L. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 2011;12:703–714. - PMC - PubMed
    1. Deo N. Prabhu G. Krishnamoorthy M.S. Algorithms for generating fundamental cycles in a graph. ACM Trans. Math. Softw. 1982;8:26–42.

Publication types

LinkOut - more resources