Nat Genet. 2016 Jul;48(7):811-6.
doi: 10.1038/ng.3571. Epub 2016 Jun 6.

Fast and accurate long-range phasing in a UK Biobank cohort

Po-Ru Loh et al. Nat Genet. 2016 Jul.

Abstract

Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1-2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.

Figures

Figure 1. Eagle algorithm and example phase calls after each step
We show phase calls for ten trio children after each successive step of the Eagle algorithm (applied to phase the first 40 cM of chromosome 10 in all N ≈ 150,000 UK Biobank samples except trio parents). At all trio-phased sites, red and blue indicate whether the first Eagle-phased haplotype for each child matches the maternal or paternal haplotype. (a) After the first step, a sizable proportion of each genome is covered by long segments of near-perfect phase; these segments are the regions in which long IBD is available from several relatives. (b) The second step, which uses both long and short IBD, fixes most of the phase switch errors in the first step. (c,d) The subsequent approximate HMM iterations further reduce the error rate.
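The red/blue coloring described in this caption can be illustrated with a minimal sketch. It assumes haplotypes are encoded as lists of 0/1 alleles and uses the common definition of switch error rate (switches between consecutive heterozygous sites, divided by the number of such consecutive pairs). The function names and the toy alleles are hypothetical and are not taken from the Eagle software or the paper's data.

```python
from typing import List


def parental_origin(hap1: List[int], maternal: List[int], paternal: List[int]) -> List[str]:
    """Label each heterozygous trio-phased site 'M' or 'P' according to
    which parental haplotype the first phased haplotype matches."""
    labels = []
    for h, m, p in zip(hap1, maternal, paternal):
        if m == p:
            continue  # homozygous site: carries no phase information
        labels.append("M" if h == m else "P")
    return labels


def switch_error_rate(labels: List[str]) -> float:
    """Fraction of consecutive heterozygous-site pairs whose labels differ."""
    if len(labels) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(labels, labels[1:]))
    return switches / (len(labels) - 1)


# Toy example (made-up alleles): one switch occurs between the second and
# third heterozygous sites.
maternal = [0, 1, 1, 0, 1, 0]
paternal = [1, 0, 1, 1, 0, 1]
hap1 = [0, 1, 1, 1, 0, 1]

labels = parental_origin(hap1, maternal, paternal)
rate = switch_error_rate(labels)
```

A long run of identical labels corresponds to a stretch of a single color in the figure; each change of label is one phase switch.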
Figure 2. Computational cost and accuracy of phasing methods
Benchmarks of Eagle and existing phasing methods (all run with default options) on N≈15,000, 50,000, and 150,000 UK Biobank samples and M=5,824 SNPs on chromosome 10. Log-log plots of (a) run times and (b) memory consumption using up to 10 cores of a 2.27 GHz Intel Xeon L5640 processor and up to two weeks of computation. (c) Mean switch error rate over 70 European-ancestry trios; error bars, s.e.m. All methods except HAPI-UR supported multithreading. As the HAPI-UR documentation suggested merging results from three independent runs with different random seeds, we parallelized these runs across three cores. (For the N≈150,000 experiment, HAPI-UR encountered a failed assertion bug for some random seeds, so we needed to try six random seeds to find three working seeds. We did not count this extra work against HAPI-UR.) Numeric data are provided in Supplementary Table 1.
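Panel (c) reports a mean switch error rate with s.e.m. error bars. As a hedged sketch of that summary statistic, the snippet below computes the mean and the standard error of the mean (sample standard deviation divided by the square root of the number of trios). The per-trio rates are placeholder values chosen for illustration only; they are not results from the paper, and `mean_and_sem` is a hypothetical helper name.

```python
import math
import statistics
from typing import List, Tuple


def mean_and_sem(values: List[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for per-trio error rates."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the s.e.m.")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # sample s.d. / sqrt(n)
    return mean, sem


# Placeholder per-trio switch error rates (illustrative, not real data).
toy_rates = [0.002, 0.004, 0.003, 0.003]
toy_mean, toy_sem = mean_and_sem(toy_rates)
```

With 70 trios, as in the benchmark, the same function would take a list of 70 per-trio rates.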
