Nat Commun. 2019 Nov 28;10(1):5436.
doi: 10.1038/s41467-019-13225-y.

Accurate, scalable and integrative haplotype estimation

Olivier Delaneau et al. Nat Commun. 2019.

Abstract

The number of human genomes being genotyped or sequenced increases exponentially and efficient haplotype estimation methods able to handle this amount of data are now required. Here we present a method, SHAPEIT4, which substantially improves upon other methods to process large genotype and high coverage sequencing datasets. It notably exhibits sub-linear running times with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.

Conflict of interest statement

E.T.D. is chairman and member of the Board, Hybridstat Ltd. O.D., J.-F.Z., M.R.R., and J.L.M. declare no competing interests.

Figures

Fig. 1
SHAPEIT4 overview. All panels show the unphased genotype data for individual G. (a) Selection of a small number of informative haplotypes. Conditioning haplotypes that share long matches with the current estimate D0, as identified by PBWT, are underlined. Matches are shown in red or blue depending on the matched haplotype. (b) Illustration of the Li and Stephens model run on the five informative haplotypes with long matches identified by PBWT. This gives new estimates D1 for G. (c) An example of a set R of phase-informative reads (i.e., reads overlapping multiple heterozygous genotypes) for individual G. Haplotype assembly on R gives three phase sets (PS1, PS2, and PS3). The phase sets are the information used by SHAPEIT4 when estimating the haplotypes for G. (d) Two examples of haplotype scaffolds for G. S1 is derived from trio information (M and F are genotype data for the mother and father of G). S2 is derived from a reference panel such as UKB: only the variants in G that overlap with UKB are phased, using all UKB haplotypes as the reference panel. Source data are provided as a Source Data file.
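The selection step in panel (a) can be illustrated with a minimal sketch of a PBWT positional prefix array. This is not the SHAPEIT4 implementation: the function names `pbwt_prefix_arrays` and `pbwt_neighbors`, the toy haplotypes, and the choice to take `p` neighbors on each side of the target are assumptions made for illustration only.

```python
def pbwt_prefix_arrays(haps):
    """Return the positional prefix arrays a_0 .. a_L for a list of
    equal-length 0/1 haplotypes (illustrative sketch only)."""
    n_sites = len(haps[0]) if haps else 0
    order = list(range(len(haps)))
    arrays = [order[:]]
    for site in range(n_sites):
        # Stable partition: haplotypes carrying 0 first, then those carrying 1.
        zeros = [h for h in order if haps[h][site] == 0]
        ones = [h for h in order if haps[h][site] == 1]
        order = zeros + ones
        arrays.append(order[:])
    return arrays


def pbwt_neighbors(order, target, p):
    """Return up to p haplotypes on each side of `target` in a prefix array.
    Adjacent entries tend to share the longest matches ending at that site."""
    pos = order.index(target)
    lo = max(0, pos - p)
    return [h for h in order[lo:pos + p + 1] if h != target]


# Toy data (hypothetical): h2 is identical to h0, h1 matches h0 on the last two sites.
toy_haps = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
arrays = pbwt_prefix_arrays(toy_haps)
print(arrays[-1])                        # ordering after the last site
print(pbwt_neighbors(arrays[-1], 0, 1))  # candidate conditioning haplotypes for h0
```

In this toy example the neighbors of haplotype 0 are the two haplotypes that match it over the most recent sites, which is the property a PBWT-based selection of conditioning haplotypes relies on.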
Fig. 2
Phasing performance on large sample sizes (UKB). (a) Switch error rates for all tested phasing methods as a function of sample size (from 10,000 to 400,000 individuals). The 95% binomial confidence interval is shown for each error rate. (b) Corresponding running times in hours, measured using the Linux time command (User + System time). (c) Percentage change in the running time needed to phase a single genome out of 10,000 to 400,000 genomes, relative to the time needed to phase one genome out of 10,000. Positive, null, or negative slopes are indicative of close-to-linear, linear or sub-linear scaling, respectively. (d) Running time as a function of switch error rate for various values of the parameters controlling the accuracy of the tested methods: the fixed number of conditioning states K for Beagle5 and Eagle2, and the number P of PBWT neighbors for SHAPEIT4. This was run on 100,000 UKB genomes. 95% binomial confidence intervals are given. Source data are provided as a Source Data file.
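The quantity plotted in panel (c) can be sketched as follows. The function name `per_genome_pct_change` and the running times below are made-up illustrations, not values from the paper; the sketch assumes total running time is divided evenly across genomes.

```python
def per_genome_pct_change(sample_sizes, total_hours):
    """Percentage change in per-genome running time relative to the
    first (smallest) sample size. Illustrative sketch only."""
    base = total_hours[0] / sample_sizes[0]
    return [
        100.0 * ((hours / n) / base - 1.0)
        for n, hours in zip(sample_sizes, total_hours)
    ]


# Hypothetical running times, NOT results from the paper.
sizes = [10_000, 20_000, 40_000]
hours = [10.0, 20.0, 30.0]
changes = per_genome_pct_change(sizes, hours)
print(changes)
```

With these made-up numbers, the per-genome time is unchanged when doubling from 10,000 to 20,000 genomes (linear scaling of total time) and drops by about a quarter at 40,000 genomes, which is the kind of negative slope the caption associates with sub-linear scaling.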
Fig. 3
Phasing performance from large reference panels (UKB). Switch error rates (a–c) and running times (d–f) for all tested phasing methods as a function of the number of haplotypes in the reference panel. Three different sample sizes were tested for the main panel: 500 (a, d), 5000 (b, e), and 50,000 (c, f). The 95% binomial confidence interval is given for each error rate. Source data are provided as a Source Data file.
Fig. 4
Phasing performance on high-coverage sequence data (GIAB). (a) Switch error rates with 95% binomial confidence intervals for each tested combination of sequencing reads and haplotype scaffold. (b) Same information as in (a), measured only at variants with minor allele frequency (MAF) above 1%. (c) Switch error rates measured only at variants belonging to phase sets (i.e., haplotype assembly) before (squares) and after (triangles) refinement by SHAPEIT4. Results shown here assume a 0.01% error rate in the phase sets. Source data are provided as a Source Data file.
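For readers unfamiliar with the metric, a switch error rate and its binomial confidence interval can be computed along these lines. This is a hedged sketch: the captions do not say which binomial interval was used, so the Wilson score interval here is an assumption, and the function names and toy haplotypes are hypothetical.

```python
import math


def switch_errors(truth_h1, est_h1, het_sites):
    """Count phase switches between consecutive heterozygous sites.
    truth_h1 / est_h1 hold the alleles on the first haplotype of each phasing."""
    # True where the estimated first haplotype agrees with the truth at a het site.
    same = [truth_h1[i] == est_h1[i] for i in het_sites]
    switches = sum(1 for a, b in zip(same, same[1:]) if a != b)
    opportunities = max(len(same) - 1, 0)
    return switches, opportunities


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k errors in n trials (z=1.96 for ~95%).
    The choice of the Wilson interval is an assumption of this sketch."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy haplotypes (hypothetical); every site is treated as heterozygous.
truth = [0, 1, 0, 1, 1, 0]
est = [0, 1, 1, 0, 1, 0]
n_switch, n_opp = switch_errors(truth, est, range(len(truth)))
ser = n_switch / n_opp
ci_low, ci_high = wilson_interval(n_switch, n_opp)
print(n_switch, n_opp, ser, (ci_low, ci_high))
```

In the toy example the estimated haplotype flips phase after the second site and flips back after the fourth, giving two switches over five consecutive pairs of heterozygous sites.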
Fig. 5
Phase genotypes in B_{l+1} using prefix array A_l.
