Am J Hum Genet. 2018 Sep 6;103(3):338-348.
doi: 10.1016/j.ajhg.2018.07.015. Epub 2018 Aug 9.

A One-Penny Imputed Genome from Next-Generation Reference Panels

Brian L Browning et al. Am J Hum Genet. 2018.

Abstract

Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. When imputing into 1,000 phased target samples, Beagle 5.0 is 3× (10k reference samples), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1,000 phased target samples at a cost of less than one US cent per sample.

Keywords: GWAS; genome-wide association study; genotype imputation.

Figures

Figure 1. Composite Reference Haplotypes
Long haplotype segments with identical allele sequences are shown with the same color and pattern. The target haplotype shares five segments of identical alleles with the reference haplotypes. The two composite reference haplotypes are each a mosaic of reference haplotypes, with the mosaics chosen so that the target haplotype also shares five segments of identical alleles with the composite reference haplotypes. This permits the two composite reference haplotypes to be used in place of the four original reference haplotypes.
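The mosaic idea in the Figure 1 caption can be illustrated with a small toy sketch. It is not the paper's construction procedure (that is the pseudocode of Figure 2, which is not reproduced on this page); it assumes the shared segments are already known and packs them into composites with a simple greedy first-fit rule chosen only for illustration. All names and the 12-marker haplotypes below are hypothetical.

```python
# Toy illustration only: greedy first-fit packing of shared segments into
# composite reference haplotypes. This is an assumed simplification, not the
# Beagle 5.0 algorithm.

TARGET = "010011010110"
REF_HAPS = [
    "010011000110",  # matches the target on markers [0, 4) and [8, 12)
    "100011110001",  # matches the target on markers [2, 6)
    "101011011001",  # matches the target on markers [4, 8)
    "001100010101",  # matches the target on markers [6, 10)
]
# (start, end, reference index), end exclusive; assumed to be precomputed.
SEGMENTS = [(0, 4, 0), (2, 6, 1), (4, 8, 2), (6, 10, 3), (8, 12, 0)]


def build_composites(segments):
    """Assign each segment to the first composite it does not overlap."""
    composites = []
    for seg in sorted(segments):
        start = seg[0]
        for comp in composites:
            if comp[-1][1] <= start:  # last segment ends before this one starts
                comp.append(seg)
                break
        else:
            composites.append([seg])
    return composites


def composite_alleles(composite, ref_haps, n_markers):
    """Copy alleles from each segment's reference haplotype.

    Each piece runs from its segment start (marker 0 for the first piece)
    up to the start of the next segment, so gaps between segments are
    filled by extending the preceding reference haplotype.
    """
    alleles = []
    for i, (start, _end, ref) in enumerate(composite):
        copy_from = 0 if i == 0 else start
        copy_to = composite[i + 1][0] if i + 1 < len(composite) else n_markers
        alleles.extend(ref_haps[ref][copy_from:copy_to])
    return "".join(alleles)


if __name__ == "__main__":
    for comp in build_composites(SEGMENTS):
        print(composite_alleles(comp, REF_HAPS, len(TARGET)), comp)
```

In this toy data the five shared segments fit into two composites, and each composite still matches the target on every segment assigned to it, which is the property the caption describes.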
Figure 2. Pseudocode for Constructing Composite Reference Haplotypes
Figure 3. Genotype Imputation Accuracy
Genotype imputation accuracy when imputing genotypes from a 1000 Genomes Project reference panel (n = 2,452), a Haplotype Reference Consortium reference panel (n = 26,165), and from 10k, 100k, 1M, and 10M simulated UK-European reference samples. Imputed alleles are binned according to their minor allele count in each reference panel. The squared correlation (r²) between the true number of alleles on a haplotype (0 or 1) and the imputed posterior allele probability is reported for each minor allele count bin. The horizontal axis in each panel is on a log scale. The difference in accuracy for 10M reference samples is due to a difference in length of marker window.
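The accuracy metric in the Figure 3 caption can be sketched as follows: pool the imputed alleles whose markers fall in the same minor allele count bin, then compute the squared Pearson correlation between the true allele (0 or 1) and the imputed posterior probability. The bin edges, function names, and toy records are assumptions made for illustration; the page does not give the bins the authors used.

```python
from collections import defaultdict


def squared_correlation(xs, ys):
    """Squared Pearson correlation; None if either input has no variance."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None
    return sxy * sxy / (sxx * syy)


def r2_by_mac_bin(records, bin_upper_edges):
    """Pool alleles per minor-allele-count bin and report r^2 for each bin.

    records: iterable of (minor_allele_count, true_alleles, imputed_probs),
    one entry per marker, with one value per target haplotype.
    bin_upper_edges: ascending inclusive upper edges (hypothetical values).
    """
    pooled = defaultdict(lambda: ([], []))
    for mac, truth, probs in records:
        edge = next((e for e in bin_upper_edges if mac <= e), None)
        if edge is None:
            continue  # marker is more common than the last bin covers
        pooled[edge][0].extend(truth)
        pooled[edge][1].extend(probs)
    return {e: squared_correlation(t, p) for e, (t, p) in sorted(pooled.items())}


# Toy data: three markers, four target haplotypes each.
TOY_RECORDS = [
    (2, [0, 1, 0, 0], [0.1, 0.8, 0.2, 0.0]),
    (5, [1, 0, 0, 1], [0.9, 0.1, 0.3, 0.7]),
    (50, [0, 1, 1, 0], [0.0, 1.0, 1.0, 0.0]),
]

if __name__ == "__main__":
    print(r2_by_mac_bin(TOY_RECORDS, bin_upper_edges=[10, 100]))
```

Pooling across markers before computing the correlation keeps rare-variant bins from being dominated by markers with no observed minor alleles among the target haplotypes, where a per-marker correlation would be undefined.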
Figure 4. Single-Threaded Computation Time
Per-sample CPU time when imputing a 10 Mb region from 10k, 100k, 1M, and 10M simulated UK-European reference samples into 1,000 target samples using one computational thread. CPU time is the sum of the system and user time returned by the Unix time command. Impute4 was run with only the 10k reference panel due to software limitations. Minimac3 and Minimac4 were not run with the 10M reference panel due to memory and time constraints. (A) Results for Impute4, Minimac3, Minimac4, Beagle 4.1, and Beagle 5.0. (B) Zoomed-in results for Impute4, Minimac4, and Beagle 5.0.
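The CPU-time definition in the Figure 4 caption (user time plus system time) can be reproduced without the `time` command. The sketch below uses Python's Unix-only `resource` module to read the same two counters for a child process. Dividing by the number of target samples to get a per-sample figure is an assumption about how the reported values were derived, and the command shown is a placeholder rather than an imputation run.

```python
import resource
import subprocess
import sys


def child_cpu_seconds(command):
    """Run a command and return its user + system CPU time in seconds."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(command, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return user + system


def per_sample_cpu_seconds(command, n_target_samples):
    """Total CPU time divided by the number of target samples (assumed)."""
    return child_cpu_seconds(command) / n_target_samples


if __name__ == "__main__":
    # Placeholder workload standing in for an imputation command.
    placeholder = [sys.executable, "-c", "sum(range(10**6))"]
    print(per_sample_cpu_seconds(placeholder, n_target_samples=1000))
```

CPU time measured this way sums work across all threads of the child process, which is why the multi-threaded comparison in Figure 5 reports wall-clock time instead.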
Figure 5. Multi-Threaded Computation Time
Per-sample wall-clock time when imputing a 10 Mb region from 10k, 100k, 1M, and 10M simulated UK-European reference samples into 1,000 target samples using 12 computational threads. Minimac3 was not run with the 1M reference panel using 12 threads due to memory constraints. Minimac3 and Minimac4 were not run with the 10M reference panel due to memory and time constraints. (A) Results for Minimac3, Minimac4, Beagle 4.1, and Beagle 5.0. (B) Zoomed-in results for Minimac4 and Beagle 5.0.
