Bioinformatics. 2014 Oct 15;30(20):2906-14.
doi: 10.1093/bioinformatics/btu416. Epub 2014 Jul 1.

Fast and accurate imputation of summary statistics enhances evidence of functional enrichment

Bogdan Pasaniuc et al. Bioinformatics.

Abstract

Motivation: Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov model (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available.

Results: In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1–5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case-control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of χ² association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses.
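
To make the idea of Gaussian imputation from summary statistics concrete, the following is a minimal sketch and not the authors' implementation. It assumes that association z-scores are jointly Gaussian with covariance given by the linkage disequilibrium (LD) correlation matrix estimated from a reference panel, so that z-scores at untyped SNPs can be predicted by their conditional mean given the typed SNPs. The function name `impute_z`, its parameters and the ridge term `lam` are hypothetical; adding `lam` to the diagonal is one simple way to account for a limited reference panel size, and its default value here is purely illustrative.

```python
import numpy as np


def impute_z(z_typed, ld, typed_idx, untyped_idx, lam=0.1):
    """Conditional-Gaussian imputation of z-scores (illustrative sketch).

    z_typed     : observed z-scores at the typed SNPs
    ld          : LD correlation matrix over all SNPs (from a reference panel)
    typed_idx   : positions of the typed SNPs in `ld`
    untyped_idx : positions of the SNPs to impute in `ld`
    lam         : hypothetical ridge term added to the typed-SNP LD block
    """
    ld = np.asarray(ld, dtype=float)
    z_typed = np.asarray(z_typed, dtype=float)

    # LD among typed SNPs, regularized on the diagonal.
    sigma_tt = ld[np.ix_(typed_idx, typed_idx)] + lam * np.eye(len(typed_idx))
    # LD between untyped (rows) and typed (columns) SNPs.
    sigma_it = ld[np.ix_(untyped_idx, typed_idx)]

    # Weights W = sigma_it @ inv(sigma_tt), computed with a linear solve.
    weights = np.linalg.solve(sigma_tt, sigma_it.T).T

    # Conditional mean of the untyped z-scores given the typed ones.
    z_imputed = weights @ z_typed
    # Per-SNP diag(W @ sigma_it.T), a rough indicator of imputation quality.
    r2_pred = np.einsum("ij,ij->i", weights, sigma_it)
    return z_imputed, r2_pred


# Toy example: one typed SNP in LD (r = 0.8) with one untyped SNP.
toy_ld = np.array([[1.0, 0.8], [0.8, 1.0]])
z_hat, r2 = impute_z([2.0], toy_ld, typed_idx=[0], untyped_idx=[1], lam=0.0)
```

With `lam=0.0` the sketch reduces to the plain conditional mean; a positive `lam` shrinks imputed z-scores toward zero, which is conservative when the LD estimates are noisy.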

Availability and implementation: A software package is publicly available at http://bogdan.bioinformatics.ucla.edu/software/.

Contact: bpasaniuc@mednet.ucla.edu or aprice@hsph.harvard.edu

Supplementary information: Supplementary materials are available at Bioinformatics online.

Figures

Fig. 1.
HMM-imputed (x-axis) versus ImpG-Summary (y-axis) association statistics (z-scores) for the BD phenotype in the WTCCC data (left) and for the height phenotype in the 1958 Birth Cohort data (right). Results for all other WTCCC phenotypes can be found in Supplementary Figure S4.
Fig. 2.
HMM-imputed (x-axis) versus ImpG-Summary (y-axis) association statistics (z-scores) at known associated SNPs from the NHGRI GWAS Catalog in the WTCCC data (left) and for height in the 1958 Birth Cohort data (right).
Fig. 3.
HMM-imputed (x-axis) versus ImpG-Summary (y-axis) association statistics (z-scores) for the TG phenotype in the blood lipids data. The left panel shows imputation of 10% of the z-scores using the remaining 90%, while the right panel shows imputation results starting from all variants present on the Illumina 610 array. Results for all blood lipids phenotypes can be found in Supplementary Figure S13. ImpG-Summary took 4 CPU days for the 10% data and under 10 CPU h for the array-based imputation.
Fig. 4.
Average variance per SNP (average association z² − 1) binned by functional class for all four blood lipid phenotypes. The left panel displays the absolute values attained in the original data and in the ImpG-Summary imputation to 1000G (r²pred > 0.8). The right panel shows the absolute difference between the original data and the 1000G-imputed association statistics.
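
The quantity plotted in Fig. 4, the average of z² − 1 within each functional class, can be sketched as follows. This is an illustration only: the function name `mean_excess_variance`, the class labels and the toy z-scores are hypothetical and are not taken from the paper's data.

```python
from collections import defaultdict


def mean_excess_variance(z_scores, classes):
    """Average of (z^2 - 1) per functional class (illustrative sketch).

    z_scores : association z-scores, one per SNP
    classes  : functional class label for each SNP (same length)
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for z, label in zip(z_scores, classes):
        # Under the null, E[z^2] = 1, so z^2 - 1 measures excess signal.
        sums[label] += z * z - 1.0
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Toy input with made-up values and labels.
toy_result = mean_excess_variance(
    [1.0, 3.0, 2.0], ["genic", "genic", "non-genic"]
)
```

A class whose average is well above zero carries more association signal per SNP than expected under the null, which is the sense in which enrichment is compared across classes.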
