Evaluating the effects of imputation on the power, coverage, and cost efficiency of genome-wide SNP platforms

Carl A Anderson¹, Fredrik H Pettersson, Jeffrey C Barrett, Joanna J Zhuang, Jiannis Ragoussis, Lon R Cardon, Andrew P Morris

PMID: 18589396
PMCID: PMC2443836
DOI: 10.1016/j.ajhg.2008.06.008

Carl A Anderson et al. Am J Hum Genet. 2008 Jul;83(1):112-9. doi: 10.1016/j.ajhg.2008.06.008. Epub 2008 Jun 26.

Affiliation

¹ The Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, OX3 7BN Oxford, UK. carl.anderson@well.ox.ac.uk

Abstract

Genotype imputation is potentially a zero-cost method for bridging gaps in coverage and power between genotyping platforms. Here, we quantify these gains in power and coverage by using 1,376 population controls that are from the 1958 British Birth Cohort and were genotyped by the Wellcome Trust Case-Control Consortium with the Illumina HumanHap 550 and Affymetrix SNP Array 5.0 platforms. Approximately 50% of genotypes at single-nucleotide polymorphisms (SNPs) exclusively on the HumanHap 550 can be accurately imputed from direct genotypes on the SNP Array 5.0 or Illumina HumanHap 300. This roughly halves differences in coverage and power between the platforms. When the relative cost of currently available genome-wide SNP platforms is accounted for, and finances are limited but sample size is not, the highest-powered strategy in European populations is to genotype a larger number of individuals with the HumanHap 300 platform and carry out imputation. Platforms consisting of around 1 million SNPs offer poor cost efficiency for SNP association in European populations.

Figures

**Figure 1**
Minor-Allele Frequency of SNPs Directly Genotyped in 1,376 Samples from the 58C
(A) Minor-allele frequency for the 450,769 SNPs that are featured on the HumanHap 550 but not the Affymetrix SNP Array 5.0 and are also polymorphic in the 58C. (B) Minor-allele frequency for the subset of 427,839 SNPs from (A) that are also polymorphic in the CEU HapMap data. (C) Minor-allele frequency for the 215,998 SNPs that are featured on the HumanHap 550 but not the Illumina HumanHap 300* and are also polymorphic in the 58C. (D) Minor-allele frequency for the subset of 203,860 SNPs from (C) that are also polymorphic in the CEU HapMap data. Basing imputations on haplotype data from the HapMap causes variation at rare SNPs (MAF ≤ 0.02) to be lost.

**Figure 2**
Assessment of Imputed-Genotype Filtering Criteria
Assessment of filtering criteria for the Illumina HumanHap 550 genotypes based on Affymetrix SNP Array 5.0 (A–C) and Illumina HumanHap 300* (D–F) genotype data. (A and D) The number of SNPs passing filter thresholds based on per-SNP measures of mean maximum posterior probability (blue) or genotype call rate (red). The number of these SNPs with an r² ≥ 0.8 between direct and imputed genotype calls is shown after the removal of SNPs not passing filtering thresholds based on per-SNP measures of mean maximum posterior probability (dark gray) and genotype call rate (light gray). (B and E) The PLS-DA Q² value after the removal of SNPs not passing filtering thresholds based on per-SNP measures of mean maximum posterior probability (blue) and genotype call rate (red). A Q² value of 1 indicates that the current PLS model can perfectly predict whether a given genotype vector is of direct or imputed origin. A Q² of 0 indicates that the model has no power to predict the genotype's origin. (C and F) Mean r² between direct and imputed genotypes after the removal of SNPs not passing filtering thresholds based on per-SNP measures of mean maximum posterior probability (blue) and genotype call rate (red).
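The per-SNP quantities named in this caption can be illustrated with a minimal sketch. It assumes genotypes are coded 0/1/2 and that imputation output is one (P(AA), P(AB), P(BB)) triple per individual. The function names and the 0.9 calling threshold are hypothetical and are not taken from the paper, whose exact definitions may differ.

```python
from statistics import mean


def genotype_r2(direct, imputed):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    if len(direct) != len(imputed) or not direct:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean_d, mean_i = mean(direct), mean(imputed)
    var_d = sum((d - mean_d) ** 2 for d in direct)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    if var_d == 0 or var_i == 0:
        return 0.0  # a monomorphic vector carries no correlation information
    cov = sum((d - mean_d) * (i - mean_i) for d, i in zip(direct, imputed))
    return cov * cov / (var_d * var_i)


def mean_max_posterior(posteriors):
    """Average, over individuals, of the largest genotype posterior probability."""
    return mean(max(p) for p in posteriors)


def call_rate(posteriors, call_threshold=0.9):
    """Fraction of individuals whose best genotype reaches the calling threshold.

    The 0.9 default is an illustrative assumption, not a value from the paper.
    """
    return sum(max(p) >= call_threshold for p in posteriors) / len(posteriors)
```

A SNP would then be kept when `mean_max_posterior` or `call_rate` exceeds a chosen filter threshold, and `genotype_r2` checks how well the retained imputed calls agree with the direct genotypes.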

**Figure 3**
Mean Power to Detect Association to a Disease with a Fixed Baseline Sample Size
Mean power to detect association (α = 10⁻⁵) to a disease with a population prevalence of 0.0001 and a fixed baseline sample size across different genome-wide platforms (simulated under varying risk allele frequency [RAF] and sample size). RAF ranges are as follows: (A–C) 0.05 ≤ RAF < 0.10; (D–F) 0.10 ≤ RAF < 0.20; (G–I) 0.20 ≤ RAF ≤ 0.50. Cases and controls are as follows: (A, D, and G) 1,000 cases, 1,000 controls; (B, E, and H) 2,000 cases, 2,000 controls; (C, F, and I) 5,000 cases, 5,000 controls. Mean power was calculated after 10,000 simulations where sample size per simulation for each SNP set was weighted by the maximum r² between a randomly selected HapMap SNP (satisfying RAF constraints) and the SNPs on the given genotyping platform (with HapMap release 21 CEU data).
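The idea of weighting sample size by r² can be sketched as follows. This is not the authors' simulation procedure: it assumes a multiplicative risk model with a hypothetical allelic relative risk, a normal approximation to a two-sided allelic test, and a caller-supplied list of r² values standing in for the simulated draws.

```python
from statistics import NormalDist, mean

_STD_NORMAL = NormalDist()


def case_allele_frequency(raf, relative_risk):
    """Risk-allele frequency in cases under a multiplicative model (assumed)."""
    return relative_risk * raf / (relative_risk * raf + (1.0 - raf))


def allelic_test_power(raf, relative_risk, n_cases, n_controls, r2=1.0, alpha=1e-5):
    """Approximate power of a two-sided allelic test with sample size scaled by r2."""
    p_case = case_allele_frequency(raf, relative_risk)
    p_ctrl = raf  # with prevalence 0.0001, controls approximate the population
    alleles_case = 2.0 * n_cases * r2   # effective number of case alleles
    alleles_ctrl = 2.0 * n_controls * r2
    if alleles_case <= 0 or alleles_ctrl <= 0:
        return 0.0
    se = (p_case * (1 - p_case) / alleles_case
          + p_ctrl * (1 - p_ctrl) / alleles_ctrl) ** 0.5
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_effect = abs(p_case - p_ctrl) / se
    # Sum the rejection probability in both tails of the test.
    return _STD_NORMAL.cdf(z_effect - z_crit) + _STD_NORMAL.cdf(-z_effect - z_crit)


def mean_power(r2_values, **study):
    """Average power over a collection of r2 values (one per simulated causal SNP)."""
    return mean(allelic_test_power(r2=r2, **study) for r2 in r2_values)
```

For example, `mean_power([1.0, 0.8, 0.5], raf=0.2, relative_risk=1.5, n_cases=1000, n_controls=1000)` averages the power over three illustrative r² values. A platform with better coverage yields higher r² values and therefore higher mean power.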

**Figure 4**
Mean Power to Detect Association to a Disease Where Baseline Sample Size Has Been Varied across Genome-wide SNP Platforms to Reflect Relative Cost
Mean power to detect association (α = 10⁻⁵) to a disease with a population prevalence of 0.0001 where baseline sample size has been varied across genome-wide SNP platforms to reflect the genotyping cost per sample (sample-size ratios: SNP Array 5.0 = 1.22; HumanHap 300 = 1.32; HumanHap 550 = 1; SNP Array 6.0 = 0.99; HumanHap 1M = 0.57). RAF ranges are as follows: (A–C) 0.05 ≤ RAF < 0.10; (D–F) 0.10 ≤ RAF < 0.20; (G–I) 0.20 ≤ RAF ≤ 0.50. Cases and controls are as follows: (A, D, and G) 1,000 cases, 1,000 controls; (B, E, and H) 2,000 cases, 2,000 controls; (C, F, and I) 5,000 cases, 5,000 controls. Mean power was calculated after 10,000 simulations where sample size per simulation for each SNP set was weighted by the maximum r² between a randomly selected HapMap SNP (satisfying RAF constraints) and the SNPs on the given genotyping platform (with HapMap release 21 CEU data).
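The cost adjustment amounts to scaling the baseline sample size by each platform's ratio. The short sketch below uses the ratios quoted in the caption. The function name and the rounding to whole individuals are assumptions made for illustration.

```python
# Sample-size ratios quoted in the Figure 4 caption (HumanHap 550 is the reference).
SAMPLE_SIZE_RATIOS = {
    "SNP Array 5.0": 1.22,
    "HumanHap 300": 1.32,
    "HumanHap 550": 1.00,
    "SNP Array 6.0": 0.99,
    "HumanHap 1M": 0.57,
}


def cost_adjusted_sample_size(baseline, platform):
    """Number of individuals affordable on `platform` for the cost of `baseline`
    individuals on the HumanHap 550, rounded to a whole person (an assumption)."""
    return round(baseline * SAMPLE_SIZE_RATIOS[platform])


# Tabulate the cost-equivalent sample sizes for a baseline of 1,000 individuals.
for platform in SAMPLE_SIZE_RATIOS:
    print(platform, cost_adjusted_sample_size(1000, platform))
```

Under these ratios the HumanHap 300 allows the largest study for a fixed budget, which is the starting point for the abstract's recommendation to combine that platform with imputation.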
