Genotype-imputation accuracy across worldwide human populations

Lucy Huang¹, Yun Li, Andrew B Singleton, John A Hardy, Gonçalo Abecasis, Noah A Rosenberg, Paul Scheet

PMID: 19215730
PMCID: PMC2668016
DOI: 10.1016/j.ajhg.2009.01.013

Am J Hum Genet. 2009 Feb;84(2):235-50.

Affiliation

¹ Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA. hlucy@umich.edu

Abstract

A current approach to mapping complex-disease-susceptibility loci in genome-wide association (GWA) studies involves leveraging the information in a reference database of dense genotype data. By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and tested for disease association. This imputation strategy has been successful for GWA studies in populations well represented by existing reference panels. We used genotypes at 513,008 autosomal single-nucleotide polymorphism (SNP) loci in 443 unrelated individuals from 29 worldwide populations to evaluate the "portability" of the HapMap reference panels for imputation in studies of diverse populations. When a single HapMap panel was leveraged for imputation of randomly masked genotypes, European populations had the highest imputation accuracy, followed by populations from East Asia, Central and South Asia, the Americas, Oceania, the Middle East, and Africa. For each population, we identified "optimal" mixtures of reference panels that maximized imputation accuracy, and we found that in most populations, mixtures including individuals from at least two HapMap panels produced the highest imputation accuracy. From a separate survey of additional SNPs typed in the same samples, we evaluated imputation accuracy in the scenario in which all genotypes at a given SNP position were unobserved and were imputed on the basis of data from a commercial "SNP chip," again finding that most populations benefited from the use of combinations of two or more HapMap reference panels. Our results can serve as a guide for selecting appropriate reference panels for imputation-based GWA analysis in diverse populations.

Figures

**Figure 1**
Schematic of Experimental Designs. The “Study sample” row represents data used in evaluating imputation accuracy in each design, with SNPs under consideration colored yellow. The “Study sample with missing genotypes” row represents corresponding data, with the unknown genotypes that are imputed colored in red. The “Reference panel” row represents example reference panels on which imputation of missing genotypes or genotypes of untyped markers is based. In a data set, each row corresponds to a haplotype and each column corresponds to a SNP position. (A) Inference of missing genotypes, without additional reference haplotypes. (B) Inference of missing genotypes, with a reference panel of haplotypes from a single reference sample (CEU, YRI, or CHB+JPT). (C) Inference of missing genotypes, with a mixture reference panel, formed by the taking of a specified ratio of haplotypes from the HapMap CEU, YRI, and CHB+JPT samples. (D) Inference of genotypes of untyped markers, with a mixture reference panel, formed by the aggregation of two or more HapMap samples. We evaluated imputation accuracy in (A–C) for randomly masked genotypes, and in (D) for genotypes of untyped markers.
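
The random-masking evaluation used in designs (A–C) can be illustrated with a minimal Python sketch. Several things here are assumptions rather than details from the paper: genotypes are coded as allele counts (0, 1, 2) with one row per individual, accuracy is taken to be the fraction of masked genotypes recovered exactly, and `impute_most_common` is a deliberately naive placeholder, not the imputation software used in the study. All function names are hypothetical.

```python
import random
from collections import Counter


def mask_genotypes(genotypes, proportion, seed=0):
    """Return a copy with `proportion` of entries set to None, plus the masked cells."""
    rng = random.Random(seed)
    cells = [(r, c) for r, row in enumerate(genotypes) for c in range(len(row))]
    masked = rng.sample(cells, round(proportion * len(cells)))
    copy = [list(row) for row in genotypes]
    for r, c in masked:
        copy[r][c] = None
    return copy, masked


def impute_most_common(masked_genotypes):
    """Placeholder imputer: fill each gap with that SNP's most common observed genotype."""
    filled = [list(row) for row in masked_genotypes]
    for c in range(len(filled[0])):
        observed = [row[c] for row in masked_genotypes if row[c] is not None]
        # Fall back to genotype 0 if every entry at this SNP was masked.
        fill = Counter(observed).most_common(1)[0][0] if observed else 0
        for row in filled:
            if row[c] is None:
                row[c] = fill
    return filled


def concordance(truth, imputed, masked):
    """Fraction of masked genotypes that were imputed exactly (assumed accuracy measure)."""
    hits = sum(truth[r][c] == imputed[r][c] for r, c in masked)
    return hits / len(masked)


# Toy data: 10 individuals by 20 SNPs, with 15% of genotypes masked.
truth = [[0, 1, 2, 0] * 5 for _ in range(10)]
with_gaps, masked = mask_genotypes(truth, proportion=0.15)
accuracy = concordance(truth, impute_most_common(with_gaps), masked)
```

Because only the masked cells are scored, the measure reflects how well the hidden genotypes are recovered, independent of how many genotypes were observed.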

**Figure 2**
Imputation Accuracy versus Proportion of Missing Genotypes, in Each of 29 Populations. This analysis was based on samples of six individuals per population and did not use any reference panel.

**Figure 3**
Imputation Accuracy versus Sample Size, in Each of 29 Populations. This analysis used a proportion of missing genotypes equal to 15% and did not use any reference panel.

**Figure 4**
Imputation Accuracy versus Reference-Panel Size, in Each of 29 Populations, Given a Proportion of Missing Genotypes Equal to 15%. To obtain comparable results, we used the entire HapMap YRI and CEU samples but only 120 of 180 HapMap CHB+JPT reference haplotypes. The rightmost column of “maximal” imputation accuracy represents the highest accuracy achieved by one of the HapMap reference panels, taken pointwise. Populations are color-coded and symbol-coded in the same manner as in Figure 3.

**Figure 5**
The Maximal Imputation Accuracy Achieved by One of the Three HapMap Reference Panels, in Each of 29 Populations, Given a Proportion of Missing Genotypes Equal to 15%. This plot corresponds to the imputation accuracy obtained with a reference-panel size of 120 haplotypes, shown in the rightmost column (MAX) of Figure 4. For convenience in interpreting the figure, the vertical dashed line indicates 90% imputation accuracy.

**Figure 6**
Imputation Accuracy in Each of 29 Populations Achieved by Utilizing Mixtures of HapMap Samples Chosen According to Specified Ratios. Each triangle represents imputation accuracy, for a given population, based on various mixtures of HapMap reference panels. The vertices of a triangle represent imputation accuracy based on single HapMap groups, whereas the edges and interior points represent imputation accuracy attained by the use of mixtures of HapMap reference panels. Darker colors indicate higher imputation accuracy; a darkened circle indicates the maximal imputation accuracy for a population. The spacing of the cutoffs for the various colors was set so that across all 29 populations, each color would be used equally often. The set of mixtures corresponded to the set of vectors (i₁, i₂, i₃) of nonnegative integers, with i₁ + i₂ + i₃ = 7. For each vector, we used as the reference panel the largest possible mixture sample that consisted of a₁, a₂, and a₃ HapMap CHB+JPT, CEU, and YRI individuals, respectively, and that satisfied a₁:a₂:a₃ = i₁:i₂:i₃. Corresponding numbers of HapMap haplotypes in the mixtures, (a₁, a₂, a₃), are shown in the larger triangle. Imputation accuracy was evaluated with the use of only chromosome 2, with a proportion of missing genotypes equal to 15%.
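
The mixture construction in the Figure 6 caption can be sketched in Python as follows. This is one reading of the caption, not the authors' code. It assumes panel sizes of 90 CHB+JPT, 60 CEU, and 60 YRI individuals (inferred from the 180, 120, and 120 haplotypes mentioned in the Figure 4 caption), and it assumes that "largest possible" means the largest integer multiple of the reduced ratio that fits within every panel. The names `PANEL_SIZES`, `mixture_vectors`, and `largest_mixture` are hypothetical.

```python
from math import gcd

# Assumed panel sizes in individuals, ordered as (CHB+JPT, CEU, YRI).
PANEL_SIZES = (90, 60, 60)


def mixture_vectors(total=7):
    """All vectors (i1, i2, i3) of nonnegative integers with i1 + i2 + i3 = total."""
    return [
        (i1, i2, total - i1 - i2)
        for i1 in range(total + 1)
        for i2 in range(total + 1 - i1)
    ]


def largest_mixture(ratio, sizes=PANEL_SIZES):
    """Largest (a1, a2, a3) with a1:a2:a3 equal to `ratio` and a_j <= sizes[j]."""
    g = gcd(gcd(ratio[0], ratio[1]), ratio[2])
    unit = tuple(i // g for i in ratio)  # ratio reduced to lowest terms
    # The scaling factor is limited by whichever panel runs out first.
    k = min(n // u for n, u in zip(sizes, unit) if u > 0)
    return tuple(k * u for u in unit)


vectors = mixture_vectors()
mixtures = {v: largest_mixture(v) for v in vectors}
```

Under these assumptions there are 36 candidate mixtures, and a ratio such as (1, 1, 5) yields 12, 12, and 60 individuals, because the YRI panel is exhausted first.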

**Figure 7**
Imputation Accuracy for Inference of Genotypes of Untyped Markers, Based on One, Two, or All Three HapMap Reference Panels. The plot on the left shows imputation accuracy based on each of seven choices. The bar plot on the right represents the maximal imputation accuracy among the seven choices, and it is colored according to the choice of optimal reference panel. For convenience in interpreting the figure, the vertical dashed line indicates 90% imputation accuracy. Each HapMap panel was used with its original size.

**Figure 8**
Squared Correlation Coefficient, r², between the Genotypes Imputed from the Data of Jakobsson et al.¹⁹ and Those Directly Measured in the Data of Pemberton et al.,²¹ Based on One, Two, or All Three HapMap Reference Panels. The plot on the left shows r² based on each of seven choices. The bar plot on the right represents the maximal r² among the seven choices and is colored according to the choice of optimal reference panel. For convenience in interpreting the figure, the vertical dashed line indicates a squared correlation coefficient of 0.9. Each HapMap panel was used with its original size.
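
For readers unfamiliar with the r² measure in Figure 8, the following Python sketch computes a squared Pearson correlation between imputed and directly measured genotypes at a single marker. It assumes genotypes are coded as allele counts (0, 1, 2); the caption does not say whether best-guess genotypes or expected allele counts were used, so that coding is an assumption, and `squared_correlation` is a hypothetical helper name.

```python
import math


def squared_correlation(imputed, measured):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(imputed)
    mean_x = sum(imputed) / n
    mean_y = sum(measured) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, measured))
    var_x = sum((x - mean_x) ** 2 for x in imputed)
    var_y = sum((y - mean_y) ** 2 for y in measured)
    if var_x == 0 or var_y == 0:
        # r² is undefined when either vector has no variation (monomorphic marker).
        return math.nan
    return cov * cov / (var_x * var_y)


# Toy genotypes for six individuals at one marker; one genotype is imputed wrongly.
measured = [0, 1, 2, 1, 0, 2]
imputed = [0, 1, 2, 1, 1, 2]
r2 = squared_correlation(imputed, measured)
```

Unlike a simple match rate, r² is scaled by the variance of the genotypes at the marker, so a marker where nearly all individuals share one genotype cannot score well simply because that genotype is always guessed.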

**Figure 9**
Imputation Accuracy for Genotypes at Untyped Markers in the Jakobsson et al.¹⁹ Data with Minor-Allele Frequency > 0.2 versus Imputation Accuracy for Genotypes at Untyped Markers with Minor-Allele Frequency ≤ 0.2. For a given population, we separated markers into two categories on the basis of their minor-allele frequency (MAF) in the population, on average placing 220 markers into the lower-MAF category and 293 into the higher-MAF category. Using the imputed genotypes described in Figures 7 and 8 for each of the seven reference-panel choices, we determined the imputation accuracy, separately restricting our attention to low-MAF markers and high-MAF markers. For each population, the highest of these seven numbers for the high-MAF markers is plotted on the y axis and the highest of these seven numbers for the low-MAF markers is plotted on the x axis (in some cases, the underlying optimal reference panel differed for the high-MAF and low-MAF markers). The diagonal dashed line indicates identical imputation accuracy for the two MAF categories. The difference between the imputation accuracy of the low-MAF markers and that of the high-MAF markers is plotted in Figure S3.

References

1. International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320.
2. International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861.
3. Li Y., Ding J., Abecasis G.R. Mach 1.0: Rapid haplotype reconstruction and missing genotype inference. Am. J. Hum. Genet. 2006;79:S2290.
4. Nicolae D.L. Testing untyped alleles (TUNA) - applications to genome-wide association studies. Genet. Epidemiol. 2006;30:718–727.
5. Marchini J., Howie B., Myers S., McVean G., Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 2007;39:906–913.
