Am J Hum Genet. 2009 Feb;84(2):235-50.
doi: 10.1016/j.ajhg.2009.01.013.

Genotype-imputation accuracy across worldwide human populations

Lucy Huang et al. Am J Hum Genet. 2009 Feb.

Abstract

A current approach to mapping complex-disease-susceptibility loci in genome-wide association (GWA) studies involves leveraging the information in a reference database of dense genotype data. By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and tested for disease association. This imputation strategy has been successful for GWA studies in populations well represented by existing reference panels. We used genotypes at 513,008 autosomal single-nucleotide polymorphism (SNP) loci in 443 unrelated individuals from 29 worldwide populations to evaluate the "portability" of the HapMap reference panels for imputation in studies of diverse populations. When a single HapMap panel was leveraged for imputation of randomly masked genotypes, European populations had the highest imputation accuracy, followed by populations from East Asia, Central and South Asia, the Americas, Oceania, the Middle East, and Africa. For each population, we identified "optimal" mixtures of reference panels that maximized imputation accuracy, and we found that in most populations, mixtures including individuals from at least two HapMap panels produced the highest imputation accuracy. From a separate survey of additional SNPs typed in the same samples, we evaluated imputation accuracy in the scenario in which all genotypes at a given SNP position were unobserved and were imputed on the basis of data from a commercial "SNP chip," again finding that most populations benefited from the use of combinations of two or more HapMap reference panels. Our results can serve as a guide for selecting appropriate reference panels for imputation-based GWA analysis in diverse populations.
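The masked-genotype evaluation described above can be sketched in a few lines. This is only an illustration under stated assumptions: genotypes are coded 0, 1, or 2; accuracy is taken to be the fraction of masked genotypes recovered exactly (the abstract does not state the exact metric); and `impute_most_common` is a hypothetical placeholder imputer, not the linkage-disequilibrium-based method used in the study.

```python
import random
from collections import Counter

def mask_genotypes(genotypes, missing_rate, rng):
    """Copy the matrix (rows = individuals, columns = SNPs) and hide a random subset of entries."""
    masked = [row[:] for row in genotypes]
    positions = [(i, j) for i, row in enumerate(genotypes) for j in range(len(row))]
    hidden = rng.sample(positions, round(missing_rate * len(positions)))
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden

def impute_most_common(masked):
    """Placeholder imputer (assumption): fill gaps with the most common observed genotype per SNP."""
    imputed = [row[:] for row in masked]
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        fill = Counter(observed).most_common(1)[0][0] if observed else 0
        for row in imputed:
            if row[j] is None:
                row[j] = fill
    return imputed

def imputation_accuracy(truth, imputed, hidden):
    """Fraction of masked genotypes that were recovered exactly (assumed metric)."""
    return sum(truth[i][j] == imputed[i][j] for i, j in hidden) / len(hidden)

# Toy data: 6 individuals x 20 SNPs, with 15% of genotypes masked at random.
rng = random.Random(0)
truth = [[rng.choice([0, 1, 2]) for _ in range(20)] for _ in range(6)]
masked, hidden = mask_genotypes(truth, 0.15, rng)
imputed = impute_most_common(masked)
accuracy = imputation_accuracy(truth, imputed, hidden)
print(f"masked {len(hidden)} genotypes, accuracy = {accuracy:.2f}")
```

Swapping the placeholder for a real imputation tool, with or without a reference panel, leaves the masking and scoring steps unchanged.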

Figures

Figure 1. Schematic of Experimental Designs
The “Study sample” row represents data used in evaluating imputation accuracy in each design, with SNPs under consideration colored yellow. The “Study sample with missing genotypes” row represents the corresponding data, with the unknown genotypes that are imputed colored red. The “Reference panel” row represents example reference panels on which imputation of missing genotypes or genotypes of untyped markers is based. In a data set, each row corresponds to a haplotype and each column corresponds to a SNP position.
(A) Inference of missing genotypes, without additional reference haplotypes.
(B) Inference of missing genotypes, with a reference panel of haplotypes from a single reference sample (CEU, YRI, or CHB+JPT).
(C) Inference of missing genotypes, with a mixture reference panel, formed by taking a specified ratio of haplotypes from the HapMap CEU, YRI, and CHB+JPT samples.
(D) Inference of genotypes of untyped markers, with a mixture reference panel, formed by aggregating two or more HapMap samples.
We evaluated imputation accuracy in (A–C) for randomly masked genotypes, and in (D) for genotypes of untyped markers.
Figure 2. Imputation Accuracy versus Proportion of Missing Genotypes, in Each of 29 Populations
This analysis was based on samples of six individuals per population, and it did not use any reference panel.
Figure 3. Imputation Accuracy versus Sample Size, in Each of 29 Populations
This analysis used a proportion of missing genotypes equal to 15% and did not use any reference panel.
Figure 4. Imputation Accuracy versus Reference-Panel Size, in Each of 29 Populations, Given a Proportion of Missing Genotypes Equal to 15%
To obtain comparable results, we used the entire HapMap YRI and CEU samples but only 120 of 180 HapMap CHB+JPT reference haplotypes. The rightmost column of “maximal” imputation accuracy represents the highest accuracy achieved by one of the HapMap reference panels, taken pointwise. Populations are color-coded and symbol-coded in the same manner as in Figure 3.
Figure 5. The Maximal Imputation Accuracy Achieved by One of the Three HapMap Reference Panels, in Each of 29 Populations, Given a Proportion of Missing Genotypes Equal to 15%
This plot corresponds to the imputation accuracy obtained with a reference-panel size of 120 haplotypes, shown in the rightmost column (MAX) of Figure 4. For convenience in interpreting the figure, the vertical dashed line indicates 90% imputation accuracy.
Figure 6. Imputation Accuracy in Each of 29 Populations Achieved by Utilizing Mixtures of HapMap Samples Chosen According to Specified Ratios
Each triangle represents imputation accuracy, for a given population, based on various mixtures of HapMap reference panels. The vertices of a triangle represent imputation accuracy based on single HapMap groups, whereas the edges and interior points represent imputation accuracy attained by the use of mixtures of HapMap reference panels. Darker colors indicate higher imputation accuracy; a darkened circle indicates the maximal imputation accuracy for a population. The spacing of the cutoffs for the various colors was set so that across all 29 populations, each color would be used equally often.
The set of mixtures corresponded to the set of vectors (i1, i2, i3) of nonnegative integers, with i1 + i2 + i3 = 7. For each vector, we used as the reference panel the largest possible mixture sample that consisted of a1, a2, and a3 HapMap CHB+JPT, CEU, and YRI individuals, respectively, and that satisfied a1:a2:a3 = i1:i2:i3. Corresponding numbers of HapMap haplotypes in the mixtures, (a1, a2, a3), are shown in the larger triangle. Imputation accuracy was evaluated with the use of only chromosome 2, with a proportion of missing genotypes equal to 15%.
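The mixture grid in this caption can be enumerated directly. The sketch below lists every vector (i1, i2, i3) of nonnegative integers summing to 7 and, for each, finds the largest sample whose composition matches the ratio exactly. The panel sizes in individuals (90 CHB+JPT, 60 CEU, 60 YRI) are an assumption inferred from the haplotype counts in the Figure 4 caption, and requiring an exact integer ratio is one reading of "largest possible mixture sample"; the function names are illustrative.

```python
from math import gcd

# Assumed panel sizes in individuals, ordered as (CHB+JPT, CEU, YRI).
PANEL_SIZES = (90, 60, 60)

def ratio_vectors(total=7):
    """All (i1, i2, i3) of nonnegative integers with i1 + i2 + i3 == total."""
    return [(i1, i2, total - i1 - i2)
            for i1 in range(total + 1)
            for i2 in range(total + 1 - i1)]

def largest_mixture(ratio, sizes=PANEL_SIZES):
    """Largest (a1, a2, a3) with a1:a2:a3 equal to the ratio and each a_j within its panel size."""
    g = 0
    for r in ratio:
        g = gcd(g, r)                      # reduce the ratio to lowest terms
    reduced = tuple(r // g for r in ratio)
    # The scale factor is limited by whichever panel runs out first.
    k = min(n // r for r, n in zip(reduced, sizes) if r > 0)
    return tuple(k * r for r in reduced)

vectors = ratio_vectors()
for vec in vectors[:5]:
    individuals = largest_mixture(vec)
    haplotypes = tuple(2 * a for a in individuals)  # two haplotypes per individual
    print(vec, individuals, haplotypes)
```

Under these assumptions the grid has 36 mixtures, including the three single-panel vertices of each triangle.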
Figure 7. Imputation Accuracy for Inference of Genotypes of Untyped Markers, Based on One, Two, or All Three HapMap Reference Panels
The plot on the left shows imputation accuracy based on each of seven reference-panel choices (the three single panels, the three pairs, and all three panels combined). The bar plot on the right represents the maximal imputation accuracy among the seven choices, and it is colored according to the choice of optimal reference panel. For convenience in interpreting the figure, the vertical dashed line indicates 90% imputation accuracy. Each HapMap panel was used with its original size.
Figure 8. Squared Correlation Coefficient, r², between the Genotypes Imputed from the Data of Jakobsson et al. [19] and Those Directly Measured in the Data of Pemberton et al. [21], Based on One, Two, or All Three HapMap Reference Panels
The plot on the left shows r² based on each of seven choices. The bar plot on the right represents the maximal r² among the seven choices and is colored according to the choice of optimal reference panel. For convenience in interpreting the figure, the vertical dashed line indicates a squared correlation coefficient of 0.9. Each HapMap panel was used with its original size.
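As a rough illustration of the statistic in this caption, the sketch below computes the squared Pearson correlation between imputed and directly measured genotypes at one marker. It assumes genotypes are coded as 0, 1, or 2 copies of an allele; the coding, the function name, and the toy values are assumptions for illustration, not details taken from the paper.

```python
def squared_correlation(imputed, measured):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(imputed)
    mean_x = sum(imputed) / n
    mean_y = sum(measured) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, measured))
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    syy = sum((y - mean_y) ** 2 for y in measured)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either vector has no variation
    return (sxy * sxy) / (sxx * syy)

# Toy example: one of four imputed genotypes disagrees with the measured value.
r2 = squared_correlation([0, 1, 2, 1], [0, 1, 2, 2])
print(f"r^2 = {r2:.3f}")
```

Unlike a simple match rate, r² is undefined at markers with no variation, so monomorphic markers need separate handling.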
Figure 9. Imputation Accuracy for Genotypes at Untyped Markers in the Jakobsson et al. [19] Data with Minor-Allele Frequency > 0.2 versus Imputation Accuracy for Genotypes at Untyped Markers with Minor-Allele Frequency ≤ 0.2
For a given population, we separated markers into two categories on the basis of their minor-allele frequency (MAF) in the population, on average placing 220 markers into the lower-MAF category and 293 into the higher-MAF category. Using the imputed genotypes described in Figures 7 and 8 for each of the seven reference-panel choices, we determined the imputation accuracy, separately restricting our attention to low-MAF markers and high-MAF markers. For each population, the highest of these seven numbers for the high-MAF markers is plotted on the y axis and the highest of these seven numbers for the low-MAF markers is plotted on the x axis (in some cases, the underlying optimal reference panel differed for the high-MAF and low-MAF markers). The diagonal dashed line indicates identical imputation accuracy for the two MAF categories. The difference between the imputation accuracy of the low-MAF markers and that of the high-MAF markers is plotted in Figure S3.
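The split of markers into the two MAF categories could look like the following sketch. It assumes each marker's genotypes are coded as 0, 1, or 2 copies of one allele, with `None` for missing calls; the marker names and genotype values are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as allele counts (0, 1, 2); None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    p = sum(called) / (2 * len(called))   # frequency of the counted allele
    return min(p, 1 - p)                  # the minor allele is the rarer of the two

def split_by_maf(marker_genotypes, threshold=0.2):
    """Return (low, high) marker lists: MAF <= threshold versus MAF > threshold."""
    low, high = [], []
    for name, genotypes in marker_genotypes.items():
        if minor_allele_frequency(genotypes) > threshold:
            high.append(name)
        else:
            low.append(name)
    return low, high

# Hypothetical markers typed in five individuals of one population.
markers = {
    "marker_a": [0, 0, 0, 1, 0],     # counted-allele frequency 0.1 -> MAF 0.1
    "marker_b": [1, 1, 2, 0, 1],     # counted-allele frequency 0.5 -> MAF 0.5
    "marker_c": [2, 2, 2, 2, None],  # monomorphic among called genotypes -> MAF 0.0
}
low, high = split_by_maf(markers)
print("low-MAF:", low, "high-MAF:", high)
```

Because MAF is computed within each population, the same marker can fall into different categories in different populations.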

References

    1. International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320.
    2. International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861.
    3. Li Y., Ding J., Abecasis G.R. Mach 1.0: Rapid haplotype reconstruction and missing genotype inference. Am. J. Hum. Genet. 2006;79:S2290.
    4. Nicolae D.L. Testing untyped alleles (TUNA) - applications to genome-wide association studies. Genet. Epidemiol. 2006;30:718–727.
    5. Marchini J., Howie B., Myers S., McVean G., Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 2007;39:906–913.
