Impact of pre-imputation SNP-filtering on genotype imputation results

Nab Raj Roshyara, Holger Kirsten, Katrin Horn, Peter Ahnert, Markus Scholz

PMID: 25112433
PMCID: PMC4236550
DOI: 10.1186/s12863-014-0088-5

BMC Genet. 2014 Aug 12;15:88.

Abstract

Background: Imputation of partially missing or unobserved genotypes is an indispensable tool for SNP data analyses. However, the impact of initial SNP-data quality control on imputation results is still poorly understood. In this paper, we evaluate how different strategies of pre-imputation quality filtering affect the performance of the widely used imputation algorithms MaCH and IMPUTE2.

Results: We considered three scenarios: imputation of partially missing genotypes with an external reference panel, imputation of partially missing genotypes without an external reference panel, and imputation of completely untyped SNPs using an external reference panel. We first created various datasets by applying different SNP quality filters and masking certain percentages of randomly selected high-quality SNPs. We then imputed these SNPs and compared the results between the filtering scenarios using established and newly proposed measures of imputation quality. While the established measures assess the certainty of imputation results, our newly proposed measures focus on the agreement with true genotypes. These measures showed that pre-imputation SNP-filtering might be detrimental to imputation quality. Moreover, the strongest drivers of imputation quality were, in general, the burden of missingness and the number of SNPs used for imputation. We also found that using a reference panel always improves imputation quality of partially missing genotypes. MaCH performed slightly better than IMPUTE2 in most of our scenarios. Again, these results were more pronounced when using our newly defined measures of imputation quality.
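The mask-and-compare design described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `mask_genotypes` and `concordance`, the 0/1/2 genotype coding, the toy data, and the most-frequent-genotype stand-in for an imputation tool are all assumptions made for illustration. In the study itself, MaCH or IMPUTE2 would fill in the masked calls.

```python
import random


def mask_genotypes(genotypes, fraction, seed=0):
    """Hide a random fraction of genotype calls (hypothetical helper).

    Returns a copy with masked entries set to None, plus the masked indices.
    """
    rng = random.Random(seed)
    n_mask = round(fraction * len(genotypes))
    masked_idx = sorted(rng.sample(range(len(genotypes)), n_mask))
    masked = list(genotypes)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx


def concordance(true_genotypes, imputed_genotypes, masked_idx):
    """Fraction of masked positions where the imputed call equals the truth."""
    if not masked_idx:
        raise ValueError("no masked positions to evaluate")
    hits = sum(true_genotypes[i] == imputed_genotypes[i] for i in masked_idx)
    return hits / len(masked_idx)


# Toy genotypes coded as minor-allele counts (0, 1, 2); invented data.
true_g = [0, 1, 2, 1, 0, 0, 1, 2, 0, 1]
masked, idx = mask_genotypes(true_g, fraction=0.5, seed=42)

# Stand-in "imputation": fill each masked call with the most frequent
# observed genotype. A real analysis would call an imputation tool here.
observed = [g for g in masked if g is not None]
mode = max(set(observed), key=observed.count)
imputed = [mode if g is None else g for g in masked]

print(concordance(true_g, imputed, idx))
```

Because the true genotypes at masked positions are known, agreement can be measured directly instead of relying only on the certainty estimates reported by the imputation software.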

Conclusion: Even moderate filtering has a detrimental effect on imputation quality. Therefore, little or no SNP filtering prior to imputation appears to be the best strategy for imputing small to moderately sized datasets. Our results also showed that, for these datasets, MaCH performs slightly better than IMPUTE2 in most scenarios, at the cost of increased computing time.

Figures

**Figure 1**
**Venn diagram describing the intersection of SNP datasets filtered by different quality criteria.** Note that by definition, HQ is contained in every subset.

**Figure 2**
**Pairwise comparison of the analyzed measures of imputation quality.** Distributions and pairwise correlations are shown for the SEN scores obtained from MaCH (MaCH_SEN) and IMPUTE2 (IMPUTE2_SEN), the Hellinger scores obtained from MaCH (MaCH_HELLI) and IMPUTE2 (IMPUTE2_HELLI), the MaCH Rsq score (MaCH_Rsq), and the IMPUTE2 info score (IMPUTE2_INFO). We present the results for the scenario “Entire SNP imputation” without pre-filtering (“ALL”) with 50% missing SNPs. Values refer to the squared Pearson correlation.
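As a rough illustration of two quantities named in this caption, the Python sketch below computes a Hellinger-type score for a single imputed genotype and the squared Pearson correlation between two lists of scores. The exact score definition used in the paper is not given in this record, so the form here (one minus the Hellinger distance between the posterior genotype probabilities and a point mass on the true genotype) is an assumption, and both function names are hypothetical.

```python
import math


def hellinger_score(posterior, true_genotype):
    """Assumed score: 1 - Hellinger distance to the true genotype.

    posterior: probabilities (P(g=0), P(g=1), P(g=2)) from an imputation tool.
    true_genotype: the known genotype, coded 0, 1, or 2.
    """
    truth = [1.0 if g == true_genotype else 0.0 for g in range(3)]
    # Bhattacharyya coefficient between the two discrete distributions.
    bc = sum(math.sqrt(p * q) for p, q in zip(posterior, truth))
    distance = math.sqrt(max(0.0, 1.0 - bc))
    return 1.0 - distance


def squared_pearson(x, y):
    """Squared Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Invented posteriors for three imputed calls whose true genotype is 1.
posteriors = [(0.05, 0.90, 0.05), (0.40, 0.50, 0.10), (0.80, 0.15, 0.05)]
scores = [hellinger_score(p, 1) for p in posteriors]
print(scores)
```

Under this assumed definition, a posterior that puts all its mass on the true genotype scores 1 and one that puts none there scores 0, so the score rewards agreement with the truth rather than confidence alone.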
