Am J Hum Genet. 2022 Nov 3;109(11):1986-1997.

doi: 10.1016/j.ajhg.2022.09.009. Epub 2022 Oct 4.

MagicalRsq: Machine-learning-based genotype imputation quality calibration

Quan Sun¹, Yingxi Yang², Jonathan D Rosen³, Min-Zhi Jiang⁴, Jiawen Chen¹, Weifang Liu¹, Jia Wen³, Laura M Raffield³, Rhonda G Pace⁵, Yi-Hui Zhou⁶, Fred A Wright⁷, Scott M Blackman⁸, Michael J Bamshad⁹, Ronald L Gibson¹⁰, Garry R Cutting¹¹, Michael R Knowles⁵, Daniel R Schrider³, Christian Fuchsberger¹², Yun Li¹³

Affiliations

¹ Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
² Department of Statistics and Data Science, Yale University, New Haven, CT 06520, USA.
³ Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
⁴ Department of Applied Physical Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
⁵ Marsico Lung Institute/UNC CF Research Center, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
⁶ Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA.
⁷ Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA; Bioinformatics Research Center and Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
⁸ Division of Pediatric Endocrinology, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA.
⁹ Department of Pediatrics, University of Washington, Seattle, WA 98105, USA; Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.
¹⁰ Department of Pediatrics, University of Washington, Seattle, WA 98105, USA.
¹¹ Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA.
¹² Institute for Biomedicine, Eurac Research (affiliated with the University of Lübeck), Bolzano, Italy. Electronic address: cfuchsberger@eurac.edu.
¹³ Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA. Electronic address: yunli@med.unc.edu.

PMID: 36198314
PMCID: PMC9674945
DOI: 10.1016/j.ajhg.2022.09.009

Quan Sun et al. Am J Hum Genet. 2022.

Abstract

Whole-genome sequencing (WGS) is the gold standard for fully characterizing genetic variation but is still prohibitively expensive for large samples. To reduce costs, many studies sequence only a subset of individuals or genomic regions, and genotype imputation is used to infer genotypes for the remaining individuals or regions without sequencing data. However, not all variants can be well imputed, and the current state-of-the-art imputation quality metric, denoted as standard Rsq, is poorly calibrated for lower-frequency variants. Here, we propose MagicalRsq, a machine-learning-based method that integrates variant-level imputation and population genetics statistics to provide a better calibrated imputation quality metric. Leveraging WGS data from the Cystic Fibrosis Genome Project (CFGP) and whole-exome sequencing (WES) data from UK Biobank (UKB), we performed comprehensive experiments to evaluate the performance of MagicalRsq compared to standard Rsq for partially sequenced studies. We found that MagicalRsq aligns better with true R² than standard Rsq in almost every situation evaluated, for both European and African ancestry samples. For example, when applying models trained on 1,992 CFGP sequenced samples to an independent 3,103 samples with no sequencing but TOPMed imputation from array genotypes, MagicalRsq, compared to standard Rsq, achieved net gains of 1.4 million rare, 117k low-frequency, and 18k common variants, where net gain is the number of additional variants correctly distinguished (as well or poorly imputed) by MagicalRsq relative to standard Rsq. MagicalRsq can serve as an improved post-imputation quality metric and will benefit downstream analysis by better distinguishing well-imputed variants from those poorly imputed. MagicalRsq is freely available on GitHub.

Keywords: XGBoost; genotype imputation; imputation quality; machine learning; post-imputation quality control.
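
The net-gain figure quoted in the abstract can be illustrated with a small sketch. It assumes that a variant counts as "correctly distinguished" when a quality metric and true R² fall on the same side of a cutoff. The cutoff of 0.8, the function names, and the toy values are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def count_correct(metric: Sequence[float], true_r2: Sequence[float],
                  cutoff: float = 0.8) -> int:
    """Count variants where the metric and true R2 agree on the call
    (well imputed if >= cutoff, poorly imputed otherwise).
    The cutoff value is an assumption made for this sketch."""
    if len(metric) != len(true_r2):
        raise ValueError("metric and true_r2 must have the same length")
    return sum((m >= cutoff) == (t >= cutoff) for m, t in zip(metric, true_r2))


def net_gain(magical_rsq: Sequence[float], standard_rsq: Sequence[float],
             true_r2: Sequence[float], cutoff: float = 0.8) -> int:
    """Correctly distinguished variants under MagicalRsq minus those under Rsq."""
    return (count_correct(magical_rsq, true_r2, cutoff)
            - count_correct(standard_rsq, true_r2, cutoff))


# Toy values for four variants (made up for illustration only).
true_r2 = [0.95, 0.40, 0.85, 0.10]
standard_rsq = [0.90, 0.85, 0.60, 0.20]
magical_rsq = [0.92, 0.50, 0.82, 0.15]
print(net_gain(magical_rsq, standard_rsq, true_r2))  # prints 2
```

In practice the counts would presumably be computed separately within each frequency category (rare, low-frequency, common), as the abstract reports them.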

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

**Figure 1**
MagicalRsq workflow. MagicalRsq starts from the “training dataset array data” (the data used for imputation among training individuals) and performs imputation with these data, which yields the standard Rsq and estimated MAF for each marker in the training dataset. We then calculate the true R² by comparing imputed dosages with truth genotypes (established from additional genotype data in the training set). Combining external MAF and alternative allele count (AC), as well as population genetics summary statistics, with the above three metrics (i.e., standard Rsq, estimated MAF, and true R²), we train MagicalRsq models with XGBoost, building supervised models that predict true R² from all the other features. We then proceed to the testing dataset, where we follow the same imputation workflow, starting again from array genotype data and obtaining estimated MAF and standard Rsq after imputation. We calculate MagicalRsq in the testing dataset by plugging the predictor features into the MagicalRsq models built from the training dataset. Finally, we evaluate the performance of MagicalRsq (and Rsq) by comparing each with true R² in the testing dataset. Yellow highlights mark the components specific to the training dataset, light blue highlights mark the components specific to the testing dataset, green highlights mark external information used in both training and testing, and red rectangles mark the statistics used during the final evaluation and comparison of MagicalRsq and standard Rsq, with true R² as the gold standard.
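
The workflow in Figure 1 can be sketched roughly as follows. This is a minimal illustration and not the MagicalRsq implementation: it assumes true R² is the squared Pearson correlation between imputed dosages and truth genotypes across samples, and the feature names, the hyperparameters, and the clipping of predictions to [0, 1] are all hypothetical choices. The training and prediction functions need the third-party `xgboost` and `numpy` packages, which are imported only when those functions are called.

```python
from typing import Dict, List, Sequence


def squared_pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation. By convention of this sketch,
    returns 0.0 when either input has zero variance."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length inputs with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def true_r2(imputed_dosages: Sequence[float],
            truth_genotypes: Sequence[float]) -> float:
    """True R2 for one variant: imputed dosages vs. truth genotypes."""
    return squared_pearson(imputed_dosages, truth_genotypes)


# Hypothetical feature names standing in for standard Rsq, estimated MAF,
# external MAF, external allele count, and one population genetics statistic.
FEATURES = ["rsq", "est_maf", "ext_maf", "ext_ac", "popgen_stat"]


def to_matrix(variants: List[Dict[str, float]]) -> List[List[float]]:
    """Arrange per-variant feature dicts into rows in a fixed column order."""
    return [[v[name] for name in FEATURES] for v in variants]


def train_model(train_variants: List[Dict[str, float]],
                train_true_r2: Sequence[float]):
    """Fit a gradient-boosted regressor predicting true R2 from the features."""
    import numpy as np
    from xgboost import XGBRegressor

    # Hyperparameters are placeholders, not the values used in the paper.
    model = XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1)
    model.fit(np.asarray(to_matrix(train_variants)), np.asarray(train_true_r2))
    return model


def predict_calibrated_rsq(model, test_variants: List[Dict[str, float]]) -> List[float]:
    """Apply the trained model to test-set features; clipping to [0, 1]
    is an assumption of this sketch."""
    import numpy as np

    preds = model.predict(np.asarray(to_matrix(test_variants)))
    return [min(1.0, max(0.0, float(p))) for p in preds]
```

The same `squared_pearson` helper could also serve as the evaluation metric, computed across variants between a quality metric (Rsq or the calibrated prediction) and true R² in the testing dataset.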

**Figure 2**
Scenario 1, experiments 1–4: Training using even-numbered chromosomes and testing on odd-numbered chromosomes for CF 2k samples. (A and B) Performance comparison between Rsq and MagicalRsq in terms of squared Pearson correlation with true R² for (A) 1000G-based imputation and (B) TOPMed-based imputation. (C) Smooth scatterplot showing Rsq or MagicalRsq (x axis) calculated from both matched (second row) and mismatched (third row) models against true R² (y axis) for both 1000G-based (left) and TOPMed-based (right) imputation, for low-frequency variants on chromosome 13.
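
The even/odd chromosome design in Scenario 1 amounts to a simple partition of variants. The sketch below assumes each variant is a dict with a `chrom` key and that only autosomes 1 to 22 are used; both the record layout and the autosome restriction are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def chrom_number(chrom: str) -> int:
    """Parse labels such as 'chr13' or '13' into 13.
    Raises ValueError for anything that is not an autosome (1-22)."""
    label = chrom[3:] if chrom.lower().startswith("chr") else chrom
    number = int(label)  # raises ValueError for labels like 'X' or 'MT'
    if not 1 <= number <= 22:
        raise ValueError(f"not an autosome: {chrom}")
    return number


def split_even_odd(variants: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Return (training variants on even chromosomes,
    testing variants on odd chromosomes)."""
    train, test = [], []
    for variant in variants:
        if chrom_number(variant["chrom"]) % 2 == 0:
            train.append(variant)
        else:
            test.append(variant)
    return train, test
```

Splitting by chromosome keeps training and testing variants on different chromosomes, so no variant position appears in both sets.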

**Figure 3**
Scenario 2, experiments 9–12: Training models using 1,000 UKB AFR samples and testing on 2,960 independent UKB AFR samples, for all variants with WES available. (A and B) Performance comparison between Rsq and MagicalRsq in terms of squared Pearson correlation with true R² for (A) 1000G-based imputation and (B) TOPMed-based imputation. (C) Smooth scatterplot showing Rsq or MagicalRsq (x axis) calculated from both matched (second row) and mismatched (third row) models against true R² (y axis) for both 1000G-based (left) and TOPMed-based (right) imputation, for all low-frequency variants with WES available.

**Figure 4**
Scenario 2, experiment 14: Training models using randomly selected subsets of variants. The number of variants used for training varied from 10k to 1M. MagicalRsq models were built on the CF 2k samples and tested on the independent CF 3k samples. The experiment was repeated five times for each number of variants. Squared Pearson correlation with true R² served as the evaluation metric. The red dashed line denotes the performance of standard Rsq. nvar, number of variants included in model training.
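
The subsampling design in this experiment can be sketched as repeated random draws of training variants at each size. The function name, the seed handling, and sampling without replacement are assumptions of this sketch; the caption states only that subsets were randomly selected and that each size was repeated five times.

```python
import random
from typing import Dict, List, Sequence


def repeated_subsets(variant_ids: Sequence, sizes: Sequence[int],
                     n_repeats: int = 5, seed: int = 0) -> Dict[int, List[list]]:
    """For each requested size, draw n_repeats random subsets of variants
    (without replacement within a draw). A fixed seed keeps draws reproducible."""
    rng = random.Random(seed)
    draws: Dict[int, List[list]] = {}
    for size in sizes:
        if size > len(variant_ids):
            raise ValueError(f"cannot draw {size} of {len(variant_ids)} variants")
        draws[size] = [rng.sample(variant_ids, size) for _ in range(n_repeats)]
    return draws


# Toy usage with small numbers; the figure varies the size from 10k to 1M.
toy_draws = repeated_subsets(list(range(1000)), sizes=[10, 100])
print({size: len(subsets) for size, subsets in toy_draws.items()})
# prints {10: 5, 100: 5}
```

Each drawn subset would then be used to train one model, and the resulting predictions on the independent test samples compared with true R².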
