Am J Hum Genet. 2022 Nov 3;109(11):1986-1997.
doi: 10.1016/j.ajhg.2022.09.009. Epub 2022 Oct 4.

MagicalRsq: Machine-learning-based genotype imputation quality calibration

Quan Sun et al. Am J Hum Genet.

Abstract

Whole-genome sequencing (WGS) is the gold standard for fully characterizing genetic variation but is still prohibitively expensive for large samples. To reduce costs, many studies sequence only a subset of individuals or genomic regions, and genotype imputation is used to infer genotypes for the remaining individuals or regions without sequencing data. However, not all variants can be well imputed, and the current state-of-the-art imputation quality metric, denoted as standard Rsq, is poorly calibrated for lower-frequency variants. Here, we propose MagicalRsq, a machine-learning-based method that integrates variant-level imputation and population genetics statistics to provide a better-calibrated imputation quality metric. Leveraging WGS data from the Cystic Fibrosis Genome Project (CFGP) and whole-exome sequencing (WES) data from UK BioBank (UKB), we performed comprehensive experiments to evaluate the performance of MagicalRsq compared to standard Rsq for partially sequenced studies. We found that MagicalRsq aligns better with true R2 than standard Rsq in almost every situation evaluated, for both European and African ancestry samples. For example, when applying models trained on 1,992 CFGP sequenced samples to 3,103 independent samples with no sequencing data but with TOPMed-based imputation from array genotypes, MagicalRsq, compared to standard Rsq, achieved net gains of 1.4 million rare, 117k low-frequency, and 18k common variants, where net gain is the number of additional variants that MagicalRsq correctly distinguishes relative to standard Rsq. MagicalRsq can serve as an improved post-imputation quality metric and will benefit downstream analysis by better distinguishing well-imputed variants from poorly imputed ones. MagicalRsq is freely available on GitHub.
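The net-gain comparison described in the abstract can be illustrated with a minimal sketch. It assumes, for illustration only, that a variant counts as "correctly distinguished" when a quality metric and true R2 fall on the same side of a single cutoff; the cutoff of 0.8 and the function names are hypothetical and are not taken from the MagicalRsq software.

```python
from typing import Sequence


def correctly_distinguished(metric: Sequence[float],
                            true_r2: Sequence[float],
                            cutoff: float = 0.8) -> int:
    """Count variants where the metric and true R2 agree on the cutoff.

    Assumption: "well imputed" means value >= cutoff (0.8 is illustrative).
    """
    if len(metric) != len(true_r2):
        raise ValueError("metric and true_r2 must have the same length")
    return sum((m >= cutoff) == (t >= cutoff) for m, t in zip(metric, true_r2))


def net_gain(magical_rsq: Sequence[float],
             standard_rsq: Sequence[float],
             true_r2: Sequence[float],
             cutoff: float = 0.8) -> int:
    """Correctly distinguished variants under MagicalRsq minus those under Rsq."""
    return (correctly_distinguished(magical_rsq, true_r2, cutoff)
            - correctly_distinguished(standard_rsq, true_r2, cutoff))


# Toy values for four variants (made up for illustration, not study data).
toy_true_r2 = [0.90, 0.85, 0.30, 0.50]
toy_standard = [0.60, 0.90, 0.85, 0.40]
toy_magical = [0.85, 0.90, 0.40, 0.45]
print(net_gain(toy_magical, toy_standard, toy_true_r2))  # 2
```

In this toy example standard Rsq misclassifies one well-imputed and one poorly imputed variant, so the net gain of the recalibrated metric is 2. In practice the count would be computed separately per frequency category (rare, low-frequency, common).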

Keywords: XGBoost; genotype imputation; imputation quality; machine learning; post-imputation quality control.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Figure 1
MagicalRsq workflow. MagicalRsq starts from "training dataset array data" (the data used for imputation among training individuals) and performs imputation using these data, which yields standard Rsq and estimated MAF for each marker in the training dataset. We then calculate the true R2 by comparing imputed dosages with truth genotypes (established by additional genotype data in the training set). Combining external MAF and alternative allele count (AC), as well as population genetics summary statistics, with the above three metrics (i.e., standard Rsq, estimated MAF, and true R2), we train MagicalRsq models using the XGBoost method, building supervised models that predict true R2 from all the other features. We then proceed to the testing dataset, where we follow the same imputation workflow, starting again from array genotype data and obtaining estimated MAF and standard Rsq after imputation. We calculate MagicalRsq in the testing dataset by plugging the predictor features into the MagicalRsq models built from the training dataset. Finally, we evaluate the performance of MagicalRsq (and Rsq) by comparing with true R2 in the testing dataset. Yellow highlights represent all the instruments specific to the training dataset, light blue highlights represent the instruments specific to the testing dataset, green highlights represent external information used in both training and testing, and red rectangles represent statistics used during final evaluation and comparison of MagicalRsq and standard Rsq, using true R2 as the gold standard.
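The "true R2" step of the workflow above can be sketched for a single variant. This sketch assumes that true R2 is the squared Pearson correlation between imputed dosages and truth genotypes across individuals; the function name and the handling of monomorphic variants are illustrative choices, not the authors' implementation.

```python
import math
from typing import Sequence


def true_r2(dosages: Sequence[float], genotypes: Sequence[float]) -> float:
    """Squared Pearson correlation between imputed dosages and truth genotypes.

    Both inputs hold one value per individual for a single variant
    (dosages in [0, 2], genotypes coded 0/1/2). Returns NaN when either
    input has zero variance, because the correlation is then undefined.
    """
    n = len(dosages)
    if n != len(genotypes) or n < 2:
        raise ValueError("need two equal-length inputs with at least 2 samples")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return math.nan
    return (cov * cov) / (var_d * var_g)


# Toy dosages and genotypes for six individuals (made up for illustration).
toy_dosages = [0.1, 0.9, 1.8, 0.2, 1.1, 0.0]
toy_genotypes = [0, 1, 2, 0, 1, 0]
r2 = true_r2(toy_dosages, toy_genotypes)
```

In the workflow described in the caption, this per-variant value would serve as the supervised target, with standard Rsq, estimated MAF, external MAF, AC, and the population genetics statistics as predictors.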
Figure 2
Scenario 1, experiments 1–4: Training using even-numbered chromosomes and testing on odd-numbered chromosomes for CF 2k samples. (A and B) Performance comparison between Rsq and MagicalRsq in terms of squared Pearson correlation with true R2 for (A) 1000G-based imputation and (B) TOPMed-based imputation. (C) Smooth scatterplot showing Rsq or MagicalRsq (x axis) calculated from both matched (second row) and mismatched (third row) models against true R2 (y axis) for both 1000G-based (left) and TOPMed-based (right) imputation, for low-frequency variants on chromosome 13.
Figure 3
Scenario 2, experiments 9–12: Training models using 1,000 UKB AFR samples and testing on 2,960 independent UKB AFR samples, for all variants with WES available. (A and B) Performance comparison between Rsq and MagicalRsq in terms of squared Pearson correlation with true R2 for (A) 1000G-based imputation and (B) TOPMed-based imputation. (C) Smooth scatterplot showing Rsq or MagicalRsq (x axis) calculated from both matched (second row) and mismatched (third row) models against true R2 (y axis) for both 1000G-based (left) and TOPMed-based (right) imputation, for all low-frequency variants with WES available.
Figure 4
Scenario 2, experiment 14: Training models using randomly selected subsets of variants. The number of variants used for training varied from 10k to 1 million. MagicalRsq models were built based on CF 2k samples and tested on the independent CF 3k samples. Each training-set size was repeated 5 times. Squared Pearson correlation with true R2 was calculated and served as the evaluation metric. The red dashed line denotes the performance of standard Rsq. nvar, number of variants included in model training.
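The repeated-subsampling design in this caption can be sketched as follows. The names `repeated_subset_scores` and `fit_and_score` are hypothetical: `fit_and_score` stands in for whatever routine trains a model on a subset of variants and returns its evaluation score (for example, squared Pearson correlation with true R2 in the test set), and the seed handling is an assumption rather than the authors' procedure.

```python
import random
from typing import Callable, Dict, List, Sequence


def repeated_subset_scores(
    variant_ids: Sequence[str],
    subset_sizes: Sequence[int],
    fit_and_score: Callable[[List[str]], float],
    n_repeats: int = 5,
    seed: int = 0,
) -> Dict[int, List[float]]:
    """Score models trained on random variant subsets of several sizes.

    For each size nvar, draw n_repeats random subsets (without replacement
    within a subset) and record the score returned by fit_and_score.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    pool = list(variant_ids)
    results: Dict[int, List[float]] = {}
    for nvar in subset_sizes:
        scores = []
        for _ in range(n_repeats):
            subset = rng.sample(pool, nvar)  # raises ValueError if nvar > len(pool)
            scores.append(fit_and_score(subset))
        results[nvar] = scores
    return results


# Usage with a placeholder scorer; a real scorer would train and evaluate a model.
toy_ids = [f"var{i}" for i in range(1000)]
toy_results = repeated_subset_scores(toy_ids, [10, 100], lambda s: float(len(s)))
```

The spread of the five scores at each size gives a rough sense of how sensitive performance is to which variants happen to be sampled.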
