Am J Hum Genet. 2024 May 2;111(5):990-995.
doi: 10.1016/j.ajhg.2024.04.001. Epub 2024 Apr 17.

MagicalRsq-X: A cross-cohort transferable genotype imputation quality metric

Quan Sun et al. Am J Hum Genet. 2024.

Abstract

Since genotype imputation was introduced, researchers have relied on the imputation quality estimated by imputation software to perform post-imputation quality control (QC). However, this quality estimate (denoted as Rsq) performs less well for lower-frequency variants. We recently published MagicalRsq, a machine-learning-based imputation quality calibration, which leverages additional typed markers from the same cohort and outperforms Rsq as a QC metric. In this work, we extended the original MagicalRsq to allow cross-cohort model training and named the new model MagicalRsq-X. We removed the cohort-specific estimated minor allele frequency and included linkage disequilibrium scores and recombination rates as additional features. Leveraging whole-genome sequencing data from TOPMed, specifically participants in the BioMe, JHS, WHI, and MESA studies, we performed comprehensive cross-cohort evaluations for individuals of predominantly European and African ancestry, based on their inferred global ancestry with the 1000 Genomes and Human Genome Diversity Project data as reference. Our results suggest MagicalRsq-X outperforms Rsq in almost every setting, with 7.3%-14.4% improvement in squared Pearson correlation with true R², corresponding to 85-218 K variant gains. We further developed a metric to quantify the genetic distance of a target cohort relative to a reference cohort and showed that this metric largely explained the performance of MagicalRsq-X models. Finally, we found that MagicalRsq-X saved up to 53 known genome-wide significant variants in one of the largest blood cell trait GWASs that would be missed using the original Rsq for QC. In conclusion, MagicalRsq-X shows superiority for post-imputation QC and benefits genetic studies by distinguishing well and poorly imputed lower-frequency variants.

Keywords: cross-cohort; genome-wide association studies; genotype imputation; imputation quality; machine learning; quality control; rare variants; variant filtering; whole-genome sequencing.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Figure 1
MagicalRsq-X overview
(A) Feature modification from the original MagicalRsq model. We first removed the estimated MAF feature derived from imputation output, which we refer to as MagicalRsq-X model v1. We then added recombination rate from 1000G and long-range LD scores (± 1 Mb) of four continental populations from TOP-LD, leading to MagicalRsq-X model v2. Finally, we added short-range LD scores (± 100 Kb) of the same four populations from TOP-LD, resulting in MagicalRsq-X model v3, which is the final model showing the best and most robust performance.
(B) Overview of study cohorts in our evaluations. We leveraged TOPMed WGS data of four studies, BioMe, MESA, JHS, and WHI, as our internal evaluation cohorts. We first inferred local and global ancestry of individuals in these studies and then selected individuals who are primarily of European ancestry or admixed African ancestry based on inferred global genetic similarity (detailed in supplemental methods). We also added the CF participants as an external evaluation cohort.
(C) Data preparation for MagicalRsq-X experiments. We first thinned the WGS data to array genotype density and then performed genotype imputation, which outputs individual-level imputed data and Rsq. We then calculated true R² by comparing imputed data with WGS data for imputed markers (i.e., those in WGS but not included in the thinned dataset).
(D) Model training and evaluation using BioMe EUR for training and MESA EUR for testing as an example. Starting from BioMe EUR WGS data, we performed imputation as demonstrated in (C). After obtaining all the external variant-level features, which were further combined with true R² and Rsq, we trained MagicalRsq-X models. For the testing cohort, MESA EUR in this example, we similarly performed data thinning and imputation. We then applied the models pre-trained from BioMe EUR to calculate MagicalRsq-X for MESA EUR. In our experiments, we similarly calculated true R² in MESA EUR and evaluated the performance of MagicalRsq-X compared to Rsq. The dashed square around “true R²” in the testing set means that it is not required in real-life application and was used here only for evaluation.
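The data preparation in panel (C) can be illustrated with a minimal sketch. It assumes that true R² for a variant is the squared Pearson correlation between imputed dosages and WGS genotypes across individuals; the caption only says the two are compared, so that definition, the function names, and the toy inputs are all assumptions made for illustration, not the authors' code.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns None when either sequence has zero variance (for example a
    variant that is monomorphic in the sample), because the correlation
    is undefined in that case.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def true_r2(imputed_dosages, wgs_genotypes):
    """Assumed definition: squared Pearson correlation for one variant."""
    r = pearson_r(imputed_dosages, wgs_genotypes)
    return None if r is None else r * r


def imputed_markers(wgs_variant_ids, array_variant_ids):
    """Variants present in WGS but absent from the thinned (array-like) set."""
    kept = set(array_variant_ids)
    return [v for v in wgs_variant_ids if v not in kept]


# Hypothetical toy data: five individuals at one imputed variant.
wgs = [0, 1, 2, 0, 1]                 # sequenced genotypes (alt allele counts)
dosage = [0.1, 0.9, 1.8, 0.0, 1.2]    # imputed dosages in [0, 2]
print(true_r2(dosage, wgs))
print(imputed_markers(["rsA", "rsB", "rsC"], ["rsB"]))
```

Returning None for zero-variance input keeps monomorphic variants from silently producing a misleading value; a real pipeline would decide explicitly how to treat them.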
Figure 2
Cross-cohort MagicalRsq-X model performance
(A) Performance across the three EUR cohorts (BioMe EUR, MESA EUR, and WHI EUR) for low-frequency variants (MAF [0.5%, 5%]). We trained MagicalRsq-X models with 10 K, 50 K, 100 K, 200 K, 500 K, and 1 M randomly selected variants (x axis), each with five repeats. The y axis shows the squared Pearson correlation between MagicalRsq-X and true R². Each row represents a testing cohort, and each column represents a training cohort. The diagonal panels are left empty on purpose because we only assess cross-cohort model performance. Red dashed lines represent the squared Pearson correlation between standard Rsq and true R², which serves as the benchmark.
(B) Performance across the four AA cohorts (BioMe AA, JHS, MESA AA, and WHI AA) for rare variants (MAF <0.5%).
(C–E) Comparison between true R² vs. Rsq and true R² vs. MagicalRsq-X for MESA EUR common variants (C), MESA AA low-frequency variants (D), and MESA AA rare variants (E) on chr10, where the MagicalRsq-X shown was calculated from models trained with 100 K variants from BioMe EUR (C) and WHI AA (D and E). For the smooth scatterplots, the darker the color, the larger the number of variants. Outliers are plotted separately. Red lines are 45-degree lines, and blue lines are the fitted lines.
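The evaluation metric on the y axis of panels (A) and (B) can be sketched as follows: within a MAF bin, compute the squared Pearson correlation between a quality metric and true R² across variants, once for Rsq (the benchmark) and once for MagicalRsq-X. The record layout, the field names, the choice to treat the bin boundaries as inclusive, and all numbers below are hypothetical; which MAF estimate defines the bins is not stated in this excerpt.

```python
def squared_pearson(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def evaluate_metric(variants, metric_key, maf_low, maf_high):
    """Score one quality metric against true R2 inside a MAF bin.

    `variants` is a list of dicts with keys "maf", "true_r2", and the
    metric named by `metric_key` (a hypothetical layout). Both bin
    boundaries are treated as inclusive, which is an assumption.
    """
    subset = [v for v in variants if maf_low <= v["maf"] <= maf_high]
    metric = [v[metric_key] for v in subset]
    truth = [v["true_r2"] for v in subset]
    return squared_pearson(metric, truth)


# Made-up toy records, not results from the paper.
toy = [
    {"maf": 0.01, "rsq": 0.90, "magicalrsq_x": 0.5, "true_r2": 0.5},
    {"maf": 0.02, "rsq": 0.80, "magicalrsq_x": 0.7, "true_r2": 0.7},
    {"maf": 0.03, "rsq": 0.85, "magicalrsq_x": 0.9, "true_r2": 0.9},
    {"maf": 0.20, "rsq": 0.99, "magicalrsq_x": 0.1, "true_r2": 0.99},  # common, outside the bin
]

# Low-frequency bin from the caption: MAF between 0.5% and 5%.
benchmark = evaluate_metric(toy, "rsq", 0.005, 0.05)
calibrated = evaluate_metric(toy, "magicalrsq_x", 0.005, 0.05)
print(benchmark, calibrated)
```

Scoring both metrics on the same variant subset is what makes the red dashed benchmark line comparable to the MagicalRsq-X points in each panel.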
