Front Genet. 2023 Jul 20:14:1162690.
doi: 10.3389/fgene.2023.1162690. eCollection 2023.

Missingness adapted group informed clustered (MAGIC)-LASSO: a novel paradigm for phenotype prediction to improve power for genetic loci discovery

Amanda Elswick Gentry et al. Front Genet.

Abstract

Introduction: The availability of large-scale biobanks linking genetic data, rich phenotypes, and biological measures is a powerful opportunity for scientific discovery. However, real-world collections frequently have extensive missingness. While missing data prediction is possible, performance is significantly impaired by the block-wise missingness inherent to many biobanks.

Methods: To address this, we developed Missingness Adapted Group-wise Informed Clustered (MAGIC)-LASSO, which performs hierarchical clustering of variables based on missingness, followed by sequential Group LASSO within clusters. Variables are pre-filtered for missingness and for balance between training and target sets, and final models are built using stepwise inclusion of features ranked by completeness. We applied the method in the UK Biobank (n > 500,000) to predict unmeasured Alcohol Use Disorders Identification Test (AUDIT) scores.

Results: The phenotypic correlation between measured and predicted total score was 0.67, while genetic correlations between independent subjects were high (>0.86).

Discussion: Phenotypic and genetic correlations in the real-data application, as well as in simulations, demonstrate that the method has significant accuracy and utility for increasing power for genetic loci discovery.
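
The clustering step described in the Methods can be illustrated with a minimal sketch. The abstract does not state the distance measure or linkage used, so the sketch assumes a distance of 1 minus the Pearson correlation between missingness indicators and single linkage cut at a hypothetical threshold (`max_distance`). All function names and the toy data are illustrative, not the authors' implementation, and the subsequent Group LASSO step is not shown.

```python
from math import sqrt

def missing_indicator(rows):
    """Convert rows of values (None = missing) into per-variable 0/1 missingness columns."""
    n_vars = len(rows[0])
    return [[1 if row[j] is None else 0 for row in rows] for j in range(n_vars)]

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant indicator: correlation undefined, treated as 0 here
    return sxy / sqrt(sxx * syy)

def cluster_by_missingness(rows, max_distance=0.5):
    """Group variables whose missingness patterns are similar.

    Assumption: distance = 1 - correlation of missingness indicators.
    Joining every pair within max_distance and taking connected components
    is equivalent to cutting a single-linkage dendrogram at that height.
    """
    cols = missing_indicator(rows)
    p = len(cols)
    parent = list(range(p))  # union-find forest over variable indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if 1.0 - pearson(cols[i], cols[j]) <= max_distance:
                parent[find(i)] = find(j)

    clusters = {}
    for j in range(p):
        clusters.setdefault(find(j), []).append(j)
    return sorted(clusters.values())

# Toy data: variables 0 and 1 are missing for the same subjects,
# as are variables 2 and 3 (two missingness blocks).
example_rows = [
    [None, None, 1.0, 2.0],
    [None, None, 1.5, 2.5],
    [None, None, 0.5, 1.0],
    [3.0, 4.0, None, None],
    [3.5, 4.5, None, None],
    [2.0, 1.0, None, None],
]
example_clusters = cluster_by_missingness(example_rows)
```

On this toy input, the two pairs of variables that share a missingness block end up in separate clusters, which is the grouping a within-cluster model would then be fit on.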

Keywords: GWAS; LASSO; UK biobank; alcohol consumption; genetics; machine learning; missingness; prediction.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Heatmap of pairwise correlation between missingness patterns across more than 1,000 UKB variables with the highest observation counts. White indicates no correlation, while darker shades of red and green indicate increasing levels of positive and negative correlation, respectively.
FIGURE 2
Heatmap of pairwise complete observation count across the same UKB variables shown in Figure 1, with blue indicating zero subjects with a given pair of variables observed and colors ranging from purple, to white, to red, indicating increasing counts of subjects with a given pair of variables present.
FIGURE 3
Flow logic of the MAGIC-LASSO procedure. The MAGIC-LASSO procedure begins with filtering, followed by clustering, then iterative Group-LASSO application until parsimony is achieved. Figure created with BioRender.com.
FIGURE 4
Conceptualization of how a dataset may be subdivided into a measured and an unmeasured set. N represents the full sample size, and N_Unmeasured and N_Measured represent the subsets of subjects for whom the outcome of interest is missing or measured, respectively. The amount of overlap in observation may then be quantified for each of p additional variables. Figure created with BioRender.com.
FIGURE 5
Number of complete cases as a function of number of missingness blocks and overall random missingness across the dataset.
FIGURE 6
Density curves showing observed and predicted outcomes and prediction residuals. (Left) Density curves of the observed scores and of the scores predicted in the measured and unmeasured sets for (A) AUDIT-Total, (B) AUDIT-Consumption, and (C) AUDIT-Problems. (Right) Density curves of the prediction residuals, with means noted, for (D) AUDIT-Total, (E) AUDIT-Consumption, and (F) AUDIT-Problems.
FIGURE 7
(A) LDSC-estimated heritabilities. SNP-based heritability estimates for the AUDIT outcomes as observed (green) and as predicted in the measured (purple) and unmeasured (orange) sets. (B) LDSC-estimated genetic correlations. Genetic correlations estimated between the observed data and the predicted scores in the measured sets (green), between the observed data and the predicted scores in the unmeasured sets (orange), and between the predicted scores in the measured and unmeasured sets (purple).
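
The quantities summarized in Figures 2 and 5 (pairwise complete observation counts and the number of complete cases) can be computed as in the following minimal sketch. The function names and toy data are hypothetical and only illustrate the definitions; this is not code from the paper.

```python
def pairwise_complete_counts(rows):
    """Count, for each pair of variables, the subjects with both observed.

    rows: list of per-subject value lists, with None marking a missing value.
    Returns a symmetric p x p matrix; the diagonal holds per-variable counts.
    """
    n_vars = len(rows[0])
    counts = [[0] * n_vars for _ in range(n_vars)]
    for row in rows:
        observed = [j for j, value in enumerate(row) if value is not None]
        for a in observed:
            for b in observed:
                counts[a][b] += 1
    return counts

def complete_case_count(rows):
    """Number of subjects with every variable observed."""
    return sum(all(value is not None for value in row) for row in rows)

# Toy data with two missingness blocks and a single fully observed subject.
toy_rows = [
    [None, None, 1.0, 2.0],
    [None, None, 1.5, 2.5],
    [3.0, 4.0, None, None],
    [3.5, 4.5, 0.5, 1.0],
]
toy_counts = pairwise_complete_counts(toy_rows)
toy_complete = complete_case_count(toy_rows)
```

In the toy data, each variable is observed for two or three subjects, yet only one subject is a complete case, which is the effect of block-wise missingness that a complete-case analysis would suffer from.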
