Biostatistics. 2024 Dec 31;26(1):kxaf032.
doi: 10.1093/biostatistics/kxaf032.

A novel high-dimensional model for identifying regional DNA methylation QTLs

Kaiqiong Zhao et al. Biostatistics.

Abstract

Varying coefficient models offer the flexibility to learn the dynamic changes of regression coefficients. Despite the good interpretability and diverse applications of these models, existing estimation methods have important limitations in high-dimensional settings. For example, we routinely encounter the need for variable selection when faced with a large collection of covariates with nonlinear/varying effects on outcomes, and no ideal solutions exist. One illustration is identifying a subset of genetic variants with local influence on methylation levels in a regulatory region. To address this problem, we propose a composite sparse penalty that encourages both sparsity and smoothness for the varying coefficients. We present an efficient proximal gradient descent algorithm that scales to high-dimensional predictor spaces, providing sparse solutions for the varying coefficients. We conducted a comprehensive simulation study to evaluate the performance of our approach in terms of estimation, prediction and selection accuracy. We show that the inclusion of smoothness control yields much better results than sparsity-only approaches. An adaptive version of the penalty offers additional performance gains. We further demonstrate the utility of our method in identifying regional mQTLs from asymptomatic samples in the CARTaGENE cohort. The methodology is implemented in the R package sparseSOMNiBUS, available on GitHub.
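
To make the idea of proximal gradient descent with a sparsity-plus-smoothness penalty concrete, the following is a minimal Python sketch under simplifying assumptions. It is not the penalty proposed in the paper and not the sparseSOMNiBUS implementation: it swaps in a squared-error loss, a group-LASSO term for sparsity (handled by a block soft-thresholding proximal step), and a quadratic second-difference roughness term for smoothness (folded into the gradient). All function names, the penalty form, and the tuning values `lam_sparse` and `lam_smooth` are hypothetical choices made for illustration.

```python
import numpy as np


def second_difference_penalty(k):
    """Roughness matrix D'D, where D takes second differences of k basis coefficients."""
    D = np.diff(np.eye(k), n=2, axis=0)  # shape (k - 2, k)
    return D.T @ D


def group_soft_threshold(v, thresh):
    """Proximal operator of thresh * ||v||_2: shrinks the block, or zeroes it entirely."""
    norm = np.linalg.norm(v)
    if norm <= thresh:
        return np.zeros_like(v)
    return (1.0 - thresh / norm) * v


def proximal_gradient(X, y, n_groups, k, lam_sparse, lam_smooth, n_iter=500):
    """Minimize (1/2n)||y - X b||^2 + (lam_smooth/2) b'P b + lam_sparse * sum_j ||b_j||_2.

    Assumption: columns of X are ordered in n_groups blocks of k basis-expanded
    columns, one block per varying coefficient.
    """
    n = X.shape[0]
    P = np.kron(np.eye(n_groups), second_difference_penalty(k))
    # The step size is 1 / L, with L the largest eigenvalue of the smooth part's Hessian.
    hessian = X.T @ X / n + lam_smooth * P
    step = 1.0 / np.linalg.eigvalsh(hessian)[-1]
    beta = np.zeros(n_groups * k)
    for _ in range(n_iter):
        # Gradient step on the smooth part (loss plus roughness penalty).
        grad = X.T @ (X @ beta - y) / n + lam_smooth * (P @ beta)
        z = beta - step * grad
        # Proximal step on the non-smooth group penalty, one block at a time.
        for j in range(n_groups):
            block = slice(j * k, (j + 1) * k)
            beta[block] = group_soft_threshold(z[block], step * lam_sparse)
    return beta


# Toy data (hypothetical): 5 candidate predictors, only the first has an effect.
rng = np.random.default_rng(0)
n, n_groups, k = 200, 5, 4
X = rng.standard_normal((n, n_groups * k))
beta_true = np.zeros(n_groups * k)
beta_true[:k] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_hat = proximal_gradient(X, y, n_groups, k, lam_sparse=0.3, lam_smooth=0.1)
selected = [j for j in range(n_groups) if np.any(beta_hat[j * k:(j + 1) * k] != 0)]
print("selected groups:", selected)
```

In this sketch, a block is dropped whenever its gradient-updated norm falls below the threshold, so variable selection happens at the level of whole coefficient functions, while the roughness term discourages wiggly estimates for the blocks that remain.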

Keywords: methylation QTLs; proximal gradient descent; smoothness control; variable selection; varying coefficient model.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Fig. 1.
Regional mQTL patterns across three gene-based methylation regions. Each region is defined as the first exon and 2 kb upstream of the corresponding protein-coding gene. Shown are three representative regions—MIR4520-2, LINC01252, and COMTD1—ranked 7th, 1st, and 11th, respectively, out of 12,283 gene-defined regions (see Table 2). The rows display A) methylation proportions by genotype, B) estimated smooth SNP effects, and C) peak effect magnitudes across all candidate SNPs within a ± 2.5 Mb window. Together, these patterns illustrate the need to model both smooth and sparse genetic effects in regional mQTL analysis. Additional examples are shown in Fig. S11.
Fig. 2.
Estimates of the first 6 varying coefficients of one simulation run of A) Example 1 (formula image) and B) Example 2 (formula image), using the SSP, SSP0, group LASSO and GAM approaches. The red curves are the true formula image used to generate the data. The results over 100 simulation runs are shown in Figs S1 to S4 for Example 1 and Figs S5 to S8 for Example 2.
Fig. 3.
Performance measures A) of SSP, SSP0 and gLASSO when using 10 or 30 basis functions to expand formula image, labeled as “df = 10” and “df = 30,” and B) using the ordinary, 1SE rule and adaptive version of SSP, SSP0 and gLASSO. Data were generated from Example 1 (formula image). The top three panels show the values of IBIASformula image, IVAR and IMSE aggregated from all the 100 varying coefficients in the model. The bottom left panel displays the distribution of deviance errors. The “TP” and “FP” panels display the mean values of TP and FP numbers, as well as their SD (indicated by the error bar), over 100 simulation runs.
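
As a rough illustration of how measures like those in Fig. 3 are commonly computed, the sketch below assumes the usual definitions: integrated squared bias, integrated variance, and their sum (integrated mean squared error) for one coefficient curve evaluated on a grid across simulation runs, plus true-positive and false-positive counts for the selected coefficients. The exact definitions used in the paper are not given in this excerpt, so the formulas, function names, and inputs here are assumptions.

```python
import numpy as np


def trapezoid(values, grid):
    """Trapezoidal-rule integral of values over grid."""
    return float(np.sum((values[1:] + values[:-1]) * np.diff(grid)) / 2.0)


def integrated_errors(estimates, truth, grid):
    """Return (integrated squared bias, integrated variance, integrated MSE).

    estimates: array of shape (n_runs, n_grid), one estimated curve per run.
    truth:     array of shape (n_grid,), the true curve on the same grid.
    """
    mean_curve = estimates.mean(axis=0)
    squared_bias = (mean_curve - truth) ** 2
    variance = estimates.var(axis=0)  # pointwise variance across runs (ddof=0)
    ibias2 = trapezoid(squared_bias, grid)
    ivar = trapezoid(variance, grid)
    return ibias2, ivar, ibias2 + ivar


def selection_counts(selected, true_nonzero):
    """Count true positives and false positives among the selected coefficient indices."""
    selected, true_nonzero = set(selected), set(true_nonzero)
    return len(selected & true_nonzero), len(selected - true_nonzero)
```

With the pointwise variance computed as above, the integrated MSE equals the integral of the run-averaged squared error, which gives a simple consistency check on any implementation.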
