2014 Jul 3;15:232.
doi: 10.1186/1471-2105-15-232.

On the potential of models for location and scale for genome-wide DNA methylation data

Simone Wahl et al. BMC Bioinformatics.

Abstract

Background: Epigenome-wide association studies (EWAS) are yielding increasing knowledge of the role of epigenetic mechanisms such as DNA methylation in disease processes. In addition, EWAS aid the understanding of behavioral and environmental effects on DNA methylation. The characteristics of methylation data pose specific challenges for statistical analysis. First, methylation β-values represent proportions with skewed and heteroscedastic distributions, so traditional modeling strategies that assume a normally distributed response might not be appropriate. Second, recent evidence suggests that not only mean differences but also variability in site-specific DNA methylation is associated with diseases, including cancer. The purpose of this study was to compare different modeling strategies for methylation data in terms of model performance and performance of downstream hypothesis tests. Specifically, we used the generalized additive models for location, scale and shape (GAMLSS) framework to compare beta regression with Gaussian regression on raw, binary logit transformed and arcsine square root transformed methylation data, with and without modeling a covariate effect on the scale parameter.
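The two transformations of β-values named above can be illustrated with a minimal sketch. It assumes the usual definition of the binary logit as log2(β / (1 − β)); the function names and the clipping constant `eps` are hypothetical and are not taken from the study.

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Binary (base-2) logit of beta-values, i.e. M-values.

    eps is an assumed guard that keeps beta away from exactly 0 or 1,
    where the logit is undefined; it is not a value from the paper.
    """
    b = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(b / (1.0 - b))

def beta_to_arcsine_sqrt(beta):
    """Arcsine square root transform of beta-values in [0, 1]."""
    b = np.asarray(beta, dtype=float)
    return np.arcsin(np.sqrt(b))

# Illustrative beta-values (made up for this example).
beta = np.array([0.1, 0.5, 0.9])
m_values = beta_to_m(beta)
asin_values = beta_to_arcsine_sqrt(beta)
```

Both transforms map the bounded proportion scale to one on which a Gaussian response model may be more plausible; the logit is symmetric around β = 0.5, so β-values of 0.1 and 0.9 yield M-values of equal magnitude and opposite sign.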

Results: Using simulated and real data from a large population-based study and an independent sample of cancer patients and healthy controls, we show that beta regression does not outperform competing strategies in terms of model performance. In addition, Gaussian models for location and scale showed improved performance compared to models for location only. The best performance was observed for the Gaussian model on binary logit transformed β-values, referred to as M-values. Our results further suggest that models for location and scale are particularly sensitive to violations of the distribution assumption and to outliers in the methylation data. Therefore, a resampling procedure is proposed as a mode of inference and shown to reduce the type I error rate in practically relevant settings. We apply the proposed method in an EWAS of BMI and age and reveal strong associations of age with methylation variability that are validated in an independent sample.
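To make the idea of a Gaussian model for location and scale concrete, the following is a minimal sketch that lets one covariate act on both the mean and the standard deviation of the response. It is not the GAMLSS implementation used in the study: the log link for the scale parameter, the direct maximum-likelihood fit with `scipy.optimize.minimize`, the function names, and the simulated data are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, X, y):
    """Negative Gaussian log-likelihood with mu = X b and log(sigma) = X g."""
    p = X.shape[1]
    b, g = params[:p], params[p:]
    mu = X @ b
    log_sigma = X @ g  # assumed log link keeps sigma positive
    z = (y - mu) / np.exp(log_sigma)
    return np.sum(log_sigma + 0.5 * z**2) + 0.5 * y.size * np.log(2.0 * np.pi)

def fit_location_scale(X, y):
    """Fit location and scale coefficients by maximum likelihood.

    Assumes the first column of X is an intercept.
    """
    p = X.shape[1]
    b0, *_ = np.linalg.lstsq(X, y, rcond=None)   # start from least squares
    g0 = np.zeros(p)
    g0[0] = np.log(np.std(y - X @ b0))           # start from a constant scale
    res = minimize(neg_log_lik, np.concatenate([b0, g0]),
                   args=(X, y), method="BFGS")
    return res.x[:p], res.x[p:], res

# Simulated example (hypothetical values): a covariate x shifts both the
# mean and the log standard deviation of the response.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.normal(loc=0.5 + 0.3 * x, scale=np.exp(-0.5 + 0.4 * x))

b_hat, g_hat, res = fit_location_scale(X, y)
```

Here `b_hat` estimates the covariate effect on the methylation level (μ) and `g_hat` the effect on its variability (σ, on the log scale). A model for location only corresponds to fixing the slope in `g` at zero.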

Conclusions: Models for location and scale are promising tools for EWAS that may help to understand the influence of environmental factors and disease-related phenotypes on methylation variability and its role during disease development.


Figures

Figure 1
Performance of competing models for DNA methylation data. A and B: Median, 5% and 95% quantile of pseudo R² in training and test data set, respectively, across the CpG sites. C and D: Pseudo R² values of individual CpG sites in training and test data set, respectively. 1000 CpG sites were randomly chosen for this plot. E and F: Proportion of CpG sites for which the respective model had the largest pseudo R² measure as compared to the competing models, in training and test data set, respectively. Model abbreviations are explained in Table 1.
Figure 2
Residual normal fit of competing models for DNA methylation data. A: Proportion of CpG sites for which significant deviation of residuals from normality was indicated by Shapiro-Wilk test p-value < 0.05. B: Proportion of CpG sites for which the respective model had the best residual normal fit as compared to the competing models. Model abbreviations are explained in Table 1.
Figure 3
Simulation study: Observed type I error rates of hypothesis tests for covariate effects. Observed type I error is plotted against the effect size that the same covariate (BMI) had on the other distribution parameter. Simulation results are shown for beta distributed (A, B) and for real-data distributed methylation values (C, D). Model abbreviations are explained in Table 1.
Figure 4
Origins of inflated type I error rates of downstream hypothesis tests. Two examples of CpG sites with missing covariates (A) and strong outlier structure (B). From left to right: kernel density plot of methylation M-value, scatter plot of methylation M-value against BMI, and kernel density plot of the test statistic null distribution assumed by the model lo+ (solid black line) and realized in the bootstrap samples after inclusion of genetic variants (solid blue line), for the test for a BMI effect on the scale parameter. The realized distribution approximately followed a normal distribution (dashed black line). t_obs and solid red line: test statistic from the original data without inclusion of genetic variants.
Figure 5
Type I error control through the resampling procedure and inclusion of genetic variants as covariates. Observed type I error for (A) μ and (B) σ is plotted against the effect size that the same covariate (BMI) had on the other distribution parameter. Simulation on real-data distributed methylation responses, before (solid lines) and after (dot-dashed lines) application of the resampling procedure and inclusion of genetic variants as covariates. Model abbreviations are explained in Table 1.
Figure 6
EWAS results. Results for BMI (A and B) and age (C and D) effects on methylation level (μ) and variability (σ) are shown, respectively. Bold number in the top right corner: number of CpG sites with genome-wide significant association (p < 1.3·10⁻⁷) according to asymptotic test results. Numbers in the box represent percentages of associations that were significant according to resampling-based inference (red circle) and/or that were validated in the independent F3 study (at p < 0.05, blue circle), or neither of them (bottom right). Numbers in brackets indicate the respective percentage of positive associations. p-values are from Fisher’s exact tests for enrichment of validated associations among the resampling-significant associations.
