BMC Genomics. 2019 May 14;20(1):366.
doi: 10.1186/s12864-019-5761-7.

Guidance for DNA methylation studies: statistical insights from the Illumina EPIC array

Georgina Mansell et al. BMC Genomics. 2019.

Abstract

Background: There has been a steady increase in the number of studies aiming to identify DNA methylation differences associated with complex phenotypes. Many of the challenges of epigenetic epidemiology regarding study design and interpretation have been discussed in detail; however, some analytical concerns remain outstanding and require further exploration. In this study we seek to address three analytical issues. First, we quantify the multiple testing burden and propose a standard statistical significance threshold for identifying DNA methylation sites that are associated with an outcome. Second, we establish whether linear regression, the chosen statistical tool for the majority of studies, is appropriate and whether it is biased by the underlying distribution of DNA methylation data. Finally, we assess the sample size required for adequately powered DNA methylation association studies.

Results: We quantified DNA methylation in the Understanding Society cohort (n = 1175), a large population-based study, using the Illumina EPIC array to assess the statistical properties of DNA methylation association analyses. By simulating null DNA methylation studies, we generated the distribution of p-values expected by chance and calculated the 5% family-wise error rate threshold for EPIC array studies to be 9 × 10⁻⁸. Next, we tested whether the assumptions of linear regression are violated by DNA methylation data and found that the majority of sites do not satisfy the assumption of normal residuals. Nevertheless, we found no evidence that this bias influences analyses by increasing the likelihood that affected sites are false positives. Finally, we performed power calculations for EPIC-based DNA methylation studies, demonstrating that existing studies with data on ~1000 samples are adequately powered to detect small differences at the majority of sites.

Conclusion: We propose that a significance threshold of P < 9 × 10⁻⁸ adequately controls the false positive rate for EPIC array DNA methylation studies. Moreover, our results indicate that linear regression is a valid statistical methodology for DNA methylation studies, even though the data do not always satisfy the assumptions of this test. These findings have implications for epidemiological studies of DNA methylation and provide a framework for the interpretation of findings from current and future studies.

Keywords: DNA methylation; Epigenome-wide association study (EWAS); Illumina EPIC array; Multiple testing; Power.
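The thresholding idea described in the Results can be illustrated with a minimal sketch: simulate many null studies, record the smallest p-value in each, and take the 5th percentile of those minima as the 5% family-wise error rate threshold. This is not the authors' pipeline. It assumes independent tests with uniform null p-values, whereas the study simulated null phenotypes against real, correlated methylation data; correlation between sites would be expected to make the effective number of tests smaller than the number of sites. All function names are hypothetical, and the conversion `alpha / threshold` for the effective number of tests is an assumed Bonferroni-style definition.

```python
import math
import random


def min_p_value_null(n_tests: int, rng: random.Random) -> float:
    """Draw the smallest p-value from one null study of n_tests independent tests.

    Under the null each p-value is Uniform(0, 1), so the minimum of n_tests of
    them has CDF 1 - (1 - p)**n_tests and can be sampled by inversion.
    """
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return -math.expm1(math.log(u) / n_tests)


def fwer_threshold(min_p_values, alpha=0.05):
    """Empirical alpha-quantile of the per-study minimum p-values."""
    ordered = sorted(min_p_values)
    index = max(0, int(alpha * len(ordered)) - 1)
    return ordered[index]


def effective_number_of_tests(threshold, alpha=0.05):
    """Number of independent tests whose Bonferroni threshold equals `threshold`
    (assumed definition: alpha / threshold)."""
    return alpha / threshold


def simulate_threshold(n_tests, n_studies=1000, alpha=0.05, seed=1):
    """Run n_studies null studies and return the empirical FWER threshold."""
    rng = random.Random(seed)
    min_ps = [min_p_value_null(n_tests, rng) for _ in range(n_studies)]
    return fwer_threshold(min_ps, alpha)


if __name__ == "__main__":
    # Illustrative size only; not the number of sites on the EPIC array.
    threshold = simulate_threshold(n_tests=10_000)
    print(f"empirical 5% FWER threshold: {threshold:.3g}")
    print(f"implied effective tests:     {effective_number_of_tests(threshold):.0f}")
```

Under the same assumed definition, the figures quoted in the Fig. 2 caption are consistent with each other: 0.05 / 5,803,067 ≈ 8.62 × 10⁻⁹.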

Conflict of interest statement

Ethics approval and consent to participate

Ethical approval for the Understanding Society nurse visit was obtained from the National Research Ethics Service (Reference: 10/H0604/2). Participants gave written consent for blood sampling.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Subsampling sites on the EPIC array to estimate a genome-wide significance threshold. Line graphs depicting the relationship between the number of EPIC array DNA methylation sites (x-axis) and a) the 5% family-wise error rate (FWER) (−log10(p-values); y-axis) and b) the mean effective number of tests (y-axis) estimated from 1000 simulated null association studies. Error bars present the 95% confidence intervals from 1000 simulations. The final point includes all DNA methylation sites on the EPIC array and therefore could not be resampled to generate a confidence interval
Fig. 2
Extrapolation to a genome-wide significance threshold. Line graphs depicting the relationship between the number of DNA methylation sites (x-axis) and a) the effective number of independent tests (y-axis) and b) the multiple testing corrected threshold (−log10(p-value); y-axis) estimated after fitting a Monod function to the observed data presented in Fig. 1b. The observed values are plotted as the solid black line, and the estimated Monod model is plotted as a dashed line. The grey shaded region represents the 95% CI created by fitting a Monod model to the 95% CI of the subsampled data. The blue horizontal line represents the estimated asymptote of the Monod model of 5,803,067 independent tests, equivalent to a genome-wide significance threshold of 8.62 × 10⁻⁹
Fig. 3
Overlap of significant violations of linear regression assumptions. Venn diagram depicting the overlap of DNA methylation sites significant for each test of a linear regression assumption (P < 9.42 × 10⁻⁸). The number of overlapping DNA methylation sites is presented, with the percentage of all tested sites shown in brackets
Fig. 4
Comparison of tests of linear regression assumptions across the distribution of DNA methylation levels. Boxplots of –log10(p-value) for each of the 5 tests (a) global (b) skewness (c) kurtosis (d) link function and (e) heteroskedasticity for groups of DNA methylation sites binned by their mean DNA methylation level. The boxes are coloured by their mean –log10(p-value) from light yellow (low) to red (high)
Fig. 5
Comparison of tests of linear regression assumptions against DNA methylation variability. Boxplots of –log10(p-value) for each of the 5 tests (a) global (b) skewness (c) kurtosis (d) link function and (e) heteroskedasticity for groups of DNA methylation sites binned by their standard deviation. The boxes are coloured by their mean –log10(p-value) from light yellow (low) to red (high)
Fig. 6
Comparison of tests of linear regression assumptions with bias in DNA methylation association studies. Scatterplots of –log10(p-value) (y-axis) from the (a) global (b) skewness (c) kurtosis (d) link function and (e) heteroskedasticity tests performed in the R gvlma package against average (mean) ranking from 1000 simulated null association studies (x-axis) for all DNA methylation sites. Each point represents a single site, and the colour represents the density of points plotted at that position (low density in grey to high density in yellow)
Fig. 7
Power curves of typical DNA methylation studies. Line graphs depicting the proportion of sites on the EPIC array (y-axis) with sufficient power (x-axis) to detect a mean difference in DNA methylation between two groups of (a) 2% and (b) 5%. The different coloured lines represent different sample sizes, where N is the total sample size, split 50:50 between the two groups
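The kind of calculation behind power curves such as those in Fig. 7 can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes a two-sided comparison of two equally sized groups, a normal approximation to the test statistic, and a significance threshold of 9 × 10⁻⁸. The function names and the per-site standard deviations in the usage example are hypothetical and are not taken from the study data.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_two_group(mean_diff, sd, n_total, alpha=9e-8):
    """Approximate power to detect `mean_diff` between two groups.

    mean_diff and sd are in the same units (e.g. % methylation); n_total is
    split 50:50 between the groups. Uses a normal approximation.
    """
    n_per_group = n_total / 2
    se = sd * math.sqrt(2.0 / n_per_group)    # standard error of the difference
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2)  # two-sided critical value
    shift = abs(mean_diff) / se
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def proportion_powered(site_sds, mean_diff, n_total, alpha=9e-8, target=0.8):
    """Fraction of sites whose power reaches `target`, given per-site SDs."""
    powered = sum(
        power_two_group(mean_diff, sd, n_total, alpha) >= target
        for sd in site_sds
    )
    return powered / len(site_sds)


if __name__ == "__main__":
    hypothetical_sds = [1.0, 2.5, 5.0, 10.0]  # illustrative values only
    for n_total in (200, 500, 1000, 2000):
        share = proportion_powered(hypothetical_sds, mean_diff=2.0, n_total=n_total)
        print(f"N={n_total}: {share:.0%} of example sites reach 80% power")
```

Because power depends on each site's variability, the proportion of adequately powered sites for a given N is determined by the distribution of per-site standard deviations; the sketch takes that distribution as an input rather than assuming one.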
