Int J Epidemiol. 2012 Feb;41(1):200-9.
doi: 10.1093/ije/dyr238.

Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies


Andrew E Jaffe et al. Int J Epidemiol. 2012 Feb.

Abstract

Background: During the past 5 years, high-throughput technologies have been successfully used in epidemiological studies, but almost all of these studies have focused on sequence variation through genome-wide association studies (GWAS). Today, the study of other genomic events is becoming more common in large-scale epidemiological studies. Many of these, unlike the single-nucleotide polymorphisms studied in GWAS, are continuous measures. In this context, searching for disease-associated regions of interest is akin to the problems described in the statistical 'bump hunting' literature.

Methods: New statistical challenges arise when the measurements are continuous rather than categorical, when they are measured with uncertainty, and when both the biological signal and the measurement errors are characterized by spatial correlation along the genome. Perhaps the most challenging complication is that continuous genomic data from large studies are collected over long periods, making them susceptible to 'batch effects'. An example that combines all three characteristics is genome-wide DNA methylation measurements. Here, we present a data analysis pipeline that effectively models measurement error, removes batch effects, detects regions of interest and attaches statistical uncertainty to identified regions.

Results: We illustrate the usefulness of our approach by detecting genomic regions of DNA methylation associated with a continuous trait in a well-characterized population of newborns. Additionally, we show that addressing unexplained heterogeneity like batch effects reduces the number of false-positive regions.

Conclusions: Our framework offers a comprehensive yet flexible approach for identifying genomic regions of biological interest in large epidemiological studies using quantitative high-throughput methods.

Figures

Figure 1
Example of a differentially methylated region (DMR). (A) The points show methylation measurements from the colon cancer dataset plotted against genomic location from an illustrative region on chromosome 2. Eight normal and eight cancer samples are shown in this plot and represented by eight blue points and eight red points at each genomic location for which measurements were available. The curves represent the smooth estimate of the population-level methylation profiles for cancer (red) and normal (blue) samples. The green bar represents a region known to be a cancer DMR. (B) The black curve is an estimate of the population-level difference between normal and cancer. We expect the curve to vary due to measurement error and biological variation but to rarely exceed a certain threshold, for example those represented by the red horizontal lines. Candidate DMRs are defined as the regions for which this black curve is outside these boundaries. Note that the DMR manifests as a bump in the black curve.
Figure 2
Step-by-step illustration of our bump-hunting algorithm. (A) Logit-transformed methylation measurements are plotted against the outcome of interest (gestational age) for a specific probe j. A regression line obtained from fitting the model presented in Equation 1 is shown as well. The estimated slope for probe j is retained for the next step. (B) For 48 consecutive probes, the estimated slopes are plotted against their genomic location t_j. The specific estimated slope from the probe in (A) is indicated by ‘A’ and an arrow. The blue curve represents the smooth estimate of the slopes obtained using loess. (C) The smooth estimate from (B) is shown, but here with predefined thresholds represented by red horizontal lines. The region for which the smooth estimate exceeds the lower threshold is considered a candidate DMR. The area shaded in grey is used as a summary statistic. (D) A null distribution for the area summary statistic described in (C) is estimated using permutations (as described in the text). The histogram summarizes the null areas obtained from permutations and estimates the null distribution. The area obtained from the region shown in (C) is highlighted with an arrow and the label ‘C’. Note that this candidate DMR is not statistically significant, as an area this large can easily arise by chance.
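The four steps in the Figure 2 caption can be sketched in code. What follows is a minimal illustrative Python sketch, not the authors' implementation. It substitutes a simple moving average for loess, and the function names, the data layout (`meth[j][i]` is the methylation proportion of sample i at probe j) and the add-one p-value convention are all assumptions made for illustration.

```python
import math
import random

def logit(p):
    """Logit transform of a methylation proportion in (0, 1)."""
    return math.log(p / (1.0 - p))

def ols_slope(x, y):
    """Step A: least-squares slope of y on x for a single probe."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def moving_average(values, window):
    """Step B: smooth slopes along the genome (stand-in for loess)."""
    half = window // 2
    out = []
    for j in range(len(values)):
        lo, hi = max(0, j - half), min(len(values), j + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def find_bumps(smooth, threshold):
    """Step C: runs of consecutive probes beyond +/- threshold.

    Returns (first_probe, last_probe, area) per run, where area is the
    sum of absolute smoothed values inside the run.
    """
    bumps, start, sign = [], None, 0
    for j, s in enumerate(list(smooth) + [0.0]):  # sentinel closes a trailing run
        cur = 1 if s > threshold else (-1 if s < -threshold else 0)
        if cur != sign:
            if sign != 0:
                area = sum(abs(v) for v in smooth[start:j])
                bumps.append((start, j - 1, area))
            start, sign = j, cur
    return bumps

def bump_hunt(meth, x, window, threshold):
    """Steps A to C for a probes-by-samples table of proportions."""
    slopes = [ols_slope(x, [logit(p) for p in probe]) for probe in meth]
    return find_bumps(moving_average(slopes, window), threshold)

def permutation_pvalues(meth, x, window, threshold, n_perm, seed=0):
    """Step D: compare each observed area with areas from permuted outcomes."""
    rng = random.Random(seed)
    observed = bump_hunt(meth, x, window, threshold)
    null_areas = []
    for _ in range(n_perm):
        xp = list(x)
        rng.shuffle(xp)  # break the link between outcome and methylation
        null_areas += [a for _, _, a in bump_hunt(meth, xp, window, threshold)]
    results = []
    for start, end, area in observed:
        exceed = sum(1 for a in null_areas if a >= area)
        # add-one convention (an assumption) keeps p strictly positive
        results.append((start, end, area, (exceed + 1) / (len(null_areas) + 1)))
    return results
```

A call such as `permutation_pvalues(meth, x, window=9, threshold=k, n_perm=1000)` returns one tuple per candidate region; the window and threshold are tuning parameters that the analyst must choose.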
Figure 3
Receiver operating characteristic curves obtained from Monte Carlo simulation. True-positive rate is plotted against false-positive rate for various tuning parameters needed for the bump hunting procedure. We examined the performance of three choices for the threshold used to define candidate DMRs. The three choices are represented by line type (solid, dashed, dotted). Specifically, we compared the performance of using the 95th, 99th and 99.9th percentile of the slope estimates. We also compared three choices of smoothing parameters used by loess: no smoothing and smoothing windows of 9 probes (675 bp) and 15 probes (1125 bp). These are represented by colour. We assessed performance in two scenarios. (A) We inserted 10 true DMRs each 10 probes long (~750 bp) with true effect size β = 0.01. (B) As in (A), but true DMRs were 20 probes long (~1500 bp) with the same effect size.
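To make the quantities in the Figure 3 caption concrete, here is a small hedged Python sketch of a percentile-based threshold and of a single point on a receiver operating characteristic curve. It is an illustration only: the nearest-rank percentile rule, the probe-level definition of true and false positives, and the function names are assumptions, and the simulation summarized in the caption may define them differently.

```python
import math

def percentile_threshold(values, pct):
    """Nearest-rank percentile (pct in 0-100) of the absolute values."""
    ranked = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(pct * len(ranked) / 100))
    return ranked[rank - 1]

def roc_point(called, truth):
    """Return (false-positive rate, true-positive rate) at the probe level.

    called[j] is True if probe j lies in a candidate DMR;
    truth[j] is True if probe j lies in a simulated true DMR.
    Assumes at least one true and one null probe.
    """
    tp = sum(1 for c, t in zip(called, truth) if c and t)
    fp = sum(1 for c, t in zip(called, truth) if c and not t)
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    return fp / negatives, tp / positives
```

Sweeping the percentile from low to high and recording `roc_point` at each value traces out one curve; repeating this for each smoothing window gives one curve per window.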
Figure 4
Illustration of batch effects. (A) A multidimensional scaling (MDS) plot of tumour (‘C’ label) and matched normal (‘N’ label) colon mucosa samples, processed during two different dates (green is batch 1 and orange is batch 2). Note the strong horizontal separation between the two batches. Note that the batch variability is stronger than the biological variability represented by the vertical separation between the disease states. (B) The points show methylation measurements from the colon cancer data set plotted against genomic location. Batches one and two are represented by 10 green and 6 orange points. The curves represent the smooth estimate of the batch-level methylation profiles for batch one (green) and two (orange). The horizontal lines represent a false DMR driven by batch. (C) As in (B) but after removing batch effects with SVA


