Genome Biol. 2014 Dec 3;15(12):503.
doi: 10.1186/s13059-014-0503-2.

Functional normalization of 450k methylation array data improves replication in large cancer studies

Jean-Philippe Fortin et al. Genome Biol. 2014.

Abstract

We propose an extension to quantile normalization that removes unwanted technical variation using control probes. We adapt our algorithm, functional normalization, to the Illumina 450k methylation array and address the open problem of normalizing methylation data with global epigenetic changes, such as human cancers. Using data sets from The Cancer Genome Atlas and a large case-control study, we show that our algorithm outperforms all existing normalization methods with respect to replication of results between experiments, and yields robust results even in the presence of batch effects. Functional normalization can be applied to any microarray platform, provided suitable control probes are available.
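For orientation, the baseline that the abstract says is being extended, plain quantile normalization, can be sketched as follows. This is a minimal illustration only, not the authors' functional normalization: it uses no control probes, the function name `quantile_normalize` is hypothetical, and the input is assumed to be a probes-by-samples matrix of intensities.

```python
import numpy as np

def quantile_normalize(x):
    """Give every column (sample) of x the same empirical distribution.

    Assumption: x is a probes-by-samples array. Ties within a column are
    broken by position, which is a simplification.
    """
    x = np.asarray(x, dtype=float)
    # Sort order of the probes within each sample.
    order = np.argsort(x, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean of the sorted values across samples.
    target = sorted_vals.mean(axis=1)
    # Write the reference quantiles back in each sample's original probe order.
    out = np.empty_like(x)
    np.put_along_axis(out, order, target[:, None], axis=0)
    return out
```

After this step all samples share one distribution, which is why the abstract treats data with genuine global methylation changes, such as cancers, as an open problem for normalization.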

Figures

Figure 1
Control probes act as surrogates for batch effects. (a) Heat map of a summary (see Materials and methods) of the control probes, with samples on the y-axis and control summaries on the x-axis. Samples were processed on a number of different plates indicated by the color label. Only columns have been clustered. (b) The first two principal components of the matrix depicted in (a). Samples partially cluster according to batch, with some batches showing tight clusters and others being more diffuse. (c) The distribution of methylated intensities averaged by plate. These three panels suggest that the control probe summaries partially measure batch effects. PC, principal component.
Figure 2
Improvements in replication for the EBV data set. (a) ROC curves for replication between a discovery and a validation data set. The validation data set was constructed to show in silico batch effects. The dotted and solid lines represent, respectively, the commonly used false discovery rate cutoffs of 0.01 and 0.05. (b) Concordance curves showing the percentage overlap between the top k DMPs in the discovery and validation cohorts. Additional normalization methods are assessed in Additional file 1: Figure S3. Functional normalization shows a high degree of concordance between data sets. (c) The percentage of the top 100,000 DMPs that are replicated between the discovery and validation cohorts and also inside a differentially methylated block or region from Hansen et al. [38]. DMP, differentially methylated position; EBV, Epstein–Barr virus; Funnorm, functional normalization; ROC, receiver operating characteristic.
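The concordance curve in panel (b) of Figure 2 can be read as a simple computation: for each k, the percentage of probes that appear in the top k of both cohorts. A minimal sketch follows, assuming probes are ranked by per-probe p-value (smaller is more significant). The ranking statistic and the function name `concordance_curve` are assumptions made for illustration, not details confirmed by this page.

```python
import numpy as np

def concordance_curve(p_discovery, p_validation, ks):
    """Percentage overlap of the top-k probes in two cohorts, for each k in ks.

    Assumption: both inputs hold one p-value per probe, in the same probe order.
    """
    # Probe indices ordered from most to least significant in each cohort.
    rank_d = np.argsort(np.asarray(p_discovery), kind="stable")
    rank_v = np.argsort(np.asarray(p_validation), kind="stable")
    overlap = []
    for k in ks:
        shared = np.intersect1d(rank_d[:k], rank_v[:k]).size
        overlap.append(100.0 * shared / k)
    return np.array(overlap)
```

For example, `concordance_curve(p_disc, p_val, ks=[1000, 10000, 100000])` would give three points on such a curve, where `p_disc` and `p_val` are hypothetical arrays of per-probe p-values.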
Figure 3
Improvements in replication for the TCGA-KIRC data set. (a) ROC curves for replication between a discovery and a validation data set. The validation data set was constructed to show in silico batch effects. (b) Concordance plots between an additional cohort assayed on the 27k array and the validation data set. Additional normalization methods are assessed in Additional file 1: Figure S4. Functional normalization shows a high degree of concordance between data sets. Funnorm, functional normalization; KIRC, kidney clear-cell carcinoma; ROC, receiver operating characteristic; TCGA, The Cancer Genome Atlas.
Figure 4
Improvements in replication of tumor subtype heterogeneity. In the AML data set from TCGA, the same samples have been assayed on 450k and 27k arrays. (a) Concordance plots between results from the 450k array and the 27k array. (b) ROC curves for the 450k data, using the results from the 27k data as gold standard. AML, acute myeloid leukemia; Funnorm, functional normalization; ROC, receiver operating characteristic; TCGA, The Cancer Genome Atlas.
Figure 5
Performance improvements on blood samples data set. (a) ROC curve for replication of case–control differences between blood samples from colon cancer patients and blood samples from normal individuals, the Ontario-Blood data set. The validation data set was constructed to show an in silico batch effect. (b) ROC curve for identification of probes on the sex chromosomes for the Ontario-Sex data set. Sex is confounded by an in silico batch effect. Both evaluations show the good performance of functional normalization. Funnorm, functional normalization; ROC, receiver operating characteristic.
Figure 6
Variance across technical triplicates. Box plots of the probe-specific variances estimated across 19 individuals assayed in technical triplicates. All normalization methods improve upon raw data, and functional normalization performs well. funnorm, functional normalization; w, with.
Figure 7
Spatial location affects overall methylation. Quantiles of the beta distributions adjusted for a slide effect. The 12 vertical stripes are ordered as rows 1 to 6 in column 1 followed by rows 1 to 6 in column 2. (a) 10th percentile for type II probes for the unnormalized AML data set. (b) 15th percentile for type I probes for the unnormalized AML data set. (c) 85th percentile for type II probes for the unnormalized Ontario-EBV data set. (a–c) show that the top of the slide has a different beta distribution from the bottom. (d–f) Like (a–c) but after functional normalization, which corrects this spatial artifact. AML, acute myeloid leukemia.
Figure 8
Comparison to batch effect removal tools SVA, RUV and ComBat. (a) Like Figure 2a, an ROC curve for the Ontario-EBV data set. (b) Like Figure 3a, an ROC curve for the TCGA-KIRC data set. (c) Like Figure 3b, a concordance curve between the validation cohort from 450k data and the 27k data for the TCGA-KIRC data set. (d) Like Figure 4a, concordance plots between results from the 450k array and the 27k array for the TCGA-AML data set. (e) Like Figure 5a, an ROC curve for the Ontario-Blood data set. AML, acute myeloid leukemia; EBV, Epstein–Barr virus; Funnorm, functional normalization; ROC, receiver operating characteristic.
Figure 9
Effect size of the top replicated loci. Box plots represent the effect sizes for the top k loci from the discovery cohort that are replicated in the validation cohort. The effect size is measured as the difference on the beta value scale between the two treatment group means. (a) Box plots for the top k = 100,000 loci replicated in the Ontario-EBV data set. (b) Box plots for the top k = 100,000 loci replicated in the TCGA-KIRC data set. EBV, Epstein–Barr virus; Funnorm, functional normalization; w, with.
Figure 10
Sample size simulation for the Ontario-EBV data set. Partial discovery–validation ROC curves for the Ontario-EBV data set similar to Figure 2a but for random subsamples of different sizes n = 10, 20, 30, 50 and 80. Each solid line represents the mean of the ROC results for B = 100 subsamples of size n. The dotted lines represent the 0.025 and 0.975 percentiles. EBV, Epstein–Barr virus; Funnorm, functional normalization; ROC, receiver operating characteristic.

References

    1. Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, Le JM, Delano D, Zhang L, Schroth GP, Gunderson KL, Fan JB, Shen R. High density DNA methylation array with single CpG site resolution. Genomics. 2011;98:288–295. doi: 10.1016/j.ygeno.2011.07.007.
    2. Rakyan VK, Down TA, Balding DJ, Beck S. Epigenome-wide association studies for common human diseases. Nat Rev Genet. 2011;12:529–541. doi: 10.1038/nrg3000.
    3. Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, Reinius L, Acevedo N, Taub M, Ronninger M, Shchetynsky K, Scheynius A, Kere J, Alfredsson L, Klareskog L, Ekström TJ, Feinberg AP. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol. 2013;31:142–147. doi: 10.1038/nbt.2487.
    4. Feinberg AP, Vogelstein B. Hypomethylation distinguishes genes of some human cancers from their normal counterparts. Nature. 1983;301:89–92. doi: 10.1038/301089a0.
    5. Gama-Sosa MA, Slagel VA, Trewyn RW, Oxenhandler R, Kuo KC, Gehrke CW, Ehrlich M. The 5-methylcytosine content of DNA from human tumors. Nucleic Acids Res. 1983;11:6883–6894. doi: 10.1093/nar/11.19.6883.