Bioinformatics. 2016 Sep 1;32(17):i511-i519.
doi: 10.1093/bioinformatics/btw468.

LuxGLM: a probabilistic covariate model for quantification of DNA methylation modifications with complex experimental designs

Tarmo Äijö et al. Bioinformatics. 2016.

Abstract

Motivation: 5-methylcytosine (5mC) is a widely studied epigenetic modification of DNA. The ten-eleven translocation (TET) dioxygenases oxidize 5mC into oxidized methylcytosines (oxi-mCs): 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC). DNA methylation modifications have multiple functions. For example, 5mC has been shown to be associated with diseases, and oxi-mC species are reported to have a role in active DNA demethylation through 5mC oxidation and DNA repair, among others, but the detailed mechanisms are poorly understood. Bisulphite sequencing and its various derivatives can be used to gain information about all methylation modifications at single-nucleotide resolution. Analysis of bisulphite-based sequencing data is complicated by convoluted read-outs and experiment-specific variation in biochemistry. Moreover, statistical analysis is often complicated by various confounding effects. How to analyse 5mC and oxi-mC data sets with arbitrary and complex experimental designs is an open and important problem.

Results: We propose the first method to quantify oxi-mC species with arbitrary covariate structures from bisulphite-based sequencing data. Our probabilistic modeling framework combines a previously proposed hierarchical generative model for oxi-mC-seq data and a general linear model component to account for confounding effects. We show that our method provides accurate methylation level estimates and accurate detection of differential methylation when compared with existing methods. Analysis of novel and published data gave insights into the demethylation of the forkhead box P3 (Foxp3) locus during induced T regulatory cell differentiation. We also demonstrate how our covariate model accurately predicts methylation levels of the Foxp3 locus. Collectively, the LuxGLM method improves the analysis of DNA methylation modifications, particularly for oxi-mC species.

Availability and implementation: An implementation of the proposed method is available under the MIT license at https://github.org/tare/LuxGLM/

Contact: taijo@simonsfoundation.org or harri.lahdesmaki@aalto.fi

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
(A) The conversion chart of C, 5mC, 5hmC, 5fC and 5caC in BS-seq, oxBS-seq, TAB-seq, CAB-seq, fCAB-seq, redBS-seq and MAB-seq experiments. (B) The experimental steps of BS-seq and oxBS-seq experiments are represented in terms of experimental parameters. Green and red arrows depict successful and unsuccessful steps, respectively. (C) The proposed hierarchical model for modeling methylation modification proportions for BS-seq and oxBS-seq data and parts of the original Lux model, represented in plate notation. The grey and white circles represent observed variables and latent variables, respectively. The grey squares represent fixed hyperparameters. The components that model the experimental parameters and control cytosines are the same as in the Lux model (Äijö et al., 2016).
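
The BS-seq and oxBS-seq read-outs in panel (A) can be illustrated with a small sketch. It assumes ideal chemistry: in BS-seq both 5mC and 5hmC are read as C, while in oxBS-seq only 5mC is read as C. Under that assumption the 5mC proportion is the oxBS-seq "C" fraction and the 5hmC proportion is the difference between the two fractions. The function name `naive_proportions` and its arguments are hypothetical and are not part of LuxGLM, which instead models the experimental parameters of panel (B) probabilistically rather than assuming perfect conversion.

```python
def naive_proportions(bs_c, bs_total, oxbs_c, oxbs_total):
    """Point estimates of (C, 5mC, 5hmC) proportions at one cytosine.

    Assumes perfect conversion (an idealisation, not the LuxGLM model).
    The "C" proportion here also absorbs 5fC and 5caC, which read as T
    in both experiments.
    """
    if bs_total <= 0 or oxbs_total <= 0:
        raise ValueError("read totals must be positive")

    bs_frac = bs_c / bs_total        # 5mC + 5hmC are read as C in BS-seq
    oxbs_frac = oxbs_c / oxbs_total  # only 5mC is read as C in oxBS-seq

    p_5mc = oxbs_frac
    # Sampling noise can make the difference negative; clip it at zero.
    p_5hmc = max(bs_frac - oxbs_frac, 0.0)
    p_c = 1.0 - bs_frac

    # Renormalise so the three proportions sum to one after clipping.
    total = p_c + p_5mc + p_5hmc
    return p_c / total, p_5mc / total, p_5hmc / total


if __name__ == "__main__":
    # Hypothetical counts: 80/100 "C" reads in BS-seq, 50/100 in oxBS-seq.
    print(naive_proportions(80, 100, 50, 100))
```

The clipping step shows why such point estimates are fragile at low coverage: when the oxBS-seq fraction exceeds the BS-seq fraction, the subtraction yields an impossible negative value that has to be patched after the fact.
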
Fig. 2.
(A) Ternary plot representations of the two different conditions considered (columns) and the corresponding two batches (‘pure’ on top row; ‘garbled’ on bottom row). (B) The ternary plot shows the condition-specific posterior distributions obtained using LuxGLM. The samples of θ corresponding to the ‘pure’ samples are used. The estimates of conditions 1 and 2 are on the left and right, respectively. The white dots and grey triangles are the MLML estimates for the ‘garbled’ and ‘pure’ samples, respectively. The analysis is done with 20 (10 ‘pure’ and 10 ‘garbled’) replicates per condition. (C) The Bayes factors (BFs) obtained using the full or reduced model are compared. The full model has covariates for the condition and batch, whereas the reduced model has only a covariate for the condition. The data in the box plots are the changes of the BFs (log2). The analysis is done either with 6 (3 ‘pure’ and 3 ‘garbled’), 10 (5 ‘pure’ and 5 ‘garbled’) or 20 (10 ‘pure’ and 10 ‘garbled’) replicates per condition. The box plots are derived from 200 random simulations.
Fig. 3.
(A) Ternary plot representations of the two similar conditions considered (columns) and the corresponding two batches (‘pure’ on top row; ‘garbled’ on bottom row). (B) The ternary plot shows the condition-specific posterior distributions obtained using LuxGLM. The samples of θ corresponding to the ‘pure’ samples are used. The estimates of conditions 1 and 2 are on the top and bottom row, respectively. The white dots and grey triangles are the MLML estimates for the ‘garbled’ and ‘pure’ samples, respectively. The analysis is done with 20 (10 ‘pure’ and 10 ‘garbled’) replicates per condition. (C) The Bayes factors (BFs) obtained using the full or reduced model are compared. The full model has covariates for the condition and batch, whereas the reduced model has only a covariate for the condition. The data in the box plots are the changes of the BFs (log2). The analysis is done either with 6 (3 ‘pure’ and 3 ‘garbled’), 10 (5 ‘pure’ and 5 ‘garbled’) or 20 (10 ‘pure’ and 10 ‘garbled’) replicates per condition. The box plots are derived from 200 random simulations. (D) A receiver operating characteristic analysis of the discriminative abilities of the full and reduced models. Differentially (N = 200) and similarly methylated (N = 200) cytosines are generated as in Figures 2A and 3A, respectively. The cases of 6 (3 ‘pure’ and 3 ‘garbled’), 10 (5 ‘pure’ and 5 ‘garbled’) and 20 (10 ‘pure’ and 10 ‘garbled’) replicates are considered. The cytosines are ordered based on the BFs and the receiver operating characteristic curves are derived. The areas under the curves are listed in parentheses.
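
The ranking step in panel (D) can be sketched as follows. Given one score per cytosine (here standing in for a BF) and a label saying whether the cytosine was simulated as differentially methylated, the area under the receiver operating characteristic curve equals the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as one half. The function name `roc_auc` and the toy scores are hypothetical; this is a generic illustration, not the authors' evaluation code.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from scores and binary labels.

    Uses the pairwise (Mann-Whitney) formulation: the fraction of
    positive/negative pairs in which the positive has the higher score,
    counting ties as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy log-BFs: label 1 = differentially methylated, 0 = similar.
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    toy_labels = [0, 0, 1, 1]
    print(roc_auc(toy_scores, toy_labels))
```

The pairwise form is quadratic in the number of cytosines, which is fine for a simulation of the size described in the caption (200 positives and 200 negatives); a rank-based implementation would scale better for genome-wide use.
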
Fig. 4.
(A) The posterior distributions of the parameter matrix B defined in Supplementary Equation (S15) of two CpG cytosines within the Foxp3 CNS1 locus. The prior and posterior distributions are shaded in blue and green, respectively. The red lines depict the posterior means. The log10-transformed BFs of individual covariates are listed. (B) Predicted proportions of unmodified Cnm of the cytosine chrX:7159069 in the Foxp3 CNS1 locus at different time points after the TGF-β and VitC stimuli. The posterior model parameters are estimated from BS-seq and oxBS-seq data at time points 16, 2, 48, 72 h, and the predicted levels of unmethylated Cs at the time points 32, 40, 56, 64 h (shaded rectangles) are obtained using the posterior parameter samples of B. The means with the SDs are depicted. (C) The posterior distributions of the parameter matrix B defined in Supplementary Equation (S18) of two CpG cytosines within the Foxp3 CNS1 locus. The prior and posterior distributions are shaded in blue and green, respectively. The red lines depict the posterior means. The log10-transformed BFs are listed.

References

    1. Äijö T. et al. (2016) A probabilistic generative model for quantification of DNA modifications enables analysis of demethylation pathways. Genome Biol., 17, 1–22.
    2. Baylin S.B. (2005) DNA methylation and gene silencing in cancer. Nat. Clin. Pract. Oncol., 2(Suppl 1), S4–11.
    3. Booth M.J. et al. (2012) Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution. Science, 336, 934–937.
    4. Booth M.J. et al. (2014) Quantitative sequencing of 5-formylcytosine in DNA at single-base resolution. Nat. Chem., 6, 435–440.
    5. Carpenter B. et al. (in press) Stan: A probabilistic programming language. J. Stat. Softw.