Epigenetics Chromatin. 2017 May 15;10:26.

doi: 10.1186/s13072-017-0131-7. eCollection 2017.

A computational approach for the functional classification of the epigenome

Francesco Gandolfi¹, Anna Tramontano¹,²

Affiliations

¹ Department of Physics, Sapienza University of Rome, Piazzale Aldo Moro 2, 00185 Rome, Italy.
² Istituto Pasteur Italia - Fondazione Cenci Bolognetti, Viale Regina Elena 291, 00161 Rome, Italy.

PMID: 28515787
PMCID: PMC5433140
DOI: 10.1186/s13072-017-0131-7

Francesco Gandolfi et al. Epigenetics Chromatin. 2017.

Abstract

Background: In the last decade, advanced functional genomics approaches and deep sequencing have allowed large-scale mapping of histone modifications and other epigenetic marks, highlighting functional relationships between chromatin organization and genome function. Here, we propose a novel approach to explore functional interactions between different epigenetic modifications and extract combinatorial profiles that can be used to annotate chromatin into a finite number of functional classes. Our method is based on non-negative matrix factorization (NMF), an unsupervised learning technique originally employed to decompose high-dimensional data into a reduced number of meaningful patterns. We applied the NMF algorithm to a set of different epigenetic marks, consisting of ChIP-seq assays for multiple histone modifications, Pol II binding and chromatin accessibility assays from human H1 cells.

Results: We identified a number of chromatin profiles that contain functional information and are biologically interpretable. We also observe that epigenetic profiles are characterized by specific genomic contexts and show significant association with distinct genomic features. Moreover, analysis of RNA-seq data reveals that distinct chromatin signatures correlate with the level of gene expression.

Conclusions: Overall, our study highlights the utility of NMF in studying functional relationships between different epigenetic modifications and may provide new biological insights for the interpretation of chromatin dynamics.

Keywords: Chromatin profiles; Epigenetic mark combinations; NMF.

Figures

**Fig. 1**
Non-negative matrix factorization of epigenetic data. The scheme gives an intuitive representation of how NMF can be used to approximate a multivariate epigenetic signal with a pre-defined number of signal patterns. The algorithm takes as input a data matrix (V) with rows corresponding to a series of genomic intervals (or loci) and columns corresponding to different epigenetic tracks for the marks. Each cell in the matrix holds the normalized/background-corrected signal of a given epigenetic mark (y) in a given locus (x) (a). As a result, a standard NMF procedure yields two sparse matrices, W (the weight matrix) and H (the coefficient matrix), describing the contribution of each code/profile to single loci and single marks, respectively (b)
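
The factorization V ≈ WH described in this caption can be sketched in a few lines. The caption does not say which NMF solver was used, so the Lee and Seung multiplicative updates below are an assumption, and the toy matrix, the function names (`nmf`, `frobenius_error`) and the parameters are illustrative only.

```python
import random

def matmul(a, b):
    """Plain-Python product of two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, rank, n_iter=500, seed=0, eps=1e-9):
    """Factorize a non-negative V (loci x marks) into W (loci x rank) and H (rank x marks).

    Assumption: multiplicative updates minimizing the Frobenius norm of V - WH.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

def frobenius_error(v, w, h):
    """Frobenius norm of the reconstruction residual V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

# Toy input: 4 loci (rows) x 3 marks (columns), built from two patterns.
V = [[1.0, 0.0, 2.0],
     [2.0, 0.0, 4.0],
     [0.0, 3.0, 0.0],
     [0.0, 6.0, 0.0]]
W, H = nmf(V, rank=2)
```

Each row of `W` gives the weight of every profile at one locus, and each row of `H` gives the contribution of every mark to one profile, which is the reading of W and H given in the caption.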

**Fig. 2**
Chromatin patterns definition and interpretation. *Upper panel* a: color-scale heatmap showing the hierarchical clustering of 13 different epigenetic marks on the coefficient matrix H obtained with a factorization rank of seven. Each cell (x, y) in the matrix indicates a pattern coefficient reflecting the contribution of code x to epigenetic track y. Hierarchical clustering analysis clearly identifies different subgroups of *marks*. b: *color-scale* heatmap and hierarchical clustering using the average normalized signal of each *mark* (*columns*) across all genomic intervals of a given profile (*rows*). Average signals are centered and scaled such that the mean of the epigenetic mark in each *column* is zero

**Fig. 3**
Enrichment of chromatin profiles with respect to genomic features. The heatmap represents the enrichment of each chromatin profile (*rows*) compared to different regions of the gene and distinct types of genomic features in the genome (*columns*). The enrichment is defined as a log odds ratio as described in “Methods”. Positive associations (odds ratio >1) are colored from *green*/*yellow* to brown, whereas negative associations (odds ratio <1) are indicated in blue. As shown in the heatmap, each combinatorial profile reveals a distinct pattern of enrichment, thus demonstrating the usefulness of the NMF approach in the biological interpretation of the different chromatin functions. In this heatmap, each profile is associated with a specific biological label to facilitate the mnemonic association between the profile and its functional role on the basis of the observed enrichment (*top-bottom*): *ActProm* = Active Promoter (profile 1), *TxInit* = Transcription Initiation (profile 3), *RepReg* = Repressed Regulatory Regions (profile 4), *Enh* = Enhancer Regions (profile 6), *RegEl* = Regulatory Elements (profile 7), *GenBd* = Gene Body Transcription (profile 5), *RepChr* = Repressed Chromatin (profile 2).
Genomic features indicated in the columns are: CAGE = hESC-H1 CAGE clusters from ENCODE; RfTSS = Refseq Transcription Start Sites; RfTES = Refseq Transcription End Sites; 5UTR = Refseq 5′ untranslated regions; 3UTR = Refseq 3′ untranslated regions; H1 Enhancers = Superenhancer regions from hESC; CpG = CpG islands; Upstream = 1Kb upstream regions from Refseq TSSs; DNase1 = hESC DNase1 Hypersensitive sites from ENCODE; TFBS = Conserved transcription factor binding sites from the Transfac Matrix Database; 5C = Chromatin conformation capture carbon copy data from hESC; EnhancersDB = experimentally validated enhancer elements from the VistaEnhancer Database; Rf = Refseq genes; Int = intronic sequences from Refseq genes; Ex = exonic sequences from Refseq genes; PolyA = predicted poly-adenylation sites; sRNA = small RNAs; HMMhetero = predicted heterochromatin regions in hESC
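
The exact enrichment formula is given in the paper's Methods, which are not reproduced here. As a rough illustration of how a log odds ratio separates positive from negative associations, the sketch below assumes a 2 × 2 table of bin counts (in or out of a profile, in or out of a feature) and a 0.5 pseudocount to avoid division by zero; the function name, the counts and the pseudocount are assumptions, not the authors' procedure.

```python
import math

def log_odds_ratio(profile_in_feature, profile_outside_feature,
                   other_in_feature, other_outside_feature, pseudocount=0.5):
    """Natural-log odds ratio of a 2x2 contingency table of bin counts.

    > 0 means the profile is enriched in the feature (odds ratio > 1),
    < 0 means it is depleted (odds ratio < 1).
    """
    a = profile_in_feature + pseudocount
    b = profile_outside_feature + pseudocount
    c = other_in_feature + pseudocount
    d = other_outside_feature + pseudocount
    return math.log((a * d) / (b * c))

# Hypothetical counts: 300 of 400 profile bins fall in the feature,
# against 1000 of 10000 bins assigned to any other profile.
enriched = log_odds_ratio(300, 100, 1000, 9000)
depleted = log_odds_ratio(10, 390, 1000, 9000)
```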

**Fig. 4**
Frequency distribution of chromatin profiles around the transcription start site (TSS) and the gene middle-point (GM). The histograms plot the distribution of the different chromatin profiles around the TSS (a) and the central position of Refseq genes (b). Both distributions are generated on the basis of the observed distance (bp) of each bin to the closest TSS (corresponding to 0 on the x-axis) or GM. As shown in a, two major epigenetic profiles (ActProm and TxInit) are enriched around the promoter region of the gene. In b, the genomic distances are normalized to the gene length, so that the middle-point of the gene is always at position 0 and the gene spans −50 to +50
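
The two x-axes described in this caption can be sketched as follows: a signed distance in bp to the closest TSS, and a position rescaled so that every gene runs from −50 to +50 around its middle-point. The function names, the strand handling and the example coordinates are assumptions made for illustration.

```python
import bisect

def distance_to_closest_tss(bin_center, sorted_tss):
    """Signed distance (bp) from a bin center to the nearest TSS.

    sorted_tss must be a non-empty, ascending list of TSS coordinates
    on the same chromosome.
    """
    i = bisect.bisect_left(sorted_tss, bin_center)
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    closest = min(candidates, key=lambda tss: abs(bin_center - tss))
    return bin_center - closest

def position_relative_to_gene_middle(bin_center, gene_start, gene_end, strand="+"):
    """Rescale a position so the gene spans -50 (5' end) to +50 (3' end)."""
    length = gene_end - gene_start
    if length <= 0:
        raise ValueError("gene_end must be greater than gene_start")
    midpoint = gene_start + length / 2.0
    scaled = 100.0 * (bin_center - midpoint) / length
    # Assumption: flip the axis for minus-strand genes so 5' is always negative.
    return scaled if strand == "+" else -scaled

tss_positions = [1000, 5000]
print(distance_to_closest_tss(4900, tss_positions))           # -100
print(position_relative_to_gene_middle(1500, 1000, 3000))     # -25.0
```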

**Fig. 5**
Recovery power of chromatin profiles compared to single chromatin marks. The plots show the Receiver Operating Characteristic (ROC) curves for the ability of different chromatin profiles and single marks to recover Refseq TSSs (a), Refseq upstream regions (b) and experimentally validated enhancers (c)
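
For readers unfamiliar with ROC analysis, the sketch below shows how a curve of this kind could be built from a per-bin score (for example a profile weight or a single-mark signal) and a binary label saying whether the bin overlaps the target feature. The scoring scheme and the toy values are assumptions; the caption does not specify how scores were derived.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    scores: higher means "more likely to be the feature".
    labels: 1 if the bin overlaps the feature, 0 otherwise.
    Tied scores are handled as a single threshold step.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: the score separates feature bins from the rest perfectly.
curve = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
print(auc(curve))  # 1.0
```

An area of 0.5 corresponds to a score that carries no information about the feature, so curves for profiles and single marks can be compared on the same scale.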

**Fig. 6**
Co-occurrences of chromatin profiles in bin assignment. Each heatmap in the figure is a 7 × 7 matrix showing the frequency with which each pair of profiles co-occurs at different RWC (Relative Weight Contribution) thresholds (0.95; 0.85; 0.75; 0.5). Chromatin profiles are reported in both *rows* and *columns* using the same labels previously adopted in the manuscript. Profiles from the original bin assignment are reported in the *rows*, whilst additional profiles (*columns*) are progressively co-assigned as the RWC threshold decreases
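
The caption does not define RWC, so the following is only one plausible reading: the RWC of a profile in a bin is taken to be its weight in W divided by the largest weight in that bin, and every profile whose RWC reaches the threshold is co-assigned with the top-weighted (original) profile. Both that definition and the function names are hypothetical.

```python
def co_assigned_profiles(weights, threshold):
    """Indices of profiles whose weight is at least threshold * max weight."""
    top = max(weights)
    if top <= 0:
        return []
    return [k for k, w in enumerate(weights) if w / top >= threshold]

def co_occurrence_counts(w_matrix, threshold):
    """Count, per original profile (rows), how often each profile (columns) is co-assigned.

    The diagonal counts the bins originally assigned to each profile.
    """
    rank = len(w_matrix[0])
    counts = [[0] * rank for _ in range(rank)]
    for row in w_matrix:
        original = max(range(rank), key=row.__getitem__)
        for k in co_assigned_profiles(row, threshold):
            counts[original][k] += 1
    return counts

# Toy weight matrix: 2 bins x 3 profiles.
W_toy = [[1.0, 0.9, 0.1],
         [0.2, 1.0, 0.6]]
for threshold in (0.95, 0.85, 0.75, 0.5):
    print(threshold, co_occurrence_counts(W_toy, threshold))
```

Under this reading, lowering the threshold can only add off-diagonal counts, which matches the caption's statement that additional profiles are progressively co-assigned.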

**Fig. 7**
Association between patterns of chromatin profiles and expression levels. a Heatmaps showing the hierarchical clustering of sets of genes with similar expression levels on the basis of the chromatin profile frequencies in a region of ±2Kb around the transcription start site. For each profile, frequencies are reported in a separate heatmap, with the corresponding *color*-label positioned at the *top* of each matrix. Each *row* in the matrix corresponds to a specific range of expression and is represented by all genes with an RPKM signal in that interval. Intervals are reported as ranges of percentiles derived from the RPKM distribution. Expression ranges are indicated using *color-scale* labels (on the *left*) from *black* (*lowest*) to violet (*highest*). In each heatmap, a region of ±2Kb around the gene TSS is reported in the *columns* (in 200bp bins). Each cell in the heatmap shows the logarithmic fold-change of the observed frequency over that of the random dataset. *Black cells* indicate a null fold-change (around 1), whilst red and blue reflect positive and negative enrichment, respectively. Unsupervised hierarchical clustering identifies five different sub-clusters that closely mirror the different expression levels. b–f Average enrichment of profiles in every subcluster. Each cell in a heatmap shows the enrichment of a given profile in a given bin over the 4-kb region surrounding the TSS, averaged across all genes in the subcluster. The same *color-scale* as in (a) is used
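
As a small illustration of the quantities in this caption, the sketch below lays out the twenty 200-bp bins covering ±2Kb around a TSS and computes a log fold-change of an observed profile frequency over a random-dataset frequency. The base-2 logarithm, the pseudocount and the function names are assumptions; the caption only says the fold-change is logarithmic.

```python
import math

def tss_bins(tss, flank=2000, bin_size=200):
    """Half-open (start, end) bins covering [tss - flank, tss + flank)."""
    return [(start, start + bin_size)
            for start in range(tss - flank, tss + flank, bin_size)]

def log2_fold_change(observed_freq, random_freq, pseudocount=1e-6):
    """log2 of observed over random frequency; 0 means a fold-change of 1."""
    return math.log2((observed_freq + pseudocount) / (random_freq + pseudocount))

bins = tss_bins(10000)
print(len(bins), bins[0], bins[-1])  # 20 (8000, 8200) (11800, 12000)
```

A fold-change of 1 maps to a log value of 0 (the black cells), positive values to enrichment and negative values to depletion.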

**Fig. 8**
Overlap between chromatin profiles and putative TF binding sites from hESC-H1 ChIP-seq data. a The heatmap shows the extent of overlap between ChIP-seq peaks from each transcription factor (along *rows*) and the distribution of a given chromatin profile in both the observed and the random data (*columns*). For each possible combination of TF/profile, a fold-enrichment is calculated following the procedure described in “Methods”. b Heatmap showing the significance of the enrichment for the same combinations of TFs/profiles represented in (a). The *color-scale* indicates the associated p value on the basis of Fisher’s exact test (reported in the −log₁₀ form): *black*: p > 0.01; *brown*: 0.01 > p > 0.0001; *dark red*: 0.0001 > p > 10⁻⁵; *red*: p < 10⁻⁵
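
A Fisher's exact test on a 2 × 2 table of counts can be written directly from the hypergeometric distribution. The caption does not say whether the test was one- or two-sided or how the table was built, so the one-sided (enrichment) version, the table layout and the function names below are assumptions; the color thresholds are taken from the caption.

```python
from math import comb, log10

def fisher_exact_greater(a, b, c, d):
    """One-sided p-value P(X >= a) for the table [[a, b], [c, d]].

    Illustrative layout: a = TF peaks overlapping the profile,
    b = TF peaks elsewhere, c = random regions overlapping the profile,
    d = random regions elsewhere.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)

def significance_color(p):
    """Map a p value to the color classes listed in the caption."""
    if p > 0.01:
        return "black"
    if p > 1e-4:
        return "brown"
    if p > 1e-5:
        return "dark red"
    return "red"

p = fisher_exact_greater(30, 10, 10, 30)
print(-log10(p), significance_color(p))
```

Exact integer arithmetic keeps this correct for small tables; for genome-scale counts a library routine working in log space would be the practical choice.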

**Fig. 9**
Overlap and coverage levels of different genomic elements using enriched profiles/states in each method. Chromatin segmentation approaches are reported in columns, genomic features in rows. Each cell in the heatmap indicates the amount of overlap (a) or coverage (b) observed when intersecting a feature with any profile/state specifically enriched in that feature using NMF or ChromHMM-based methods. The extent of overlap is represented as the mean percentage by which an enriched profile/state overlaps the feature (a). Similarly, we represent the coverage as the mean percentage of a given feature covered by any enriched profile/state. The *color-scale* (from *green* to *purple*) mirrors the amount of information retrieved for each pair of feature/method. Genomic feature labels indicate: Upstream = 1kb upstream region from the Refseq TSS; CpG = CpG islands; RfTSS = genomic window of ±50bp (“Methods”) around the Refseq TSS; RfTES = genomic window of ±50bp around the Refseq TES; RfGenes(exp <25%) = Refseq genes with mean RPKM value smaller than the 25th percentile; RfGenes (exp >75%) = Refseq genes with mean RPKM value higher than the 75th percentile; PolyA = polyadenylation sites from PolyA-database; TFBS = conserved transcription factor binding sites; H1-enhancers = super-enhancer regions predicted in hESC from the dbSuper database; vistaEnhancers = experimentally validated enhancer regions in human; DNaseI = DNase hypersensitive sites from ENCODE project database; sRNA = predicted small RNAs
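
The distinction between overlap and coverage can be made concrete with interval arithmetic. The sketch below works in base pairs on half-open intervals from a single chromosome and reports overlap as the share of the profile that falls inside the feature and coverage as the share of the feature covered by the profile. The paper reports mean percentages, so this per-base version, along with the function names, is a simplifying assumption.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def total_bp(intervals):
    return sum(end - start for start, end in merge(intervals))

def intersection_bp(a, b):
    """Number of base pairs shared by two interval sets."""
    a, b = merge(a), merge(b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared

def overlap_and_coverage(profile, feature):
    """Return (% of profile inside feature, % of feature covered by profile)."""
    shared = intersection_bp(profile, feature)
    return (100.0 * shared / total_bp(profile),
            100.0 * shared / total_bp(feature))

profile_regions = [(0, 100), (200, 300)]
feature_regions = [(50, 150)]
print(overlap_and_coverage(profile_regions, feature_regions))  # (25.0, 50.0)
```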

**Fig. 10**
Examples of genomic visualization of the NMF-based epigenetic profiles on the UCSC Genome Browser. Chromatin profiles are compared with the UCSC Refseq gene annotation tracks at the TMEM139/CASP2 and the MKRN1 loci. NMF profiles are highlighted with the same color scheme adopted in this work and are displayed in the first track of the panel, as indicated by the yellow arrow on the right. a Genomic visualization of chromatin profiles over a 25-kb region encompassing two different genomic loci: the TMEM139 (transmembrane protein 139) and the CASP2 gene. A specific chromatin transition ‘ActProm > TxInit’ is found exactly on the TSS of the CASP2 gene, suggesting the presence of a functionally active promoter. Chromatin profile GenBd is also detected multiple times on both intronic and exonic regions of the gene, indicating that CASP2 is transcriptionally active in hESC-H1 cells. The NMF approach also identified a repressive chromatin region (profile RepReg) on the 3′ end and a potential enhancer element over the 5′ end of the TMEM139 gene, both of which are also confirmed by ChromHMM predictions in the bottom track. b Chromatin profiles at the MKRN1 locus (Makorin ring finger protein 1). MKRN1 appears to be well expressed in hESC-H1 cells, as indicated by the ‘TxInit > ActProm > TxInit’ chromatin motif over the TSS region and the Gene Body Transcription profile that frequently appears in both introns and the last exons of the gene. On the left of the figure, a putative active enhancer (i.e. the chromatin profile sequence ‘Enh > ActProm > Enh’) is predicted over the 3′ end of the gene. This prediction appears to be concordant with the ChromHMM annotation, as indicated by the ‘strong-enhancer’ label in the corresponding chromatin segmentation track. Finally, a putative CTCF-binding region (profile RegEl) also appears in the first intron, suggesting a functional role in the control of MKRN1
