Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape

Aurélien Griffon¹, Quentin Barbier¹, Jordi Dalino¹, Jacques van Helden¹, Salvatore Spicuglia¹, Benoit Ballester²

Affiliations

¹ INSERM, UMR1090 TAGC, Marseille, F-13288, France; Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France.
² INSERM, UMR1090 TAGC, Marseille, F-13288, France; Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France. benoit.ballester@inserm.fr

PMID: 25477382
PMCID: PMC4344487
DOI: 10.1093/nar/gku1280

Aurélien Griffon et al. Nucleic Acids Res. 2015 Feb 27;43(4):e27. doi: 10.1093/nar/gku1280. Epub 2014 Dec 3.

Abstract

The large collections of ChIP-seq data rapidly accumulating in public data warehouses provide genome-wide binding site maps for hundreds of transcription factors (TFs). However, the extent of the regulatory occupancy space in the human genome has not yet been fully assessed by integrating public ChIP-seq data sets and combining them with the ENCODE TF maps. To enable genome-wide identification of regulatory elements, we collected, analysed and retained 395 available ChIP-seq data sets and merged them with ENCODE peaks, covering a total of 237 TFs. This enhanced repertoire complements and refines current genome-wide occupancy maps by increasing the human genome regulatory search space by 14% compared to ENCODE alone, and also increases the complexity of the regulatory dictionary. As a direct application, we used this unified binding repertoire to annotate variant enhancer loci (VELs) defined from the H3K4me1 mark in two cancer cell lines (MCF-7, CRC) and observed enrichment of specific TFs involved in key biological functions related to cancer development and proliferation. These TF enrichments within VELs provide a direct annotation of non-coding regions detected in cancer genomes. Finally, the full catalogue is available online, together with a TF enrichment analysis tool (http://tagc.univ-mrs.fr/remap/).

Figures

**Figure 1.**
ChIP-seq binding pattern of 395 data sets. (A) A genome browser example of complex ChIP-seq binding patterns of the 395 data sets at the SMAD4/ELAC1 promoters, and a detailed view of the redundant peaks for a FOXA1 site. The genome tracks shown are the ChIP-seq peak summits (black vertical lines), the 100 vertebrates conservation track from UCSC and the condensed ENCODE TF bindings. (B) Co-binding correlation patterns of the 395 data sets are clustered and shown as a heatmap, with blue to red indicating low to high correlation for each pair of co-localized data sets. Co-binding relationships between TFs and cell types across all data sets are observable. Co-localization clusters are highlighted with coloured bars and (C) some clustered data sets are shown in detail (e.g. ESR1 in MCF-7 cells).

**Figure 2.**
ChIP-seq peaks and CRMs. (A) A schematic diagram of the three types of regulatory regions: all peaks, non-redundant peaks and CRMs. Peaks for the same TF overlapping the same region were merged into single peaks, defined as non-redundant. Each genomic region bound by at least two different TFs was grouped into a CRM. (B) Proportion of single and combined binding sites observed after identification of CRMs. The vertical barplot corresponds to the proportion of CRMs found in each combinatorial binding category across all identified CRMs. (C) Genomic distribution of single or combined binding sites in six different genomic regions. The percentage of binding sites in each category is shown on the vertical axis, for the overall genome, singletons and each combinatorial binding complexity from 2 to 50+ TFs. (D) Distribution of CRMs at TSS (±2.5 kb) for increasing levels of combinatorial binding complexity from 2 to 50+ TFs. (E) Proportion of our regulatory catalogue covering different types of genomic features. Percentages of elements recovered are shown for CRMs only (green) and both CRMs and singletons (blue). (F) The WebLogo position weight matrix diagrams for CTCF identified across the diverse databases, showing subtle position-specific differences. (G) DNA sequence constraint around the peak summits of FOXA1, CTCF, CEBPA and NFYB, plotted as observed-expected GERP scores (22).
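The grouping described in panel (A) can be illustrated with a minimal Python sketch. It assumes peaks are half-open `(chrom, start, end, tf)` tuples and that any overlap of at least 1 bp counts as "overlapping"; the exact overlap criteria used by the authors are not stated in this caption, and the function names `non_redundant_peaks` and `call_crms` are hypothetical.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def non_redundant_peaks(peaks):
    """Merge overlapping peaks of the same TF on the same chromosome."""
    by_key = defaultdict(list)
    for chrom, start, end, tf in peaks:
        by_key[(chrom, tf)].append((start, end))
    result = []
    for (chrom, tf), intervals in by_key.items():
        for start, end in merge_intervals(intervals):
            result.append((chrom, start, end, tf))
    return sorted(result)


def call_crms(nr_peaks):
    """Split merged regions into CRMs (>= 2 distinct TFs) and singletons.

    Assumption: a region is the union of transitively overlapping peaks.
    """
    crms, singletons = [], []

    def flush(region):
        chrom, start, end, tfs = region
        target = crms if len(tfs) >= 2 else singletons
        target.append((chrom, start, end, sorted(tfs)))

    current = None  # [chrom, start, end, set of TF names]
    for chrom, start, end, tf in sorted(nr_peaks):
        if current and chrom == current[0] and start < current[2]:
            current[2] = max(current[2], end)
            current[3].add(tf)
        else:
            if current:
                flush(current)
            current = [chrom, start, end, {tf}]
    if current:
        flush(current)
    return crms, singletons


# Toy data (invented coordinates, for illustration only).
example_peaks = [
    ("chr1", 100, 200, "FOXA1"),
    ("chr1", 150, 250, "FOXA1"),
    ("chr1", 180, 300, "CTCF"),
    ("chr1", 1000, 1100, "CTCF"),
]
example_nr = non_redundant_peaks(example_peaks)
example_crms, example_singletons = call_crms(example_nr)
```

In this toy input the two FOXA1 peaks collapse into one non-redundant peak, which then overlaps a CTCF peak to form a single CRM; the isolated CTCF peak remains a singleton.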

**Figure 3.**
Comparison with ENCODE and integration with public data. (A) Comparison of public regulatory regions versus ENCODE regions. The vertical barplots correspond to the proportion of TFBS from the integrative analysis of public data that can be recapitulated in the ENCODE CRMs and singletons. ‘No overlap’ corresponds to potential novel regulatory regions. Overlap analyses are performed both ways. (B) A genome browser example of binding patterns from public data only, and the complemented patterns from the merge of public and ENCODE data. The genome tracks shown are the ChIP-seq peak summits (black vertical lines) and the 100 vertebrates conservation track from UCSC. (C) Venn diagrams of TFs, CRMs and regulatory features (CRMs and singletons) between the public set and ENCODE. (D) Genomic distribution of single or combined binding sites in six different genomic regions. The percentage of binding sites in each category is shown on the vertical axis for singletons and each combinatorial binding complexity from 2 to 100+ TFs. (E) Saturation analysis of the ReMap data with increasing numbers of TFs. The plot is generated from the merge of the public and ENCODE TFBS catalogues. It illustrates the saturation of CRMs identified by TF ChIP-seq as additional factors are analysed across the multi-cell integrative analysis. We calculate CRM counts across the genome from an increasing number of randomly selected TFs. The distribution of CRM counts for 100 TF selections is plotted as a boxplot on the x-axis. We repeat this for all incremental steps up to and including all TFs. A lowess line smoothing the medians of the CRM counts is highlighted in orange.
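The saturation procedure in panel (E) can be sketched as a resampling loop. This is a minimal Python illustration under stated assumptions, not the authors' pipeline: peaks are half-open `(chrom, start, end)` tuples keyed by TF name, a CRM is any merged region bound by at least two distinct TFs, "100 TFs selection" is read as 100 random draws per step, and the names `count_crms` and `saturation_curve` are hypothetical. The lowess smoothing of the medians is left out.

```python
import random
from statistics import median


def count_crms(peaks_by_tf, tfs):
    """Count merged regions bound by >= 2 distinct TFs among `tfs`."""
    peaks = sorted(
        (chrom, start, end, tf)
        for tf in tfs
        for chrom, start, end in peaks_by_tf[tf]
    )
    count = 0
    cur_chrom, cur_end, cur_tfs = None, None, set()
    for chrom, start, end, tf in peaks:
        if chrom == cur_chrom and start < cur_end:
            # Peak overlaps the current region: extend it.
            cur_end = max(cur_end, end)
            cur_tfs.add(tf)
        else:
            # Close the previous region and start a new one.
            count += len(cur_tfs) >= 2
            cur_chrom, cur_end, cur_tfs = chrom, end, {tf}
    count += len(cur_tfs) >= 2
    return count


def saturation_curve(peaks_by_tf, n_draws=100, step=1, seed=0):
    """Map each TF subset size to the CRM counts of `n_draws` random subsets."""
    rng = random.Random(seed)
    all_tfs = sorted(peaks_by_tf)
    sizes = list(range(step, len(all_tfs) + 1, step))
    if not sizes or sizes[-1] != len(all_tfs):
        sizes.append(len(all_tfs))  # always include the full TF set
    return {
        k: [count_crms(peaks_by_tf, rng.sample(all_tfs, k)) for _ in range(n_draws)]
        for k in sizes
    }


# Toy data (invented coordinates, for illustration only).
toy_peaks = {
    "A": [("chr1", 0, 100)],
    "B": [("chr1", 50, 150)],
    "C": [("chr2", 0, 100)],
}
toy_curve = saturation_curve(toy_peaks, n_draws=10)
toy_medians = {k: median(v) for k, v in toy_curve.items()}
```

Each list in `toy_curve` corresponds to one boxplot in the figure, and `toy_medians` holds the values a smoothing line would be fitted through.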

**Figure 4.**
Network representations of TF co-localization across the genome. (A) In this filtered TF co-localization network, nodes indicate individual TFs and node colours indicate subnetworks identified by applying a partitioning algorithm; edge colours depict the percentage of overlap between TFBS, and weights depict the co-localization specificity between two TFs. Overlapping binding sites were computed using the IntervalStats tool, and co-localization specificity was determined by identifying outliers based on the percentages of significantly overlapping sites. (B) Highlighted subnetworks of highly connected and strongly specific TFs with functional annotations. Barplots represent Gene Ontology *Biological Process* enrichments calculated by DAVID (x-axis = −log10 Benjamini score).

**Figure 5.**
Specific TF signatures in VELs. (A) UCSC browser views of the H3K4me1 profile of a normal mammary epithelial cell line (MCF-10A) and a breast cancer cell line (MCF-7), illustrating an example of a gained VEL (left, in red) and a lost VEL (right, in blue). (B) Similar view of the H3K4me1 profile for a primary CRC (CRC V400) and a normal colon epithelium crypt (C104). (C, D) TFs specifically enriched within regions defined as gained or lost VELs in MCF-7 (C) and CRC (D) cell lines.

References

1. Hoffman M.M., Ernst J., Wilder S.P., Kundaje A., Harris R.S., Libbrecht M., Giardine B., Ellenbogen P.M., Bilmes J.A., Birney E., et al. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Res. 2013;41:827–841.
2. Negre N., Brown C.D., Ma L., Bristow C.A., Miller S.W., Wagner U., Kheradpour P., Eaton M.L., Loriaux P., Sealfon R., et al. A cis-regulatory map of the Drosophila genome. Nature. 2011;471:527–531.
3. Shen Y., Yue F., McCleary D.F., Ye Z., Edsall L., Kuan S., Wagner U., Dixon J., Lee L., Lobanenkov V.V., et al. A map of the cis-regulatory sequences in the mouse genome. Nature. 2012;488:116–120.
4. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.
5. Neph S., Vierstra J., Stergachis A.B., Reynolds A.P., Haugen E., Vernot B., Thurman R.E., John S., Sandstrom R., Johnson A.K., et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature. 2012;489:83–90.
