EMBO Rep. 2018 Dec;19(12):e46255.

doi: 10.15252/embr.201846255. Epub 2018 Nov 9.

ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data

Shinya Oki¹, Tazro Ohta², Go Shioi³, Hideki Hatanaka⁴, Osamu Ogasawara⁵, Yoshihiro Okuda⁵, Hideya Kawaji⁶ ⁷, Ryo Nakaki⁸ ⁹, Jun Sese¹⁰ ¹¹, Chikara Meno¹

Affiliations

¹ Department of Developmental Biology, Graduate School of Medical Sciences, Kyushu University, Fukuoka, Japan soki@dev.med.kyushu-u.ac.jp meno@dev.med.kyushu-u.ac.jp.
² Database Center for Life Science, Joint-Support Center for Data Science Research, Research Organization of Information and Systems, Mishima, Shizuoka, Japan.
³ Genetic Engineering Team, RIKEN Center for Life Science Technologies, Kobe, Japan.
⁴ National Bioscience Database Center, Japan Science and Technology Agency, Tokyo, Japan.
⁵ DNA Data Bank of Japan, National Institute of Genetics, Mishima, Shizuoka, Japan.
⁶ Preventive Medicine and Applied Genomics Unit, RIKEN Center for Integrative Medical Sciences, Kanagawa, Japan.
⁷ RIKEN Preventive Medicine and Diagnosis Innovation Program, Saitama, Japan.
⁸ Genome Science Division, Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan.
⁹ Rhelixa Inc., Tokyo, Japan.
¹⁰ Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan.
¹¹ Humanome Lab Inc., Tokyo, Japan.

PMID: 30413482
PMCID: PMC6280645
DOI: 10.15252/embr.201846255

Shinya Oki et al. EMBO Rep. 2018 Dec.

Abstract

We have fully integrated public chromatin immunoprecipitation sequencing (ChIP-seq) and DNase-seq data (n > 70,000) derived from six representative model organisms (human, mouse, rat, fruit fly, nematode, and budding yeast), and have devised a data-mining platform, designated ChIP-Atlas (http://chip-atlas.org). ChIP-Atlas is able to show alignment and peak-call results for all public ChIP-seq and DNase-seq data archived in the NCBI Sequence Read Archive (SRA), which encompasses data derived from GEO, ArrayExpress, DDBJ, ENCODE, Roadmap Epigenomics, and the scientific literature. All peak-call data are integrated to visualize multiple histone modifications and binding sites of transcriptional regulators (TRs) at given genomic loci. The integrated data can be further analyzed to show TR-gene and TR-TR interactions, as well as to examine enrichment of protein binding for given multiple genomic coordinates or gene names. ChIP-Atlas is superior to other platforms in terms of data number and functionality for data mining across thousands of ChIP-seq experiments, and it provides insight into gene regulatory networks and epigenetic mechanisms.

Keywords: ChIP‐seq; DNase‐seq; data mining; enhancer; transcription factor.

Figures

**Figure 1. Overview of the ChIP‐Atlas data set and computational processing**
Numbers of ChIP‐seq and DNase‐seq experiments recorded in ChIP‐Atlas (as of May 2018), indicating the proportion of the data for each species derived from ENCODE, Roadmap Epigenomics, and other projects.
Cumulative number of SRX‐based experiments recorded in ChIP‐Atlas. Data published before and after the launch of ChIP‐Atlas in December 2015 are shown in gray and black, respectively.
Numbers of experiments according to antigen (top) or cell type (bottom) classes for human, fruit fly, and nematode data. PSC, pluripotent stem cell; CDV, cardiovascular.
Overview of data processing. Raw sequence data are downloaded from NCBI SRA, aligned to a reference genome, and subjected to peak calling, all of which can be monitored with the genome browser IGV. All peak‐call data are then integrated for browsing via the “Peak Browser” function, and they can be analyzed for TR–gene (“Target Genes”) or TR–TR (“Colocalization”) interactions as well as subjected to enrichment analysis (“Enrichment Analysis”). All of the results are tagged with curated sample metadata such as antigen and cell type names. In the diagrams, gray components (circles, TRs; arrows, genes) indicate queries by the user, with colored components representing the returned results.

**Figure EV1. Web pages of ChIP‐Atlas**
A, B
A snapshot of the ChIP‐Atlas top page is shown in (A). From this page, users are able not only to access the four main functions of ChIP‐Atlas but also to search for data of interest with a given SRX ID (A, top right) or with keywords such as antigen and cell type names (B).
C
Snapshot of the Web page for the ChIP‐Atlas “Peak Browser” function. Results for the settings shown are presented in Fig 2.
D
Detailed information for SRX187209, including the sample metadata described by ChIP‐Atlas curators and the original data submitter, processing logs, and read quality from DBCLS SRA (http://sra.dbcls.jp). Blue buttons at the top are controllers for showing the alignment and peak‐call data in IGV (“View on IGV”), for downloading these data (“Download”), for viewing the analyzed data by ChIP‐Atlas “Target Genes” and “Colocalization” (“View Analysis”), and for opening external pages showing details for the experimental conditions and materials (“Link Out”). This type of Web page appears on clicking the bars in the “Peak Browser” view (Fig 2) as well as by clicking SRX IDs shown in Web pages for a keyword search (B) or for “Target Genes” (Fig 3A), “Colocalization” (Fig 3B), or “Enrichment Analysis” (Fig EV3) results.

**Figure 2. Example of processed data visualized with “Peak Browser” of ChIP‐Atlas**
ChIP‐Atlas peak‐call data for TRs around the mouse *Foxa2* locus are shown in the IGV genome browser for settings of the “Peak Browser” Web page shown in Fig EV1C. Bars represent the peak regions, with the curated names of the antigens and cell types being shown below the bars and their color indicating the score calculated with the peak‐caller MACS2 (−log₁₀[Q‐value]). Detailed sample information (yellow window) appears on placing the cursor over each bar. Clicking on the bars (asterisks) enables display of the alignment data (top) and detailed information about the experiments (Fig EV1D).

**Figure EV2. Web pages for integrative analyses in ChIP‐Atlas**
A, B
Snapshots of Web pages for ChIP‐Atlas “Target Genes” (A) and “Colocalization” (B) functions. Results for the settings shown are presented in Fig 3A and B, respectively.
C–E
Snapshots of Web pages for the ChIP‐Atlas “Enrichment Analysis” function with submission of genomic coordinates or gene symbols are shown in (C) and (D), respectively. Results for the settings shown are presented in Fig 4A–C and D–F, respectively. At the Web page for “Enrichment Analysis”, a user can submit two sets of genomic intervals in BED format (C) or gene symbols (D): data of interest in the orange area and background data for comparison in the gray area. It is also possible to filter the results according to antigen and cell type classes as well as to set a threshold for the MACS2 score. On clicking the “submit” button, the data are sent to an NIG supercomputer server for performance of the enrichment analysis, as shown in (E). For example, on submission of BED‐formatted genomic regions for hepatocyte enhancers (orange) or enhancers activated in other tissues (gray), the computational server counts the overlaps with the peaks of all SRXs (E, left). After evaluation of the significance of enrichment with Fisher's exact test (E, right), the analyzed data are returned within several minutes to the machine of the user as shown in Fig EV3.
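The overlap-counting step described for panel (E) can be illustrated with a minimal sketch. This is not the ChIP-Atlas server code: the function name `count_overlapping_regions`, the tuple layout, and the choice to count each submitted region at most once are assumptions made for illustration. Intervals are treated as half-open, as in BED.

```python
from bisect import bisect_left
from collections import defaultdict


def count_overlapping_regions(regions, peaks):
    """Count regions that overlap at least one peak.

    regions, peaks: iterables of (chrom, start, end) with half-open
    BED-style coordinates. Hypothetical helper, for illustration only.
    """
    # Group peaks by chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    # For each chromosome, store sorted starts and the running maximum end.
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        max_ends, current = [], -1
        for _, e in intervals:
            current = max(current, e)
            max_ends.append(current)
        index[chrom] = (starts, max_ends)

    hits = 0
    for chrom, start, end in regions:
        if chrom not in index:
            continue
        starts, max_ends = index[chrom]
        # Peaks at positions [0, i) begin before the region ends.
        i = bisect_left(starts, end)
        # The region is hit if any of those peaks ends after the region starts.
        if i > 0 and max_ends[i - 1] > start:
            hits += 1
    return hits


# Toy data, not taken from the paper.
toy_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
toy_regions = [
    ("chr1", 150, 160),  # inside the first peak
    ("chr1", 200, 300),  # touches the first peak's end only: no overlap
    ("chr2", 0, 51),     # overlaps the chr2 peak by one base
    ("chr3", 1, 2),      # chromosome without peaks
]
print(count_overlapping_regions(toy_regions, toy_peaks))
```

Running the same count for the regions of interest and for the background regions gives the two overlap counts that feed the significance test.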

**Figure 3. Examples of analysis with “Target Genes” and “Colocalization” of ChIP‐Atlas**
A
Potential target genes of *Drosophila* Pc are listed on the left with ChIP‐seq data. The colors of the cells of the matrix indicate the MACS2 scores for Pc ChIP‐seq peaks (columns) within TSS ± 1 kb regions of each potential target gene (rows). As the default, the matrix is sorted according to the average of MACS2 scores in each row (“Pc: Average” at top left). Resorting is also possible by clicking the triangles under the SRX of interest at the top (sorted result for SRX681823 is shown). This table was obtained with the queries shown in Fig EV2A.
B
TRs that potentially colocalize with *Drosophila* Pc are listed on the left with their ChIP‐seq information. Each cell of the matrix indicates the similarity between the ChIP‐seq data for Pc (columns) and those for its potential colocalizing partners (rows) as shown by heat colors and calculated with CoLo. As the default, the matrix is sorted according to the average of CoLo scores in each row (“Pc: Average” at top left as shown here). Sorting by an SRX of interest is possible by clicking the triangles at the top. This table was obtained with the queries shown in Fig EV2B.
C
IGV snapshots showing the alignment data (BigWig format) around the *Drosophila ap* and *lbe* gene loci for ChIP‐seq experiments listed on the left in (B). The results suggest that both genes might be regulated by Pc together with its colocalization partners (ph‐d, Scm, and pho). The y‐axes range from 0 to 10 RPM units.

**Figure 4. Analysis of TR enrichment at tissue‐specific enhancers and genes with “Enrichment Analysis” of ChIP‐Atlas**
A–F
The top 15 ChIP‐seq experiments enriched for enhancers (A–C) or genes (D–F) specifically activated in hepatocytes (A and D), blood vessel endothelial cells (B and E), or macrophages (C and F) relative to all other FANTOM5 enhancers (A–C) or RefSeq coding genes (D–F) are shown. The bar charts indicate P‐values for enrichment, with the colors indicating the cell types examined in the experiments according to the palette shown in Fig EV4, where the top 50 ChIP‐seq experiments enriched for the above and other enhancers are also presented. Asterisks next to SRX IDs indicate that the ChIP‐seq data originated from projects other than ENCODE or Roadmap Epigenomics.

**Figure EV3. Examples of “Enrichment Analysis”**
A, B
Snapshots of the results for enrichment analysis of hepatocyte‐specific enhancers with the ChIP‐Atlas “Enrichment Analysis” function, for which other FANTOM5 enhancers (A) or randomly permutated regions (B) were set as background, are shown. The first row in (A), for example, indicates EP300 ChIP‐seq data (SRX100544) for Hep G2 cells. The total number of peaks for EP300 is 24,334, of which 80 peaks overlap with hepatocyte‐specific enhancers (n = 286) and 1,147 peaks overlap with other enhancers (n = 20,509), yielding a P‐value of 1 × 10^−32.1 (Fisher's exact probability test), Q‐value of 1 × 10^−28.3 (Benjamini and Hochberg method), and fold enrichment of 5.00. The table is sorted according to P‐value, with HNF4A/G and FOXA1/2 in Hep G2 being ranked 3rd, 7th, 8th, 10th, and 12th. The table is also graphically summarized in Figs 4A and EV4 (top 15 and 50 experiments, respectively), in which each row of the table is represented by a bar to indicate the P‐value. Note that TR peaks overlapped to a lesser extent with random background (B) than with other FANTOM5 enhancers (A).
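The EP300 row quoted above can be reworked as a short sketch. It assumes the 2 × 2 table is laid out as region set (hepatocyte‐specific vs. other enhancers) by overlap status (overlapping vs. not overlapping), and it uses a one‐sided Fisher's exact test computed from the hypergeometric distribution; the caption does not state which alternative ChIP‐Atlas uses, so the P‐value printed here need not match the reported one exactly. The helper names are hypothetical.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= a) with the table margins held fixed.
    """
    row1 = a + b            # size of the set of interest
    col1 = a + c            # total number of overlapping regions
    total = a + b + c + d
    log_denom = log_choose(total, row1)
    k_max = min(row1, col1)
    log_terms = [
        log_choose(col1, k) + log_choose(total - col1, row1 - k) - log_denom
        for k in range(a, k_max + 1)
    ]
    # Sum the tail probabilities in a numerically stable way.
    m = max(log_terms)
    return math.exp(m) * sum(math.exp(t - m) for t in log_terms)


# Counts quoted in the caption for EP300 (SRX100544).
overlap_hep, n_hep = 80, 286          # hepatocyte-specific enhancers
overlap_other, n_other = 1147, 20509  # other FANTOM5 enhancers

fold = (overlap_hep / n_hep) / (overlap_other / n_other)
p_value = fisher_exact_greater(
    overlap_hep, n_hep - overlap_hep,
    overlap_other, n_other - overlap_other,
)

print(f"fold enrichment = {fold:.2f}")  # prints "fold enrichment = 5.00"
print(f"-log10(P) = {-math.log10(p_value):.1f}")
```

The fold enrichment is the ratio of the two overlap fractions, (80/286) / (1,147/20,509), which reproduces the value of 5.00 given in the caption.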

**Figure EV4. Results of enrichment analysis for tissue‐specific enhancers**
The results of enrichment analysis for FANTOM5 tissue‐specific enhancers are sorted according to the minimum P‐value (P_min) for each facet. The bar charts indicate P‐value (horizontal axis) for the top 50 enriched ChIP‐seq experiments (vertical axis), with the colors denoting cell type classes according to the color palette (bottom right).

**Figure EV5. Results of enrichment analysis with other tools**
A, B
Enrichment analysis for hepatocyte enhancers was performed with LOLAweb (A), for which other FANTOM5 enhancers were used as background, and with “Annotation Tool” of ReMap (B), for which only random regions were available as background. The top 15 ChIP‐seq experiments showing statistical significance are shown as bar charts.
