Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 May 13;10(1):7933.
doi: 10.1038/s41598-020-64655-4.

CNN-Peaks: ChIP-Seq peak detection pipeline using convolutional neural networks that imitate human visual inspection

Affiliations

CNN-Peaks: ChIP-Seq peak detection pipeline using convolutional neural networks that imitate human visual inspection

Dongpin Oh et al. Sci Rep. .

Abstract

ChIP-seq is one of the core experimental resources available to understand genome-wide epigenetic interactions and identify the functional elements associated with diseases. The analysis of ChIP-seq data is important but poses a difficult computational challenge, due to the presence of irregular noise and bias on various levels. Although many peak-calling methods have been developed, the current computational tools still require, in some cases, human manual inspection using data visualization. However, the huge volumes of ChIP-seq data make it almost impossible for human researchers to manually uncover all the peaks. Recently developed convolutional neural networks (CNN), which are capable of achieving human-like classification accuracy, can be applied to this challenging problem. In this study, we design a novel supervised learning approach for identifying ChIP-seq peaks using CNNs, and integrate it into a software pipeline called CNN-Peaks. We use data labeled by human researchers who annotate the presence or absence of peaks in some genomic segments, as training data for our model. The trained model is then applied to predict peaks in previously unseen genomic segments from multiple ChIP-seq datasets including benchmark datasets commonly used for validation of peak calling methods. We observe a performance superior to that of previous methods.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Figure 1
Visualization of labeling peaks using ChIP-seq read mapping data. For some genomic regions, human researchers can determine which regions are peaks, or not, and label them using an interface tool included in our CNN-Peaks package.
Figure 2
Figure 2
The schema of CNN-peaks model. (A) There are three Inception-like modules: Concat-A, Concat-B, and Concat-C. Each yellow box represents a convolution operation with filter size 1 * N, and black boxes concatenate filters, which are constructed by combining several convolution and pooling outputs. (B) Our model learns optimal threshold values for calling peaks in local genomic segments of ChIP-seq data and determines the presence or absence of peaks using a subtraction of the threshold from each input signal value, and a sigmoid operation. Blue arrows indicate residual connections between Inception modules, and a purple arrow an operation for expanding output vectors.
Figure 3
Figure 3
The process of peak calling with a trained model. The black signal is the read mapping depth in the ChIP-seq input data, and the blue boxes below the signal indicate the presence of genes in RefSeq annotation. An orange box is a window with both read mapping signal and RefSeq annotation in a genomic region. Peaks (in orange underlay) in the window are predicted using the model trained by CNN-Peaks, and generated in BED format.
Figure 4
Figure 4
Overview of the CNN-Peaks pipeline. Each parallelogram represents data, a rectangular box an individual module, a rounded rectangle a model trained by CNN-Peaks, and arrows data flow.
Figure 5
Figure 5
Performance comparison of CNN-Peaks to major ChIP-seq peak calling tools using our labeled testing datasets for (A) H3K27ac3 histone modification of GM12878 cell line, and (B) H3K4me3 histone modification of K562 cell line. Each blue bar represents sensitivity, and the orange bars specificity. Purple bars show the F1 scores of each peak calling software.
Figure 6
Figure 6
Relative distances between histone modification ChIP-Seq peak calling results and TSS. (A) The ideal distribution of relative distances between highly correlated intervals versus the distribution in intervals with no correlation. (B) Relative distances between peak intervals and TSS in H3K4me3 data for the HepG2 cell line. (C) Relative distances between peak intervals and TSS in H3K9ac data for the GM12878 cell line.
Figure 7
Figure 7
Peak calling performance evaluation using ChIP-seq benchmark data manually curated for SRF target with GM12878 cell line (top) and NRSF target with K562 cell line (bottom). The benchmark dataset includes 6 different types of datasets for the SRF target and NRSF target respectively. These datasets were generated using different methods as described in [16]. Each color represents a different dataset.
Figure 8
Figure 8
(A) Fraction of top N peaks within BRCA1 motif from the GM12878 cell line, (B) within CHD2 motif from K562 cell line and (C) within CTCF motif from the HepG2 cell line. Top N peaks were determined by scoring peaks called by each software. Each plot shows motif enrichment from transcription factor ChIP-seq data for BRCA1, CHD2, and CTCF respectively.

References

    1. Landt SG, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research. 2012;22:1813–1831. doi: 10.1101/gr.136184.111. - DOI - PMC - PubMed
    1. Fuery TS. ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nature Reviews Genetics. 2012;13:840–52. doi: 10.1038/nrg3306. - DOI - PMC - PubMed
    1. Valouev A, et al. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nature Methods. 2008;5:829–34. doi: 10.1038/nmeth.1246. - DOI - PMC - PubMed
    1. Zang C, et al. A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009;25:1952–1958. doi: 10.1093/bioinformatics/btp340. - DOI - PMC - PubMed
    1. Greer EL, Shi Y. Histone methylation: a dynamic mark in health, disease and inheritance. Nature Reviews Genetics. 2012;13:343–57. doi: 10.1038/nrg3173. - DOI - PMC - PubMed

Publication types