PLoS Comput Biol. 2017 Feb 24;13(2):e1005403.
doi: 10.1371/journal.pcbi.1005403. eCollection 2017 Feb.

Imputation for transcription factor binding predictions based on deep learning

Qian Qin et al. PLoS Comput Biol.

Abstract

Understanding the cell-specific binding patterns of transcription factors (TFs) is fundamental to studying gene regulatory networks in biological systems, for which ChIP-seq not only provides valuable data but is also considered the gold standard. Despite tremendous efforts from the scientific community to conduct TF ChIP-seq experiments, the available data represent only a limited percentage of ChIP-seq experiments, considering all possible combinations of TFs and cell lines. In this study, we demonstrate a method for accurately predicting cell-specific TF binding for TF-cell line combinations based on only a small fraction (4%) of the combinations using available ChIP-seq data. The proposed model, termed TFImpute, is based on a deep neural network with a multi-task learning setting to borrow information across transcription factors and cell lines. Compared with existing methods, TFImpute achieves comparable accuracy on TF-cell line combinations with ChIP-seq data; moreover, TFImpute achieves better accuracy on TF-cell line combinations without ChIP-seq data. This approach can predict cell-line-specific enhancer activities in K562 and HepG2 cell lines, as measured by massively parallel reporter assays, and can predict the impact of SNPs on TF binding.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. The TFImpute model.
Each input is a TF-cell-sequence triple. In the convolution layer, each filter (motif) corresponds to a column. Each filter scans the input sequence and produces one value at each stop. For each filter, the max-pooling layer partitions the signal into three windows and takes the maximum value in each window to obtain three values. The same gate signal operates on the three values, and the gate signal is different for different filters. For each input, the reverse complement of the input sequence together with the TF and cell line is constructed and used as another input for the same network. Therefore, for each input, we obtain two values, P1 and P2, for the forward and reverse strands of the sequence. The maximum of P1 and P2 is taken as the final prediction. During training, the prediction is compared with the target, and the error is back-propagated to learn the parameters of the whole network.
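The forward pass described in this caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the caption does not say how the gate signal is derived from the TF and cell line, so the sketch takes a per-filter `gate` vector as given, and the final linear layer with a sigmoid (`weights`, `bias`) is an assumption added only to produce a single score per strand. All function names are hypothetical.

```python
import numpy as np

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x

def conv_scan(x, filters):
    """Slide each filter along the sequence: one value per filter per stop.

    filters has shape (n_filters, width, 4); output is (n_filters, n_stops).
    """
    n_filters, width, _ = filters.shape
    n_stops = x.shape[0] - width + 1
    out = np.empty((n_filters, n_stops))
    for p in range(n_stops):
        out[:, p] = np.tensordot(filters, x[p:p + width], axes=([1, 2], [0, 1]))
    return out

def three_window_max(signal):
    """Partition the stops into three windows and keep the max of each."""
    windows = np.array_split(signal, 3, axis=1)
    return np.stack([w.max(axis=1) for w in windows], axis=1)

def strand_score(seq, filters, gate, weights, bias):
    """Score one strand. gate has one entry per filter (assumed given)."""
    pooled = three_window_max(conv_scan(one_hot(seq), filters))
    gated = pooled * gate[:, None]          # same gate on all three values
    logit = np.sum(gated * weights) + bias  # assumed output layer
    return 1.0 / (1.0 + np.exp(-logit))

def predict(seq, filters, gate, weights, bias):
    """Final prediction: max of the forward (P1) and reverse (P2) scores."""
    p1 = strand_score(seq, filters, gate, weights, bias)
    p2 = strand_score(reverse_complement(seq), filters, gate, weights, bias)
    return max(p1, p2)
```

Because the prediction is the maximum over both strands, `predict` returns the same value for a sequence and for its reverse complement. The sketch assumes the sequence is long enough that each filter makes at least three stops.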
Fig 2. AUC comparison of TFImpute with DeepBind and gkm-SVM using shuffled sequences as negative instances.
(A) Comparison with DeepBind. Each point in the figure corresponds to a TF-cell line combination. (B) AUC for TF-cell line combinations in which DeepBind gives the lowest AUC. (C) Comparison with gkm-SVM using randomly shuffled sequences as negative instances. Each point in the figure corresponds to a TF-cell line combination.
Fig 3. Comparison with gkm-SVM, PIQ, and DeepSEA.
(A) AUC comparison of TFImpute and gkm-SVM on TestSet1, TestSet2, and TestSet3. ‘Shuf cell line’ indicates that the cell line of the corresponding test set was shuffled and that the trained TFImpute model was then applied to the shuffled dataset. Similarly, ‘Shuf TF’ indicates that the TFs were shuffled. For some of the given regions, PIQ gives NA predictions. NA means that there is no motif based on a log probability threshold of 5, or that the region lacks DNase I signal. PIQNoNA in this figure denotes the result after removing all NAs, and PIQ denotes the result after treating NAs as no binding. To calculate the AUC, the predictions were grouped by TFs. The middle bar in each box indicates the median. (B) AUC comparison based on predictions grouped by TF-cell line combinations. (C) The recall rates of different methods at FDR 0.05 (see Materials and Methods for more details). The predictions were grouped by TFs. (D) AUC comparison of TFImpute on TFs appearing in both TestSet2 and TestSet3. (E) Hierarchical clustering of a subset of the TFs based on the embedding learned by TFImpute. The full clustering is shown in S3 Fig. (F) Hierarchical clustering of a subset of cell lines based on the embedding learned by TFImpute. The full clustering is shown in S4 Fig. (G) The recall rate of TFImpute and DeepSEA at different FDR cutoffs on the datasets provided by DeepSEA.
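As a rough illustration of the "recall at FDR 0.05" metric in panels (C) and (G), the sketch below ranks predictions by score and reports the highest recall reachable while the false discovery rate stays at or below the cutoff. The exact procedure is given in the paper's Materials and Methods, which is not reproduced here, so this definition is an assumption; `recall_at_fdr` is a hypothetical helper, and tied scores are not treated specially.

```python
import numpy as np

def recall_at_fdr(labels, scores, fdr_cutoff=0.05):
    """Best recall among score thresholds whose FDR is <= fdr_cutoff.

    labels: 1 for bound, 0 for unbound. scores: higher means more confident.
    Assumed definition: FDR = FP / (TP + FP) over the top-ranked predictions.
    """
    labels = np.asarray(labels, dtype=int)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # rank from highest score down
    ranked = labels[order]
    tp = np.cumsum(ranked)        # true positives among the top k
    fp = np.cumsum(1 - ranked)    # false positives among the top k
    fdr = fp / (tp + fp)
    recall = tp / max(labels.sum(), 1)
    allowed = fdr <= fdr_cutoff
    return float(recall[allowed].max()) if allowed.any() else 0.0
```

To mirror the grouping in the caption, one would call this once per TF (or per TF-cell line combination) on that group's predictions and then summarize the per-group values.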
Fig 4. The distributions of the calculated enhancer signature for the top and bottom 100 enhancers.
The p value was calculated using a t-test. We emphasize that enhancer reporter assay data are lacking for GM12878, which serves as a good control.
Fig 5. Predicted binding affinity change between two alleles of SNP rs12740374 (T/G).
The color in each cell represents the predicted binding affinity of allele G minus that of allele T for the corresponding TF and cell line. The number in each cell of the heatmap is the number of ChIP-seq datasets in the training set for the corresponding TF and cell line. If TFImpute predicted strong binding in the minor allele but no binding in the major allele, the score was 1. If TFImpute predicted no binding difference between the two alleles, the score was 0.
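The score shown in each heatmap cell is a difference of two predictions, which the following sketch makes concrete. It assumes a generic `predict_fn` that maps a sequence to a binding score between 0 and 1; the toy predictor and the flanking sequences are invented for illustration and are not the real context of rs12740374 or the TFImpute model.

```python
def allele_difference(predict_fn, left_flank, right_flank, allele_a, allele_b):
    """Predicted binding of allele_a minus that of allele_b in the same context."""
    score_a = predict_fn(left_flank + allele_a + right_flank)
    score_b = predict_fn(left_flank + allele_b + right_flank)
    return score_a - score_b

def toy_predict(seq):
    """Hypothetical stand-in for a trained model: 1.0 if a made-up motif occurs."""
    return 1.0 if "CGTG" in seq else 0.0

# Allele G completes the made-up motif, allele T does not: score is 1.
creates_site = allele_difference(toy_predict, "AAC", "TGAA", "G", "T")

# Neither allele yields the motif: no predicted difference, score is 0.
no_change = allele_difference(toy_predict, "AAA", "AAA", "G", "T")
```

In the figure, one such difference would be computed for every TF and cell line pair, with the TF and cell line supplied to the model alongside the sequence.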
