PLoS Comput Biol. 2017 Feb 24;13(2):e1005403.

doi: 10.1371/journal.pcbi.1005403. eCollection 2017 Feb.

Imputation for transcription factor binding predictions based on deep learning

Qian Qin¹, Jianxing Feng¹

PMID: 28234893
PMCID: PMC5345877
DOI: 10.1371/journal.pcbi.1005403

Qian Qin et al. PLoS Comput Biol. 2017.

Affiliation

¹ Department of Bioinformatics, School of Life Sciences and Technology, Tongji University, Shanghai, China.

Abstract

Understanding the cell-specific binding patterns of transcription factors (TFs) is fundamental to studying gene regulatory networks in biological systems, for which ChIP-seq not only provides valuable data but is also considered the gold standard. Despite tremendous efforts from the scientific community to conduct TF ChIP-seq experiments, the available data cover only a limited percentage of all possible combinations of TFs and cell lines. In this study, we demonstrate a method for accurately predicting cell-specific TF binding for TF-cell line combinations using available ChIP-seq data for only a small fraction (4%) of the combinations. The proposed model, termed TFImpute, is based on a deep neural network with a multi-task learning setting to borrow information across transcription factors and cell lines. Compared with existing methods, TFImpute achieves comparable accuracy on TF-cell line combinations with ChIP-seq data; moreover, TFImpute achieves better accuracy on TF-cell line combinations without ChIP-seq data. This approach can predict cell line-specific enhancer activities in K562 and HepG2 cell lines, as measured by massively parallel reporter assays, and can predict the impact of SNPs on TF binding.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. The TFImpute model.**
Each input is a TF-cell-sequence triple. In the convolution layer, each filter (motif) corresponds to a column. Each filter scans the input sequence and produces one value at each stop. For each filter, the max-pooling layer partitions the signal into three windows and takes the maximum value in each window to obtain three values. The same gate signal operates on the three values, and the gate signal is different for different filters. For each input, the reverse complement of the input sequence, together with the same TF and cell line, is constructed and used as another input for the same network. Therefore, each input yields two values, P1 and P2, for the forward and reverse strands of the sequence. The maximum of P1 and P2 is taken as the final prediction. During training, the prediction is compared with the target, and the error is back-propagated to learn the parameters of the whole network.
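The forward pass described in this caption can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the caption does not say how the gate signal is computed, so the sketch assumes it is a sigmoid of a linear map of concatenated TF and cell line embeddings. The parameter names (`filters`, `tf_emb`, `cell_emb`, `gate_w`, `out_w`, `out_b`), the sigmoid output layer, the layer sizes, and the random weights are all hypothetical.

```python
import numpy as np

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def strand_score(seq, params, tf_id, cell_id):
    """Score one strand: convolution, 3-window max pooling, gating, output."""
    x = one_hot(seq)
    filters = params["filters"]                    # (n_filters, width, 4)
    width = filters.shape[1]
    n_stops = x.shape[0] - width + 1
    # Each filter produces one value at each stop along the sequence.
    conv = np.array([[np.sum(x[i:i + width] * f) for i in range(n_stops)]
                     for f in filters])            # (n_filters, n_stops)
    # Partition the stops into three windows and keep the maximum of each.
    windows = np.array_split(conv, 3, axis=1)
    pooled = np.stack([w.max(axis=1) for w in windows], axis=1)  # (n_filters, 3)
    # ASSUMPTION: one gate value per filter, derived from TF and cell embeddings.
    context = np.concatenate([params["tf_emb"][tf_id],
                              params["cell_emb"][cell_id]])
    gate = sigmoid(params["gate_w"] @ context)     # (n_filters,)
    gated = pooled * gate[:, None]                 # same gate on all three values
    # ASSUMPTION: a single sigmoid output unit over the gated values.
    return sigmoid(np.sum(gated * params["out_w"]) + params["out_b"])

def predict(seq, params, tf_id, cell_id):
    """Final prediction: the maximum of the forward (P1) and reverse (P2) scores."""
    p1 = strand_score(seq, params, tf_id, cell_id)
    p2 = strand_score(reverse_complement(seq), params, tf_id, cell_id)
    return max(p1, p2)

# Hypothetical, randomly initialised parameters for demonstration only.
rng = np.random.default_rng(0)
n_filters, width, emb = 4, 5, 3
params = {
    "filters": rng.normal(size=(n_filters, width, 4)),
    "tf_emb": rng.normal(size=(2, emb)),
    "cell_emb": rng.normal(size=(2, emb)),
    "gate_w": rng.normal(size=(n_filters, 2 * emb)),
    "out_w": rng.normal(size=(n_filters, 3)),
    "out_b": 0.0,
}
seq = "ACGTTGCAAGCTTAGGCTAA"
p = predict(seq, params, tf_id=0, cell_id=1)
```

Because the final value is the maximum over both strands, `predict` returns the same score for a sequence and its reverse complement, which is the property the two-strand construction is meant to provide.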

**Fig 2. AUC comparison of TFImpute with DeepBind and gkm-SVM using shuffled sequences as negative instances.**
(A) Comparison with DeepBind. Each point in the figure corresponds to a TF-cell line combination. (B) AUC for TF-cell line combinations in which DeepBind gives the lowest AUC. (C) Comparison with gkm-SVM using randomly shuffled sequences as negative instances. Each point in the figure corresponds to a TF-cell line combination.

**Fig 3. Comparison with gkm-SVM, PIQ, and DeepSEA.**
(A) AUC comparison of TFImpute and gkm-SVM on TestSet1, TestSet2, and TestSet3. ‘Shuf cell line’ indicates that the cell lines of the corresponding test set were shuffled and that the trained TFImpute model was then applied to the shuffled dataset. Similarly, ‘Shuf TF’ indicates that the TFs were shuffled. For some of the given regions, PIQ gives NA predictions. NA means that no motif passes the log probability threshold of 5 or that the region lacks DNase I signal. PIQNoNA in this figure denotes the result after removing all NAs, and PIQ denotes the result after treating NAs as no binding. To calculate the AUC, the predictions were grouped by TFs. The middle bar in each box indicates the median. (B) AUC comparison based on predictions grouped by TF-cell line combinations. (C) The recall rates of different methods at FDR 0.05 (see Material and methods for more details). The predictions were grouped by TFs. (D) AUC comparison of TFImpute on TFs appearing in both TestSet2 and TestSet3. (E) Hierarchical clustering of a subset of the TFs based on the embedding learned by TFImpute. The full clustering is shown in S3 Fig. (F) Hierarchical clustering of a subset of cell lines based on the embedding learned by TFImpute. The full clustering is shown in S4 Fig. (G) The recall rate of TFImpute and DeepSEA at different FDR cutoffs on the datasets provided by DeepSEA.
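The two metrics named in this caption, AUC computed per group and recall at a fixed FDR, can be sketched as below. The exact procedure is given in the paper's Material and methods, which is not reproduced here, so this uses common textbook definitions as an assumption: AUC from the rank-sum statistic, and recall at the deepest score cutoff whose false discovery proportion stays within the limit. The function names and the `(group, label, score)` record layout are hypothetical.

```python
import numpy as np
from collections import defaultdict

def auc(labels, scores):
    """ROC AUC from the rank-sum statistic, using average ranks for ties."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank
        i = j + 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return float((ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0)
                 / (n_pos * n_neg))

def recall_at_fdr(labels, scores, fdr=0.05):
    """Largest recall over score cutoffs whose FP / (TP + FP) is at most fdr."""
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="mergesort")
    hits = labels[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(~hits)
    ok = fp / (tp + fp) <= fdr
    return float(tp[ok].max() / labels.sum()) if ok.any() else 0.0

def grouped_auc(records):
    """AUC per group, where records are (group, label, score) tuples.

    The group key can be a TF, or a (TF, cell line) pair.
    """
    groups = defaultdict(lambda: ([], []))
    for group, label, score in records:
        groups[group][0].append(label)
        groups[group][1].append(score)
    return {g: auc(y, s) for g, (y, s) in groups.items()}
```

Grouping by TF versus by TF-cell line combination only changes the key passed in each record; the per-group values can then be summarised in a box plot as in panels A and B.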

**Fig 4. The distributions of the calculated enhancer signature for the top and bottom 100 enhancers.**
The p value was calculated using a t-test. Note that no enhancer reporter assay data are available for GM12878, which therefore serves as a good control.

**Fig 5. Predicted binding affinity change between two alleles of SNP rs12740374 (T/G).**
The color in each cell represents the predicted binding affinity of allele G minus that of allele T for the corresponding TF and cell line. The number in each cell of the heatmap is the number of ChIP-seq datasets in the training set for the corresponding TF and cell line. If TFImpute predicted strong binding in the minor allele but no binding in the major allele, the score was 1. If TFImpute predicted no binding difference between the two alleles, the score was 0.
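The score in this heatmap is a difference of two predictions, which can be sketched as follows. This is an illustration only: `toy_predict` is a hypothetical stand-in for a trained model and is not TFImpute, and the flanking sequences and the motif `CGA` are made up for the example rather than taken from the genomic context of rs12740374.

```python
def allele_sequences(left_flank, right_flank, alleles=("T", "G")):
    """Build one sequence per allele by placing it between fixed flanks."""
    return {a: left_flank + a + right_flank for a in alleles}

def allele_difference(predict, tf, cell_line, left_flank, right_flank,
                      ref="T", alt="G"):
    """Predicted binding for the alt allele minus that for the ref allele."""
    seqs = allele_sequences(left_flank, right_flank, alleles=(ref, alt))
    return predict(tf, cell_line, seqs[alt]) - predict(tf, cell_line, seqs[ref])

def toy_predict(tf, cell_line, seq):
    """HYPOTHETICAL predictor: 1.0 if a made-up motif is present, else 0.0."""
    return 1.0 if "CGA" in seq else 0.0

# The G allele creates the made-up motif, so the difference is at its maximum.
gain = allele_difference(toy_predict, "TF_A", "cell_1", "AAC", "AAA")

# Neither allele creates the motif here, so there is no predicted difference.
none = allele_difference(toy_predict, "TF_A", "cell_1", "AAA", "TTT")
```

Repeating the call over every TF and cell line pair gives one cell of the heatmap per pair, with values of 1 and 0 matching the two cases described in the caption.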

Cited by

Opportunities and obstacles for deep learning in biology and medicine.
Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, Xie W, Rosen GL, Lengerich BJ, Israeli J, Lanchantin J, Woloszynek S, Carpenter AE, Shrikumar A, Xu J, Cofer EM, Lavender CA, Turaga SC, Alexandari AM, Lu Z, Harris DJ, DeCaprio D, Qi Y, Kundaje A, Peng Y, Wiley LK, Segler MHS, Boca SM, Swamidass SJ, Huang A, Gitter A, Greene CS. J R Soc Interface. 2018 Apr;15(141):20170387. doi: 10.1098/rsif.2017.0387. PMID: 29618526. Review.
DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data.
Arango-Argoty G, Garner E, Pruden A, Heath LS, Vikesland P, Zhang L. Microbiome. 2018 Feb 1;6(1):23. doi: 10.1186/s40168-018-0401-z. PMID: 29391044.
Landscape of transcriptional deregulation in lung cancer.
Zhang S, Li M, Ji H, Fang Z. BMC Genomics. 2018 Jun 5;19(1):435. doi: 10.1186/s12864-018-4828-1. PMID: 29866045.
Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models.
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Int J Mol Sci. 2023 Nov 1;24(21):15858. doi: 10.3390/ijms242115858. PMID: 37958843. Review.
DeepD2V: A Novel Deep Learning-Based Framework for Predicting Transcription Factor Binding Sites from Combined DNA Sequence.
Deng L, Wu H, Liu X, Liu H. Int J Mol Sci. 2021 May 24;22(11):5521. doi: 10.3390/ijms22115521. PMID: 34073774.
