Brief Bioinform. 2024 Sep 23;25(6):bbae489.
doi: 10.1093/bib/bbae489.

MLSNet: a deep learning model for predicting transcription factor binding sites

Yuchuan Zhang et al. Brief Bioinform. 2024.

Abstract

Accurate prediction of transcription factor binding sites (TFBSs) is essential for understanding gene regulation mechanisms and the etiology of diseases. Despite numerous advances in deep learning methods for predicting TFBSs, the performance of these methods can still be improved. In this study, we propose MLSNet, a novel deep learning architecture designed specifically to predict TFBSs. MLSNet integrates multisize convolutional fusion with long short-term memory (LSTM) networks to capture DNA-sparse higher-order sequence features. Further, MLSNet incorporates super token attention and bidirectional LSTM (Bi-LSTM) to extract and integrate higher-order DNA shape features. Experimental results on 165 ChIP-seq (chromatin immunoprecipitation followed by sequencing) datasets indicate that MLSNet consistently outperforms several state-of-the-art algorithms in the prediction of TFBSs. Specifically, MLSNet achieves an average accuracy (ACC) of 0.8306, an area under the receiver operating characteristic curve (AUROC) of 0.8992, and an area under the precision-recall curve (AUPRC) of 0.9035, surpassing the second-best methods by 1.82%, 1.68%, and 1.54%, respectively. These results demonstrate the effectiveness of combining multisize convolutional layers with LSTM and DNA shape-based features in improving predictive accuracy. Moreover, this study comprehensively assesses the variability in model performance across different cell lines and transcription factors. The source code of MLSNet is available at https://github.com/minghaidea/MLSNet.

Keywords: DNA sequence; DNA shape; multisize convolutional fusion; super token attention and Bi-LSTM; transcription factor binding sites.
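
The three metrics reported in the abstract (ACC, AUROC, AUPRC) can be illustrated with a minimal, dependency-free sketch. This is not code from the MLSNet repository: the function names, the 0.5 decision threshold, the toy labels and scores, and the choice of average precision as the AUPRC estimator are all assumptions made for illustration.

```python
# Illustrative metric definitions; not taken from the MLSNet source code.

def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label.
    The 0.5 threshold is an assumption."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.
    Ties between a positive and a negative count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(labels, scores):
    """Average precision: mean of the precision measured at each positive
    in the score-ranked list (one common AUPRC estimator; tied scores are
    not given special treatment in this sketch)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            precision_sum += true_pos / rank
    return precision_sum / true_pos


# Hypothetical toy data: 1 = bound sequence, 0 = unbound sequence.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(round(accuracy(labels, scores), 4))
print(round(auroc(labels, scores), 4))
print(round(auprc(labels, scores), 4))
```

In a benchmark such as the one described above, these values would be computed per dataset and then averaged across datasets; library implementations (for example in scikit-learn) are usually preferable to hand-written ones in practice.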

Figures

Figure 1
Overview of the MLSNet workflow. (A) Data preprocessing: preparation of the sequence data and shape data. (B) Deep learning framework: a sequence data processing flow (integrating multiscale convolutional fusion with LSTM), a shape data processing flow (employing super token attention and Bi-LSTM), and the output module. Note: “conv” means “convolution”.
Figure 2
Performance comparison of MLSNet and variant models on 165 ChIP-seq datasets (MLSNet-1: without multisize convolutional fusion with LSTM; MLSNet-2: without supplemental shape data). (A) ACC. (B) ROC-AUC. (C) PR-AUC. (D) Average ACC, ROC-AUC, and PR-AUC.
Figure 3
Distribution of MLSNet results on ACC, ROC-AUC, and PR-AUC on 165 ChIP-seq datasets.
Figure 4
Performance comparison of MLSNet with competing models on selected cell lines and TFs. (A) ACC on selected cell lines. (B) ROC-AUC on selected cell lines. (C) PR-AUC on selected cell lines. (D) ACC on selected TFs. (E) ROC-AUC on selected TFs. (F) PR-AUC on selected TFs.
Figure 5
The heatmap of ACC results for MLSNet and other competing models, evaluated across all cell lines and transcription factors within 165 ChIP-seq datasets.
Figure 6
Comparative analysis and average results of MLSNet and competing models on 165 ChIP-seq datasets. (A) ACC. (B) ROC-AUC. (C) PR-AUC. (D) Average ACC. (E) Average ROC-AUC. (F) Average PR-AUC.
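
The sequence branch outlined in the Figure 1 caption (convolutions of several sizes whose outputs are fused before a recurrent layer) can be sketched in plain Python. This is a rough illustration under stated assumptions, not the authors' implementation: the one-hot encoding, the filter widths, the zero-padded "same" convolution, the ReLU activation, and the helper names `one_hot`, `motif_kernel`, `conv1d_same`, and `multisize_fusion` are all hypothetical, and the LSTM and shape branch are omitted.

```python
# Illustrative sketch only; names and design choices are assumptions,
# not taken from the MLSNet repository.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values; unknown bases (e.g. N) map to zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def motif_kernel(motif):
    """Hypothetical fixed filter that scores 1 per base matching the motif.
    A trained model would learn its filter weights instead."""
    return one_hot(motif)


def conv1d_same(encoded, kernel):
    """Zero-padded 1D convolution of one (width x 4) filter, followed by ReLU.
    The output has one value per sequence position."""
    width = len(kernel)
    left = (width - 1) // 2  # padding on the left side of the window
    length = len(encoded)
    out = []
    for pos in range(length):
        total = 0.0
        for i in range(width):
            idx = pos - left + i
            if 0 <= idx < length:  # positions outside the sequence act as zeros
                total += sum(k * x for k, x in zip(kernel[i], encoded[idx]))
        out.append(max(0.0, total))
    return out


def multisize_fusion(seq, kernels):
    """Apply filters of different widths and stack their responses per position.
    Returns a (positions x channels) feature map that a recurrent layer could consume."""
    encoded = one_hot(seq)
    channels = [conv1d_same(encoded, kernel) for kernel in kernels]
    return [list(column) for column in zip(*channels)]


# Three filters of widths 2, 4, and 3 (widths chosen arbitrarily for the example).
kernels = [motif_kernel("GC"), motif_kernel("TATA"), motif_kernel("GGT")]
fused = multisize_fusion("GGTATAGC", kernels)

for position, row in enumerate(fused):
    print(position, row)
```

Because every filter produces one value per position, filters of different widths can be concatenated channel-wise without losing positional order, which is what allows a sequence model such as an LSTM to be applied afterwards.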
