Curr Comput Aided Drug Des. 2025;21(5):599-608.
doi: 10.2174/0115734099277249240129114123.

WSHNN: A Weakly Supervised Hybrid Neural Network for the Identification of DNA-protein Binding Sites

Wenzheng Bao et al. Curr Comput Aided Drug Des. 2025.

Abstract

Introduction: Transcription factors are vital biological components that control gene expression, and their primary biological function is to recognize DNA sequences. Research has shown that the specificity of DNA-protein binding plays a significant role in gene expression, regulation, and especially gene therapy. Convolutional Neural Networks (CNNs) have become increasingly popular for predicting DNA-protein-specific binding sites, but their prediction accuracy still needs to be improved.

Methods: We propose WSHNN, a framework that combines Multi-Instance Learning (MIL) with a hybrid neural network. First, sliding windows split each DNA sequence into multiple overlapping instances, so that each sequence forms a bag containing multiple instances. The instances are then encoded with K-mer encoding. Afterward, a hybrid neural network scores each instance in the same bag separately.
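The sliding-window split and K-mer encoding described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the window length, the stride, the value of k, and the choice of a one-hot representation over the 4^k possible K-mers are all assumptions made for the example, and the windows cut from one sequence are treated as one bag.

```python
from itertools import product


def split_into_instances(seq, window, stride):
    """Slide a window over seq; the overlapping windows form one bag.

    window and stride are hypothetical parameters chosen for illustration.
    """
    if len(seq) < window:
        return [seq]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


def kmer_one_hot(instance, k):
    """Encode an instance as a (len - k + 1) x 4**k one-hot matrix.

    Row p marks which K-mer starts at position p. K-mers containing a
    symbol outside A/C/G/T (for example N) are left as all-zero rows.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    rows = []
    for pos in range(len(instance) - k + 1):
        row = [0] * (4 ** k)
        col = index.get(instance[pos:pos + k])
        if col is not None:
            row[col] = 1
        rows.append(row)
    return rows


sequence = "ACGTACGTAC"  # toy 10-nucleotide sequence
bag = split_into_instances(sequence, window=6, stride=2)
encoded_bag = [kmer_one_hot(instance, k=2) for instance in bag]
```

With these toy settings the bag holds three overlapping instances, and each instance becomes a 5 x 16 matrix that a downstream scoring network could consume.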

Results: Finally, a fully connected network produces the final prediction for that bag. The framework achieved 90.73% precision (Pre), 82.77% recall, 87.17% accuracy (Acc), an F1-score of 0.8657, and a Matthews correlation coefficient (MCC) of 0.7462. In addition, we discuss the performance of K-mer encoding. Compared with other state-of-the-art methods, the model achieves better performance using sequence information.
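For readers unfamiliar with the reported metrics, the sketch below computes precision, recall, accuracy, F1-score, and MCC from the four cells of a binary confusion matrix using their standard definitions. The function name and the example counts are hypothetical and are not taken from the paper; only the final check uses the precision and recall values reported above.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return standard binary classification metrics from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}


# Made-up counts, purely to show the call.
example = binary_metrics(tp=8, fp=2, tn=7, fn=3)

# Consistency check on the reported values: F1 is the harmonic mean of
# precision and recall, so 0.9073 and 0.8277 should give about 0.8657.
reported_f1 = 2 * 0.9073 * 0.8277 / (0.9073 + 0.8277)
```

The harmonic mean of the reported precision and recall rounds to 0.8657, which matches the reported F1-score.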

Conclusion: The experimental results indicate that Bi-directional Long Short-Term Memory (Bi-LSTM) can better capture long-range relationships within DNA sequences (the code and data are available at https://github.com/baowz12345/Weak_Super_Network).

Keywords: DNA-protein binding; bioinformatics; convolutional neural networks; multiple-instance learning; transcription factor binding site prediction; weakly supervised.
