Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2017 Aug 29;18(1):379.
doi: 10.1186/s12859-017-1792-8.

EL_PSSM-RT: DNA-binding residue prediction by integrating ensemble learning with PSSM Relation Transformation

Affiliations

EL_PSSM-RT: DNA-binding residue prediction by integrating ensemble learning with PSSM Relation Transformation

Jiyun Zhou et al. BMC Bioinformatics. .

Abstract

Background: Prediction of DNA-binding residue is important for understanding the protein-DNA recognition mechanism. Many computational methods have been proposed for the prediction, but most of them do not consider the relationships of evolutionary information between residues.

Results: In this paper, we first propose a novel residue encoding method, referred to as the Position Specific Score Matrix (PSSM) Relation Transformation (PSSM-RT), to encode residues by utilizing the relationships of evolutionary information between residues. PDNA-62 and PDNA-224 are used to evaluate PSSM-RT and two existing PSSM encoding methods by five-fold cross-validation. Performance evaluations indicate that PSSM-RT is more effective than previous methods. This validates the point that the relationship of evolutionary information between residues is indeed useful in DNA-binding residue prediction. An ensemble learning classifier (EL_PSSM-RT) is also proposed by combining ensemble learning model and PSSM-RT to better handle the imbalance between binding and non-binding residues in datasets. EL_PSSM-RT is evaluated by five-fold cross-validation using PDNA-62 and PDNA-224 as well as two independent datasets TS-72 and TS-61. Performance comparisons with existing predictors on the four datasets demonstrate that EL_PSSM-RT is the best-performing method among all the predicting methods with improvement between 0.02-0.07 for MCC, 4.18-21.47% for ST and 0.013-0.131 for AUC. Furthermore, we analyze the importance of the pair-relationships extracted by PSSM-RT and the results validates the usefulness of PSSM-RT for encoding DNA-binding residues.

Conclusions: We propose a novel prediction method for the prediction of DNA-binding residue with the inclusion of relationship of evolutionary information and ensemble learning. Performance evaluation shows that the relationship of evolutionary information between residues is indeed useful in DNA-binding residue prediction and ensemble learning can be used to address the data imbalance issue between binding and non-binding residues. A web service of EL_PSSM-RT ( http://hlt.hitsz.edu.cn:8080/PSSM-RT_SVM/ ) is provided for free access to the biological research community.

Keywords: DNA-binding residue; DNA-protein interaction; Ensemble learning; PSSM; Random forest; Relation transformation; SVM.

PubMed Disclaimer

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Fig. 1
The framework diagram of EL_PSSM-RT. EL_PSSM-RT contains 4 steps. The first step is to divide the non-binding residues in the training dataset into n subsets and to construct n new training datasets by combining the n subsets of non-binding residues and binding residues individually. The secondary step is to extract the three categories of features for all the residues. The third step is to train both SVM classifier and Random Forest classifier by each category of features on every training subset. The fourth step is to use a dynamic ranking and selecting method to select the based predictors with the largest diversity between each other to build the ensemble predictor
Fig. 2
Fig. 2
The compact of window size w on performance of PSSM-RT. The x-axis is the window size w and y-axis is the ST value of PSSM-RT
Fig. 3
Fig. 3
Comparison between different encoding methods. a The ROC curves of PSSM-RT, Ma et al.’s method and Li et al.’s method on PDNA-62. b The ROC curves of PSSM-RT, Ma et al.’s method and Li et al.’s method on PDNA-224
Fig. 4
Fig. 4
Comparison between different encoding methods when combining with sequence features and physicochemical features. a The ROC curves of PSSM-RT, Ma et al.’s method and Li et al.’s method on PDNA-62. b The ROC curves of PSSM-RT, Ma et al.’s method and Li et al.’s method on PDNA-224
Fig. 5
Fig. 5
Comparison between EL_PSSM-RT, SVM classifier and Random Forest classifier. a The ROC curves EL_PSSM-RT, SVM classifier and Random Forest classifier on PDNA-62. b The ROC curves EL_PSSM-RT, SVM classifier and Random Forest classifier on PDNA-224
Fig. 6
Fig. 6
The feature analysis results of PSSM-RT on PDNA-62. a The discriminant weights of the 400 features extracted from PSSM-RT. The x axis and y axis denote the 20 residue types. Every element denotes a specific pair-relationship. b 6 DNA-binding residues and its context residues extracted from the protein in 1u1q. The red residues are the binding residues and the yellow ones are the residues that can form important pair-relationship with it. The rest ones are the unimportant residues. The black polyline are the important pair-relationships
Fig. 7
Fig. 7
Actual residues and predicted residues of proteins in 1s40 and 1b3t. a The predicted binding residues on the protein in 1s40. b The actual binding residues on the protein in 1s40. c The predicted binding residues on the protein in 1b3t. d The actual binding residues on the protein in 1b3t
Fig. 8
Fig. 8
The homepage of the web service of EL_PSSM-RT. The web address of this webserver is http://hlt.hitsz.edu.cn:8080/PSSM-RT_SVM/. See the description in the server description for further explanation

Similar articles

Cited by

References

    1. Ofran Y, Mysore V, Rost B. Prediction of DNA-binding residues from sequence. Bioinformatics. 2007;23(13):i347–i353. doi: 10.1093/bioinformatics/btm174. - DOI - PubMed
    1. Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of protein–DNA complexes. Genome Biol. 2000;1(1):1–37. doi: 10.1186/gb-2000-1-1-reviews001. - DOI - PMC - PubMed
    1. Walter MC, Rattei T, Arnold R, Guldener U, Munsterkotter M, Nenova K, Kastenmuller G, Tischler P, Wolling A, Volz A, et al. PEDANT covers all complete RefSeq genomes. Nucleic Acids Res. 2009;37:D408–D411. doi: 10.1093/nar/gkn749. - DOI - PMC - PubMed
    1. Luscombe NM, Thornton JM. Protein-DNA interactions: amino acid conservation and the effects of mutations on binding specificity. J Mol Biol. 2002;320(5):991–1009. doi: 10.1016/S0022-2836(02)00571-5. - DOI - PubMed
    1. Bullock AN, Fersht AR. Rescuing the function of mutant p53. Nat Rev Cancer. 2001;1:68–76. doi: 10.1038/35094077. - DOI - PubMed