Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties
- PMID: 21342579
- PMCID: PMC3044304
- DOI: 10.1186/1471-2105-12-S1-S47
Abstract
Background: Existing methods for predicting DNA-binding proteins use features derived from physicochemical properties to design support vector machine (SVM) based classifiers. The selection of physicochemical properties and the determination of their corresponding feature vectors generally rely on known properties of the binding mechanism and on the experience of the designers. This poses a problem for designers: some distinct physicochemical properties have similar vectors representing the 20 amino acids, while some closely related physicochemical properties have dissimilar vectors.
Results: This study proposes a systematic approach (named Auto-IDPCPs) to automatically identify a set of physicochemical and biochemical properties in the AAindex database for designing SVM-based classifiers that predict and analyze DNA-binding domains/proteins. Auto-IDPCPs consists of 1) clustering 531 amino acid indices in AAindex into 20 clusters using a fuzzy c-means algorithm, 2) using IBCGA, an efficient genetic-algorithm-based optimization method, to select an informative feature set of size m to represent sequences, and 3) analyzing the selected features to identify related physicochemical properties that may affect the binding mechanism of DNA-binding domains/proteins. Auto-IDPCPs identified m = 22 features of properties belonging to five clusters for predicting DNA-binding domains, with a five-fold cross-validation accuracy of 87.12%, which compares favorably with the 86.62% accuracy of the existing method PSSM-400. For predicting DNA-binding sequences, an accuracy of 75.50% was obtained using m = 28 features, whereas PSSM-400 has an accuracy of 74.22%. On an independent test data set of DNA-binding domains, Auto-IDPCPs and PSSM-400 have accuracies of 80.73% and 82.81%, respectively. Typical physicochemical properties identified include hydrophobicity, secondary structure, charge, solvent accessibility, polarity, flexibility, normalized van der Waals volume, and pK (pK-C, pK-N, pK-COOH and pK-a(RCOOH)).
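The clustering step (step 1 above) can be illustrated with a minimal fuzzy c-means sketch in NumPy. This is not the authors' implementation: the 531 x 20 matrix below is random placeholder data standing in for the AAindex indices, and the function name `fuzzy_c_means`, the `fuzzifier` value of 2.0, the iteration limit, and the tolerance are all assumptions chosen for illustration.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, fuzzifier=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook fuzzy c-means; returns (centers, memberships).

    X has shape (n_samples, n_features). `fuzzifier` (> 1) controls how soft
    the memberships are; it is unrelated to the feature-set size m in the text.
    """
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]

    # Random initial memberships, normalized so each row sums to 1.
    U = rng.random((n_samples, n_clusters))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Cluster centers: membership-weighted means of the samples.
        W = U ** fuzzifier
        centers = (W.T @ X) / W.sum(axis=0)[:, None]

        # Euclidean distance of every sample to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero

        # Membership update: u_ij proportional to d_ij^(-2 / (fuzzifier - 1)).
        inv = dist ** (-2.0 / (fuzzifier - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)

        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break

    return centers, U


# Placeholder data: 531 "indices", each a vector of 20 amino acid values.
# Real use would load the AAindex entries instead (not done here).
rng = np.random.default_rng(42)
indices = rng.normal(size=(531, 20))

# Standardize each index so clustering compares shape rather than scale
# (an assumption; the abstract does not state the normalization used).
indices = (indices - indices.mean(axis=1, keepdims=True)) / indices.std(
    axis=1, keepdims=True
)

centers, memberships = fuzzy_c_means(indices, n_clusters=20)

# Hard assignment: each index goes to its highest-membership cluster.
labels = memberships.argmax(axis=1)
```

Each row of `memberships` sums to 1, so an index can partially belong to several clusters; taking the arg-max, as in `labels`, is one simple way to obtain a single cluster per index for the later feature analysis.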
Conclusions: The proposed approach Auto-IDPCPs can help designers investigate informative physicochemical and biochemical properties by considering prediction accuracy and analysis of the binding mechanism simultaneously. Auto-IDPCPs is also applicable to predicting and analyzing other protein functions from sequences.