BMC Bioinformatics. 2014 Sep 8;15(1):298.

doi: 10.1186/1471-2105-15-298.

nDNA-Prot: identification of DNA-binding proteins based on unbalanced classification

Li Song, Dapeng Li, Xiangxiang Zeng, Yunfeng Wu, Li Guo¹, Quan Zou

PMID: 25196432
PMCID: PMC4165999
DOI: 10.1186/1471-2105-15-298

Li Song et al. BMC Bioinformatics. 2014.

Affiliation

¹ School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China. gl8008@163.com.

Abstract

Background: DNA-binding proteins are vital for the study of cellular processes. In recent genome engineering studies, identifying proteins with specific functions has become increasingly important and must be done rapidly and efficiently. Several approaches have been developed in recent years to improve the identification of DNA-binding proteins, but the currently available resources remain insufficient to identify these proteins accurately. As a result, previous research has been limited by the relatively unbalanced accuracy rates and the low identification success of existing methods.

Results: In this paper, we explored the practicality of modelling DNA-binding protein identification with an ensemble classifier and designed a new predictor, nDNA-Prot. The framework comprises two stages: a 188-dimension feature extraction method to describe the protein structure, and an ensemble classifier designated imDC. Experiments on different datasets showed that our method identifies DNA-binding proteins more successfully than traditional methods. Features were selected using minimum Redundancy Maximum Relevance (mRMR). Cross-validation yielded an accuracy of 95.80% and an Area Under the Curve (AUC) of 0.986. On a test dataset, our method reached 86% accuracy, versus 76% for iDNA-Prot and 68% for DNA-Prot.
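Accuracy alone can look favourable on an unbalanced dataset, which is one reason to report AUC alongside it. As a minimal sketch (not the authors' evaluation code), the two metrics can be computed from labels and classifier scores as follows. The function names, the 0.5 threshold, and the toy data are hypothetical.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy unbalanced data (hypothetical): 2 positives, 8 negatives.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05]
always_negative = [0.0] * len(labels)  # a classifier that never predicts class 1

print(accuracy(labels, scores), auc(labels, scores))
print(accuracy(labels, always_negative), auc(labels, always_negative))
```

In this toy case the trivial always-negative scorer matches the accuracy of the informative scorer, while its AUC drops to chance level, which is the kind of gap that accuracy hides on unbalanced data.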

Conclusions: Our method can help to accurately identify DNA-binding proteins, and the web server is accessible at http://datamining.xmu.edu.cn/~songli/nDNA. In addition, we also predicted possible DNA-binding protein sequences in all of the sequences from the UniProtKB/Swiss-Prot database.

Figures

**Figure 1**
**Flow construction of the 188D feature extraction method.** For the loop body, the number of physicochemical properties is equivalent to the number of loops.
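The caption for Figure 1 describes a loop that runs once per physicochemical property. The abstract does not spell out the descriptors, so the following is only a sketch under assumptions: each property is assumed to split the 20 amino acids into three groups, and each loop pass is assumed to produce 21 composition, transition, and distribution values, appended to a 20-value amino acid composition. Under those assumptions, eight properties would give 20 + 8 × 21 = 188 dimensions. The single hydrophobicity grouping and all function names below are illustrative, not taken from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical property table: each property splits residues into three groups.
# Only one illustrative grouping is listed; a full table would hold one per property.
PROPERTIES = {
    "hydrophobicity": ("RKEDQN", "GASTPHY", "CLVIMFW"),
}


def composition_20(seq):
    """Frequency of each of the 20 standard amino acids."""
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def ctd_for_property(seq, groups):
    """Composition (3), transition (3) and distribution (15) values for one property."""
    lookup = {aa: g for g, members in enumerate(groups) for aa in members}
    encoded = [lookup[aa] for aa in seq if aa in lookup]
    n = len(encoded)
    if n == 0:
        return [0.0] * 21
    comp = [encoded.count(g) / n for g in range(3)]

    # Transition: share of adjacent residue pairs that switch between two groups.
    pairs = list(zip(encoded, encoded[1:]))
    trans = []
    for a, b in ((0, 1), (0, 2), (1, 2)):
        changes = sum(1 for x, y in pairs if {x, y} == {a, b})
        trans.append(changes / len(pairs) if pairs else 0.0)

    # Distribution: relative position of the first, 25%, 50%, 75% and last
    # occurrence of each group along the sequence.
    dist = []
    for g in range(3):
        positions = [i + 1 for i, x in enumerate(encoded) if x == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                dist.append(0.0)
                continue
            k = max(1, int(frac * len(positions)))
            dist.append(positions[k - 1] / n)
    return comp + trans + dist


def extract_features(seq):
    """20 composition values plus 21 values per property (one loop pass each)."""
    features = composition_20(seq)
    for groups in PROPERTIES.values():
        features.extend(ctd_for_property(seq, groups))
    return features
```

With the one-property table above, `extract_features` returns 41 values; the vector length grows by 21 for each grouping added to `PROPERTIES`.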

**Figure 2**
**Framework of the ensemble classifier imDC.** n represents the number of minority samples, and m stands for the number of majority samples. The loop body is run for iterNum times.
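The caption for Figure 2 mentions n minority samples, m majority samples, and a loop body that runs iterNum times. The internals of imDC are not given here, so the sketch below substitutes a generic technique, repeated random undersampling with majority voting, to illustrate how such a loop can handle unbalanced classes. The nearest-centroid base learner, the function names, and the seed parameter are all hypothetical stand-ins and should not be read as the imDC algorithm.

```python
import random


def train_centroid(samples):
    """Toy base learner: store the mean feature vector of each class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in samples if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def train_undersampled_ensemble(minority, majority, iter_num, seed=0):
    """Train iter_num base learners, each on all n minority samples plus
    n majority samples drawn at random without replacement (requires n <= m)."""
    rng = random.Random(seed)
    n = len(minority)
    models = []
    for _ in range(iter_num):
        subset = rng.sample(majority, n)  # balance the two classes
        models.append(train_centroid(minority + subset))
    return models


def predict_ensemble(models, x):
    """Majority vote over the base learners (ties go to the minority class 1)."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if 2 * votes >= len(models) else 0


# Toy data (hypothetical): label 1 is the minority class, label 0 the majority.
minority = [([1.0, 1.0], 1), ([0.9, 1.1], 1)]
majority = [([0.0, 0.0], 0), ([0.1, -0.1], 0), ([-0.2, 0.1], 0),
            ([0.0, 0.2], 0), ([0.2, 0.0], 0), ([-0.1, -0.1], 0)]
models = train_undersampled_ensemble(minority, majority, iter_num=5)
print(predict_ensemble(models, [1.0, 0.9]), predict_ensemble(models, [0.05, 0.0]))
```

Each base learner sees a balanced training set, so no single model is dominated by the majority class, while the ensemble as a whole still draws on many of the majority samples.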

**Figure 3**
**Comparison of the accuracy between the ensemble classifier imDC and the other classifiers using each of the thresholds.**

**Figure 4**
**Comparison of the F-measure of the ensemble classifier imDC and the other classifiers using the 0.4 threshold.**

**Figure 5**
**Comparison between balanced dataset in SVM and unbalanced dataset in imDC.**

**Figure 6**
**The accuracy of several feature extraction methods using different thresholds.**

**Figure 7**
**The accuracy of different datasets using the same ensemble classifier.**

**Figure 8**
**An indicator variation diagram of the different features after selection.**
