2014 Sep 8;15(1):298.
doi: 10.1186/1471-2105-15-298.

nDNA-Prot: identification of DNA-binding proteins based on unbalanced classification

Li Song et al. BMC Bioinformatics.

Abstract

Background: DNA-binding proteins are vital for the study of cellular processes. In recent genome engineering studies, identifying proteins with particular functions has become increasingly important and must be done rapidly and efficiently. Several approaches have been developed in previous years to improve the identification of DNA-binding proteins, but the currently available resources are insufficient to identify these proteins accurately. As a result, previous research has been limited by the relatively unbalanced accuracy and low identification success of current methods.

Results: In this paper, we explored the practicality of modelling DNA-binding identification with an ensemble classifier and designed a new predictor, nDNA-Prot. The framework comprises two stages: a 188-dimension feature extraction method to represent the protein structure, and an ensemble classifier designated imDC. Experiments on different datasets showed that our method identifies DNA-binding proteins more successfully than traditional methods. Features were selected using minimum Redundancy Maximum Relevance (mRMR). Cross-validation yielded an accuracy of 95.80% and an Area Under the Curve (AUC) of 0.986. On a test dataset, our method reached 86% accuracy, versus 76% for iDNA-Prot and 68% for DNA-Prot.
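
The abstract does not specify how imDC works internally. As a rough illustration of the general idea behind ensemble learning on unbalanced data, the Python sketch below trains several simple models, each on all n minority samples plus a random draw of n of the m majority samples, and combines them by majority vote. The nearest-centroid base learner, the function names (train_balanced_ensemble, predict), the default of 11 rounds, and the toy data are all assumptions made for illustration; this is not the authors' implementation.

```python
import random
from collections import Counter


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_balanced_ensemble(minority, majority, iter_num=11, seed=0):
    """Train iter_num nearest-centroid models on balanced subsamples.

    Each round keeps all n minority vectors and draws n of the m majority
    vectors without replacement, so this sketch requires n <= m.
    An odd iter_num avoids tied votes. (Hypothetical helper.)
    """
    rng = random.Random(seed)
    n = len(minority)
    pos_centroid = centroid(minority)
    models = []
    for _ in range(iter_num):
        sampled = rng.sample(majority, n)
        models.append((pos_centroid, centroid(sampled)))
    return models


def predict(models, x):
    """Majority vote over the models: 1 = minority class, 0 = majority class."""
    votes = Counter(
        1 if sq_dist(x, pos) < sq_dist(x, neg) else 0 for pos, neg in models
    )
    return votes.most_common(1)[0][0]


# Toy, made-up data: 5 minority vectors near (2, 2), 100 majority vectors near 0.
minority = [[2.0 + 0.1 * i, 2.0 - 0.1 * i] for i in range(5)]
majority = [[0.01 * i, -0.01 * i] for i in range(100)]
models = train_balanced_ensemble(minority, majority)
print(predict(models, [2.0, 2.0]), predict(models, [0.0, 0.0]))  # prints: 1 0
```

Because every round sees equal numbers of both classes, no single model is dominated by the majority class; the vote then averages out the randomness of the individual draws.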

Conclusions: Our method helps to identify DNA-binding proteins accurately, and the web server is accessible at http://datamining.xmu.edu.cn/~songli/nDNA. We also predicted possible DNA-binding proteins among all of the sequences in the UniProtKB/Swiss-Prot database.

Figures

Figure 1. Flow construction of the 188D feature extraction method. For the loop body, the number of physicochemical properties is equivalent to the number of loops.
Figure 2. Framework of the ensemble classifier imDC. n represents the number of minority samples, and m stands for the number of majority samples. The loop body is run for iterNum times.
Figure 3. Comparison of the accuracy between the ensemble classifier imDC and the other classifiers using each of the thresholds.
Figure 4. Comparison of the F-measure of the ensemble classifier imDC and the other classifiers using the 0.4 threshold.
Figure 5. Comparison between balanced dataset in SVM and unbalanced dataset in imDC.
Figure 6. The accuracy of several feature extraction methods using different thresholds.
Figure 7. The accuracy of different datasets using the same ensemble classifier.
Figure 8. An indicator variation diagram of the different features after selection.
