2009 Jul 7;10 Suppl 1(Suppl 1):S1.
doi: 10.1186/1471-2164-10-S1-S1.

Prediction of DNA-binding residues from protein sequence information using random forests

Liangjiang Wang et al. BMC Genomics. 2009.

Abstract

Background: Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for only a few hundred protein-DNA complexes. With the rapid accumulation of sequence data, accurately predicting DNA-binding residues directly from amino acid sequence data has become an important but challenging task.

Results: A new machine learning approach has been developed in this study for predicting DNA-binding residues from amino acid sequence data. The approach used both the labeled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were then used to train random forests (RFs), which can handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset, and the predicted DNA-binding residues were examined in the context of three-dimensional structures.

Conclusion: The results suggest that the RF-based approach predicts DNA-binding residues more accurately than the methods of previous studies. A new web server called BindN-RF (http://bioinfo.ggc.org/bindn-rf/) has thus been developed to make the RF classifier accessible to the biological research community.
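As a rough illustration of the feature construction described in the Results, the sketch below turns a PSSM into one feature vector per residue by flattening the PSSM rows in a window centred on that residue. The abstract does not state how the features were encoded, so the window size, the zero padding at the sequence ends, and the function names `window_features` and `encode_sequence` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def window_features(pssm: List[List[float]], center: int, window: int = 11) -> List[float]:
    """Flatten the PSSM rows in a window centred on one residue.

    Assumption: positions that fall off either end of the sequence are
    zero-padded. The window size is a hypothetical default.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    n_cols = len(pssm[0])
    half = window // 2
    features: List[float] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
        else:
            features.extend([0.0] * n_cols)  # padding outside the sequence
    return features


def encode_sequence(pssm: List[List[float]], window: int = 11) -> List[List[float]]:
    """Return one feature vector per residue of the sequence."""
    return [window_features(pssm, i, window) for i in range(len(pssm))]


# Toy PSSM with 4 residues and 3 columns (a real PSSM has 20 columns).
toy_pssm = [
    [1.0, 0.0, -1.0],
    [2.0, 1.0, 0.0],
    [0.0, -2.0, 3.0],
    [1.0, 1.0, 1.0],
]
vectors = encode_sequence(toy_pssm, window=3)
```

Each vector in `vectors` has `window * n_cols` entries and could be passed, together with a binding or non-binding label for the central residue, to any random forest implementation.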

Figures

Figure 1
Schematic diagram for extracting evolutionary information from the PSI-BLAST search result.
Figure 2
ROC curves to show the effect of evolutionary information. HKM represents the random forest classifier trained with the three biochemical features (H, K and M), and HKM+EI indicates the most accurate classifier using evolutionary information (PSSM, Hm, Hd, Km and Kd).
Figure 3
ROC curves of different classifiers for DNA-binding site prediction. The performance comparison is based on the PDC25t test dataset. The four different classifiers are BindN-RF (this study), BindN [10], DP-Bind [7,8] and DBS-PSSM [6].
Figure 4
Predicted DNA-binding residues shown in the context of three-dimensional structures. Putative DNA-binding residues were predicted for the bacterial transcriptional regulator QacR (PDB ID: 1JT0) using BindN-RF (A) and BindN (B). In each protein-DNA complex, true positives (correctly predicted DNA-binding residues) are in red spacefill; true negatives in green wireframe; false positives in yellow spacefill; false negatives in blue spacefill; and the DNA double helix in purple.
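The four residue categories used in Figure 4 can be computed from per-residue labels as in the minimal sketch below. The function names and the toy labels are hypothetical and are not taken from the paper; sensitivity and specificity are the standard definitions, TP / (TP + FN) and TN / (TN + FP).

```python
from collections import Counter
from typing import List, Sequence, Tuple


def classify_residues(actual: Sequence[bool], predicted: Sequence[bool]) -> List[str]:
    """Label each residue as TP, TN, FP or FN."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    labels = {
        (True, True): "TP",    # binding residue, predicted binding
        (False, False): "TN",  # non-binding residue, predicted non-binding
        (False, True): "FP",   # non-binding residue, predicted binding
        (True, False): "FN",   # binding residue, predicted non-binding
    }
    return [labels[(bool(a), bool(p))] for a, p in zip(actual, predicted)]


def sensitivity_specificity(categories: Sequence[str]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) from per-residue categories."""
    counts = Counter(categories)
    positives = counts["TP"] + counts["FN"]
    negatives = counts["TN"] + counts["FP"]
    sensitivity = counts["TP"] / positives if positives else float("nan")
    specificity = counts["TN"] / negatives if negatives else float("nan")
    return sensitivity, specificity


# Toy example with six residues (hypothetical labels).
actual = [True, True, False, False, False, True]
predicted = [True, False, False, True, False, True]
categories = classify_residues(actual, predicted)
sens, spec = sensitivity_specificity(categories)
```

Repeating this calculation while sweeping the classifier's decision threshold gives the (1 - specificity, sensitivity) pairs that make up an ROC curve such as those in Figures 2 and 3.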

References

    1. Ptashne M. Regulation of transcription: from lambda to eukaryotes. Trends Biochem Sci. 2005;30:275–279. doi: 10.1016/j.tibs.2005.04.003.
    2. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
    3. Sarai A, Kono H. Protein-DNA recognition patterns and predictions. Annu Rev Biophys Biomol Struct. 2005;34:379–398. doi: 10.1146/annurev.biophys.34.040204.144537.
    4. Ahmad S, Gromiha MM, Sarai A. Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics. 2004;20:477–486. doi: 10.1093/bioinformatics/btg432.
    5. Yan C, Terribilini M, Wu F, Jernigan RL, Dobbs D, Honavar V. Predicting DNA-binding sites of proteins from amino acid sequence. BMC Bioinformatics. 2006;7:262. doi: 10.1186/1471-2105-7-262.
