BMC Bioinformatics. 2006 May 19;7:262.
doi: 10.1186/1471-2105-7-262.

Predicting DNA-binding sites of proteins from amino acid sequence


Changhui Yan et al. BMC Bioinformatics.

Abstract

Background: Understanding the molecular details of protein-DNA interactions is critical for deciphering the mechanisms of gene regulation. We present a machine learning approach for the identification of amino acid residues involved in protein-DNA interactions.

Results: We start with a Naïve Bayes classifier trained to predict whether a given amino acid residue is a DNA-binding residue based on its identity and the identities of its sequence neighbors. The input to the classifier consists of the identities of the target residue and 4 sequence neighbors on each side of the target residue. The classifier is trained and evaluated (using leave-one-out cross-validation) on a non-redundant set of 171 proteins. Our results indicate the feasibility of identifying interface residues based on local sequence information. The classifier achieves 71% overall accuracy with a correlation coefficient of 0.24, 35% specificity and 53% sensitivity in identifying interface residues as evaluated by leave-one-out cross-validation. We show that the performance of the classifier is improved by using sequence entropy of the target residue (the entropy of the corresponding column in multiple alignment obtained by aligning the target sequence with its sequence homologs) as additional input. The classifier achieves 78% overall accuracy with a correlation coefficient of 0.28, 44% specificity and 41% sensitivity in identifying interface residues. Examination of the predictions in the context of 3-dimensional structures of proteins demonstrates the effectiveness of this method in identifying DNA-binding sites from sequence information. In 33% (56 out of 171) of the proteins, the classifier identifies the interaction sites by correctly recognizing at least half of the interface residues. In 87% (149 out of 171) of the proteins, the classifier correctly identifies at least 20% of the interface residues. This suggests the possibility of using such classifiers to identify potential DNA-binding motifs and to gain potentially useful insights into sequence correlates of protein-DNA interactions.
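The two inputs described above, a 9-residue sequence window and the entropy of an alignment column, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the padding symbol, the add-one smoothing, the decision threshold, and all function and class names are assumptions made for illustration, and the abstract does not say how the entropy value is fed into the classifier, so the entropy function is shown on its own. The sketch also assumes the training data contain both binding and non-binding residues.

```python
import math
from collections import defaultdict

WINDOW = 4   # neighbors on each side of the target residue, as in the abstract
PAD = "-"    # assumed symbol for window positions past the termini (also the gap symbol)
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + PAD


def windows(sequence, half_width=WINDOW):
    """Yield one window of 2 * half_width + 1 characters per residue."""
    padded = PAD * half_width + sequence + PAD * half_width
    for i in range(len(sequence)):
        yield padded[i:i + 2 * half_width + 1]


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c != PAD]
    if not residues:
        return 0.0
    counts = defaultdict(int)
    for c in residues:
        counts[c] += 1
    n = len(residues)
    return sum((k / n) * math.log2(n / k) for k in counts.values())


class WindowNaiveBayes:
    """Categorical Naive Bayes over window positions with add-one smoothing."""

    def fit(self, sequences, labels):
        # labels[k][i] is 1 if residue i of sequence k is an interface residue.
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # (class, position, residue) -> count
        for seq, labs in zip(sequences, labels):
            for win, y in zip(windows(seq), labs):
                self.class_counts[y] += 1
                for pos, aa in enumerate(win):
                    self.feature_counts[(y, pos, aa)] += 1
        return self

    def log_odds(self, window):
        """Return log P(binding | window) - log P(non-binding | window)."""
        total = sum(self.class_counts.values())
        scores = {}
        for y in (0, 1):
            score = math.log(self.class_counts[y] / total)  # class prior
            for pos, aa in enumerate(window):
                num = self.feature_counts[(y, pos, aa)] + 1
                den = self.class_counts[y] + len(ALPHABET)
                score += math.log(num / den)
            scores[y] = score
        return scores[1] - scores[0]

    def predict(self, sequence, threshold=0.0):
        """Label each residue 1 (predicted interface) or 0."""
        return [int(self.log_odds(w) > threshold) for w in windows(sequence)]


# Toy data, invented purely to show the call pattern.
toy_sequences = ["AAKRKAA", "GGGGGGG"]
toy_labels = [[0, 0, 1, 1, 1, 0, 0], [0] * 7]
model = WindowNaiveBayes().fit(toy_sequences, toy_labels)
print(model.predict("AAKRKAA"))
print(column_entropy("AAKR-"))
```

In a leave-one-out setting like the one described, `fit` would be called on all proteins except one and `predict` on the held-out protein, repeated for each protein in turn.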

Conclusion: Naïve Bayes classifiers trained to identify DNA-binding residues using sequence information offer a computationally efficient approach to identifying putative DNA-binding sites in DNA-binding proteins and recognizing potential DNA-binding motifs.
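The evaluation measures quoted in the Results (overall accuracy, correlation coefficient, specificity, and sensitivity) can all be derived from residue-level confusion counts. The sketch below is one way to compute them in Python; the function and key names are hypothetical. The abstract does not define "specificity". Because accuracy is a weighted average of sensitivity and the conventional specificity TN/(TN+FP), a 78% accuracy cannot coexist with 41% sensitivity and 44% conventional specificity, so the reported figure presumably corresponds to TP/(TP+FP). The sketch returns both quantities under separate names, and the correlation coefficient is assumed to be the Matthews correlation coefficient.

```python
import math


def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary residue labels (1 = interface)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn


def evaluate(tp, fp, tn, fn):
    """Return common binary-classification measures as a dict."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Fraction of predicted interface residues that are correct.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Conventional specificity: fraction of non-interface residues rejected.
        "true_negative_rate": tn / (tn + fp) if tn + fp else 0.0,
        # Matthews correlation coefficient; 0.0 when undefined.
        "correlation": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy labels, invented for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(evaluate(*confusion_counts(actual, predicted)))
```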


Figures

Figure 1
Visualization of predicted DNA-binding residues on 3-D Structure. The predicted interface residues are shown in red on protein surface. DNA molecules bound to the proteins are shown in blue. A: The predictions on C/Ebpβ from PDB complex 1gu4, the 3rd best out of the 179 proteins in terms of correlation coefficient. B: The predictions on I-TevI from PDB complex 1i3j, the 114th best out of the 179 proteins. Figures are generated using Protein Explorer [38].
Figure 2
Receiver Operating Characteristic curve (ROC curve) for interface residue identification.
Figure 3
Comparison of actual and predicted DNA-binding site residues for transcription factor CREB (PDB 1dh3A). PROSITE motif BZIP_BASIC (bottom row) covers many of the actual interface residues (the first row below the sequence). Note that the predictions of the Naïve Bayes classifier (the second row below the sequence) overlap with the PROSITE motif, but correspond more closely to the actual interface residues.
Figure 4
The ROC curves for the Naïve Bayes classifier and the PSSM-based classifier. The Naïve Bayes classifier uses the identities of 9 amino acid residues as input. The ROC curve for the Naïve Bayes classifier is obtained using Weka on 86 DNA-binding proteins with lengths ranging from 40 to 200 residues and pairwise sequence similarity below 30%. The ROC curve for the PSSM-based classifier is generated from the true positive, false positive, true negative, and false negative predictions obtained by submitting the 86 sequences to the online server [16] that implements the PSSM-based classifier developed by Ahmad and Sarai [15].
Figure 5
The predictions on the S subunit of the type I (R-M) system from M. jannaschii. The predicted interface residues are shown in red. The DNA molecules from the interaction model proposed by Kim et al. [17] are shown in blue. The locations of R units in Kim's model are indicated by circles. Figures are generated using Protein Explorer [38].
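The ROC curves referred to in Figures 2 and 4 plot the true positive rate against the false positive rate as the decision threshold varies. The following Python sketch shows one generic way to obtain such points from per-residue scores; it does not reproduce the Weka or server-based procedure named in the captions. The function names and toy scores are hypothetical, and the code assumes the labels contain at least one positive and one negative example.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last residue in a run of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy scores, invented for illustration; higher means "more likely interface".
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
curve = roc_points(scores, labels)
print(curve)
print(area_under_curve(curve))
```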

References

    1. Ghosh D, Papavassiliou AG. Transcription factor therapeutics: long-shot or lodestone. Curr Med Chem. 2005;12:691–701.
    2. Blancafort P, Segal DJ, Barbas CF III. Designing transcription factor architectures for drug discovery. Mol Pharmacol. 2004;66:1361–1371. doi: 10.1124/mol.104.002758.
    3. Pabo CO, Sauer RT. Transcription factors: structural families and principles of DNA recognition. Annu Rev Biochem. 1992;61:1053–1095. doi: 10.1146/annurev.bi.61.070192.005201.
    4. Laity JH, Lee BM, Wright PE. Zinc finger proteins: new insights into structural and functional diversity. Curr Opin Struct Biol. 2001;11:39–46. doi: 10.1016/S0959-440X(00)00167-6.
    5. Lawson CL, Swigon D, Murakami KS, Darst SA, Berman HM, Ebright RH. Catabolite activator protein: DNA binding and transcription activation. Curr Opin Struct Biol. 2004;14:10–20. doi: 10.1016/j.sbi.2004.01.012.
