Using structural motif descriptors for sequence-based binding site prediction

Andreas Henschel¹, Christof Winter, Wan Kyu Kim, Michael Schroeder

PMID: 17570148
PMCID: PMC1892084
DOI: 10.1186/1471-2105-8-S4-S5

BMC Bioinformatics. 2007 May 22;8(Suppl 4):S5.

Affiliation

¹ Biotechnological Center, TU Dresden, Tatzberg 47-51, Dresden, Germany. ah@biotec.tu-dresden.de

Abstract

Background: Many protein sequences are still poorly annotated. Functional characterization of a protein is often improved by identifying its interaction partners. Here, we aim to predict protein-protein interactions (PPI) and protein-ligand interactions (PLI) at the sequence level using 3D information. To this end, we use machine learning to compile the sequential segments that constitute the structural features of an interaction site into one profile hidden Markov model (HMM) descriptor. The resulting collection of descriptors can be used to screen sequence databases in order to predict functional sites.

Results: We generate descriptors for 740 classified types of protein-protein binding sites and for more than 3,000 protein-ligand binding sites. Cross-validation reveals that two thirds of the PPI descriptors are sufficiently conserved and significant to be used for binding site recognition. We further validate 230 PPIs that were extracted from the literature, for which we additionally identify the interface residues. Finally, we test ligand-binding descriptors for the case of ATP. On sequences with the Swiss-Prot annotation "ATP-binding", we achieve a recall of 25% with a precision of 89%, whereas Prosite's P-loop motif recognizes an equal number of hits at the expense of a much higher number of false positives (precision: 57%). Our method yields 771 hits with a precision of 96% that were not previously picked up by any Prosite pattern.
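The precision and recall figures quoted above follow the usual set-based definitions. As a minimal sketch, assuming predictions and annotations are available as sets of sequence identifiers (the function name and the toy accessions below are hypothetical and not data from the study), they could be computed like this:

```python
def precision_recall(predicted, annotated):
    """Return (precision, recall) for predicted IDs against annotated IDs."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    # Precision: fraction of predictions that carry the annotation.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of annotated sequences that were predicted.
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Toy, made-up identifiers for illustration only.
hits = {"P1", "P2", "P3", "P4"}
atp_annotated = {"P1", "P2", "P3", "P5", "P6", "P7"}

precision, recall = precision_recall(hits, atp_annotated)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.50
```

Note that a hit lacking the annotation counts as a false positive here, although it may simply reflect an incomplete annotation.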

Conclusion: The automatically generated descriptors are a useful complement to known Prosite/InterPro motifs. They serve to predict protein-protein as well as protein-ligand interactions along with their binding site residues for proteins where merely sequence information is available.

Figures

**Figure 1**
**Constructing a set of sequence profiles to represent a conserved structural feature**. Caspase's active site is highly conserved (PDB entry 1ICE; conservation levels are calculated using the von Neumann entropy and displayed in a color gradient from blue (variable) to red (conserved)). Conserved residues in close vicinity of the tetrapeptide inhibitor largely define the catalytic site environment. Caspase residues within 5 Å of the inhibitor are underlined. Segments are patched and those with low conservation are discarded to avoid insignificant hits. We add the amino acid distribution from HSSP data for each site of the remaining segments. It is thus possible to construct HMMs and visualize the profiles as sequence logos [40].

**Figure 2**
**Assessing accuracy and significance of ATP-binding descriptors**. A. Precision-recall curve for ATP-binding descriptors derived from protein structures with bound ATP or ADP, tested against Swiss-Prot and shown as a curve with red circles. Each circled point corresponds to a different E-value cutoff. The Prosite patterns for "ATP-binding" and "ADP-binding" are included as well (green crosses). Overall, Prosite achieves a recall of 31% with a precision of 62% (blue cross). For all E-values, our method performs better than Prosite. B. Distribution of E-values for the ATP-binding descriptors. To assess the significance of hits, the descriptors were tested both against Swiss-Prot (black line) and a shuffled Swiss-Prot version (red line). The cumulative number of hits below a certain E-value threshold is shown. The inset shows a magnification of the lower right corner. Below an E-value of 1 (dotted vertical line), ~53,000 hits are found in Swiss-Prot whereas only ~1,200 hits are found in the shuffled Swiss-Prot.

**Figure 3**
**Correlation of length and quality of HMM descriptors**. ATP-binding descriptors as well as face type descriptors for protein-protein interactions were run against original and shuffled versions of Swiss-Prot and uncharacterized NCBI sequences. We define the length of a profile hidden Markov model descriptor as its number of states. Quality is measured as the difference between the log E-values of the best hit against the original sequences and the best hit against the shuffled sequences. For Swiss-Prot, longer descriptors have better quality and therefore produce more significant hits. For uncharacterized sequences, this does not hold. One explanation could be that these sequences are depleted of significant matches by similarity searches.
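The quality measure in this caption can be sketched as follows. The sign convention (shuffled minus original, so that larger is better), the base-10 logarithm, the floor for E-values reported as zero, and both function names are assumptions; the caption only states that a difference of log E-values is used.

```python
import math
import random


def shuffle_sequence(sequence, seed=None):
    """Return a residue-shuffled copy that preserves amino acid composition."""
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def descriptor_quality(best_evalue_original, best_evalue_shuffled, min_evalue=1e-300):
    """Log10 E-value gap between the best shuffled hit and the best real hit."""
    original = max(best_evalue_original, min_evalue)  # guard against log10(0)
    shuffled = max(best_evalue_shuffled, min_evalue)
    return math.log10(shuffled) - math.log10(original)


# Hypothetical best-hit E-values for one descriptor.
quality = descriptor_quality(best_evalue_original=1e-12, best_evalue_shuffled=1e-2)
print(round(quality, 6))
```

Under this convention, a descriptor whose best real hit is far more significant than anything found in the shuffled database gets a large positive quality, while a value near zero means its hits are indistinguishable from chance.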

**Figure 4**
**Validation of PLI descriptors**. Prediction of sequences annotated with the Swiss-Prot keyword "ATP-binding" (or ATP/ADP as part of the catalytic activity) using Prosite patterns and multiple motif descriptors with hits below E-value 1: while both methods detect ~9,000 of all proteins annotated with this term by Swiss-Prot, Prosite produces almost seven times as many false positives.

**Figure 5**
**ATP binding motifs**. The ATP-binding descriptor derived from PDB entry 2BEK and HSSP is compared to the P-loop pattern (below). A single conflict occurs in the first position, and more specific detail is given about the second to the fifth residue. The descriptor therefore correctly detects other chromosome partitioning ATPases.

**Figure 6**
**Work flow**. a) All instances of interactions between family A and family B with identical geometric interface classification are retrieved from SCOPPI. Interface residues are indicated in the accompanying multiple sequence alignments. b) Interface columns are defined as columns with more than 50% interface residues. Interface segments are defined by including the columns flanking the interface columns. c) HMMs are constructed for each interface segment using HMMer's hmmbuild. d) HMMs are merged by insert states with high self-loop probabilities so as to model the non-interacting linker region. e) The collection of all merged HMMs constitutes the descriptor library. f) Sequence searches against Swiss-Prot and uncharacterized sequences with all descriptors were done using HMMer's hmmsearch.
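Step b) of the work flow can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the interface annotation is represented as one string per aligned sequence with `x` marking an interface residue, the fraction is taken over all rows including gapped ones, and the flank width of one column is hypothetical (the caption does not give it).

```python
def interface_columns(mask_rows, min_fraction=0.5):
    """Indices of alignment columns where more than min_fraction of rows are interface."""
    n_rows = len(mask_rows)
    n_cols = len(mask_rows[0])
    return [
        col
        for col in range(n_cols)
        if sum(row[col] == "x" for row in mask_rows) / n_rows > min_fraction
    ]


def interface_segments(columns, n_cols, flank=1):
    """Extend each interface column by `flank` columns and merge into inclusive ranges."""
    segments = []
    for col in sorted(columns):
        start, end = max(0, col - flank), min(n_cols - 1, col + flank)
        if segments and start <= segments[-1][1] + 1:
            # Overlapping or adjacent to the previous segment: merge them.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments


# Toy interface annotation for three aligned sequences of 12 columns.
mask = [
    "..xx......x.",
    "..x.......x.",
    "...x.....xx.",
]

cols = interface_columns(mask)
print(cols)                                          # [2, 3, 10]
print(interface_segments(cols, n_cols=len(mask[0])))  # [(1, 4), (9, 11)]
```

Each resulting range would then be cut out of the alignment and passed to hmmbuild as in step c).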

**Figure 7**
**A sample alignment for RuBisCo N-terminal domain**. Interacting segments are highlighted in corresponding colors in the alignment as well as in the structure. Interface segment definition is illustrated by the three lines below the alignment. Finally, the 10-fold von Neumann entropy is printed.

References

1. Galperin MY, Koonin EV. Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol. 2000;18:609–613. doi: 10.1038/76443.
2. Tong AHY, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, Chen Y, Cheng X, Chua G, Friesen H, Goldberg DS, Haynes J, Humphries C, He G, Hussein S, Ke L, Krogan N, Li Z, Levinson JN, Lu H, Menard P, Munyana C, Parsons AB, Ryan O, Tonikian R, Roberts T, Sdicu AM, Shapiro J, Sheikh B, Suter B, Wong SL, Zhang LV, Zhu H, Burd CG, Munro S, Sander C, Rine J, Greenblatt J, Peter M, Bretscher A, Bell G, Roth FP, Brown GW, Andrews B, Bussey H, Boone C. Global mapping of the yeast genetic interaction network. Science. 2004;303:808–813. doi: 10.1126/science.1091317.
3. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999;96:4285–4288. doi: 10.1073/pnas.96.8.4285.
4. Sun J, Xu J, Liu Z, Liu Q, Zhao A, Shi T, Li Y. Refined phylogenetic profiles method for predicting protein-protein interactions. Bioinformatics. 2005;21:3409–3415. doi: 10.1093/bioinformatics/bti532.
5. Fraser HB, Hirsh AE, Wall DP, Eisen MB. Coevolution of gene expression among interacting proteins. Proc Natl Acad Sci USA. 2004;101:9033–9038. doi: 10.1073/pnas.0402591101.
