BMC Struct Biol. 2010 Feb 2;10:4.
doi: 10.1186/1472-6807-10-4.

Identification of recurring protein structure microenvironments and discovery of novel functional sites around CYS residues

Shirley Wu et al. BMC Struct Biol. 2010.

Abstract

Background: The emergence of structural genomics presents significant challenges in the annotation of biologically uncharacterized proteins. Unfortunately, our ability to analyze these proteins is restricted by the limited catalog of known molecular functions and their associated 3D motifs.

Results: In order to identify novel 3D motifs that may be associated with molecular functions, we employ an unsupervised, two-phase clustering approach that combines k-means and hierarchical clustering with knowledge-informed cluster selection and annotation methods. We applied the approach to approximately 20,000 cysteine-based protein microenvironments (3D regions 7.5 Å in radius) and identified 70 interesting clusters, some of which represent known motifs (e.g., metal binding and phosphatase activity), and some of which are novel, including several zinc binding sites. Detailed annotation results are available online for all 70 clusters at http://feature.stanford.edu/clustering/cys.

Conclusions: The use of microenvironments instead of backbone geometric criteria enables flexible exploration of protein function space, and detection of recurring motifs that are discontinuous in sequence and diverse in structure. Clustering microenvironments may thus help to functionally characterize novel proteins and better understand the protein structure-function relationship.

Figures

Figure 1
Overview of functional site discovery approach. Starting from thousands of protein microenvironments, we use k-means clustering to group them into coarse clusters. Each coarse cluster is then hierarchically clustered, and optimal clusters are identified using a scoring function that incorporates knowledge from scientific literature. These clusters are annotated using information from literature, Swiss-Prot records, and PDB HETATM data to produce novel individual site annotations and potentially novel functional motifs.
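The two-phase idea in this overview can be illustrated with a minimal sketch: a coarse k-means pass, followed by hierarchical (average-linkage) clustering inside each coarse cluster. This is not the authors' implementation. The toy 2D points stand in for microenvironment feature vectors, and the fixed distance `cutoff` is an assumed placeholder for the literature-informed scoring function that the paper uses to select optimal clusters. The function names `kmeans`, `agglomerate`, and `two_phase` are hypothetical.

```python
import random
from math import dist


def kmeans(points, k, iters=20, seed=0):
    """Phase 1: coarse k-means. Returns non-empty clusters as lists of indices."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    dim = len(points[0])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            groups[nearest].append(i)
        for c, members in enumerate(groups):
            if members:  # keep the previous center if a cluster empties out
                centers[c] = tuple(
                    sum(points[i][d] for i in members) / len(members)
                    for d in range(dim)
                )
    return [g for g in groups if g]


def agglomerate(points, indices, cutoff):
    """Phase 2: average-linkage merging within one coarse cluster.

    Merging stops once the closest pair of clusters is farther apart than
    `cutoff` (an assumed stand-in for knowledge-based cluster selection).
    """
    clusters = [[i] for i in indices]

    def linkage(a, b):
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > cutoff:
            break
        merged = clusters.pop(y)  # y > x, so index x is unaffected
        clusters[x].extend(merged)
    return clusters


def two_phase(points, k, cutoff):
    """Run coarse k-means, then refine every coarse cluster hierarchically."""
    fine = []
    for coarse in kmeans(points, k):
        fine.extend(agglomerate(points, coarse, cutoff))
    return fine


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
subclusters = two_phase(points, k=2, cutoff=2.0)
print(subclusters)
```

Running the coarse pass first keeps the pairwise-distance work of the hierarchical step confined to each coarse cluster, which is the practical reason for a two-phase design when the input contains thousands of microenvironments.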
Figure 2
Functional coherence of random, functional, and dilute functional clusters. a) We show median functional coherence scores for random clusters, as well as clusters derived from functional site patterns. "PROSITE min" refers to the minimum cluster size for each PROSITE pattern cluster in Table 1 (derived from training sets used for existing FEATURE models [16]), while "PROSITE max" refers to the maximum size of each cluster. The PROSITE subsets were randomly sampled from the max PROSITE clusters, while the random clusters were randomly sampled from all Swiss-Prot proteins. The median functional coherence for the random clusters is clearly much lower than that for clusters derived from PROSITE. b) We plotted functional coherence as a function of percent signal. We decreased functional signal by randomly replacing members of the six "PROSITE min" clusters with either structurally similar proteins (left), or random proteins (right). Functional coherence decreases exponentially as the proportion of biological signal decreases.
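The dilution experiment in panel b can be sketched as follows: keep a chosen fraction of a cluster's original members and replace the rest with random proteins. This is an illustration only. The protein identifiers and annotation terms are invented, and `toy_coherence` (the fraction of member pairs sharing an annotation term) is a hypothetical stand-in, not the literature-based functional coherence score used in the paper. It is not expected to reproduce the exponential decay reported in the figure.

```python
import random


def dilute(cluster, background, signal_fraction, seed=0):
    """Keep a fraction of the original members; fill the rest from background."""
    rng = random.Random(seed)
    n_keep = round(signal_fraction * len(cluster))
    kept = rng.sample(cluster, n_keep)
    pool = [p for p in background if p not in cluster]
    return kept + rng.sample(pool, len(cluster) - n_keep)


def toy_coherence(members, annotations):
    """Hypothetical score: fraction of member pairs sharing an annotation term."""
    pairs = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
    if not pairs:
        return 0.0
    shared = sum(1 for a, b in pairs if annotations[a] & annotations[b])
    return shared / len(pairs)


# Invented example: four members share one term; background terms are unique.
cluster = ["P1", "P2", "P3", "P4"]
background = ["R1", "R2", "R3", "R4", "R5", "R6"]
annotations = {p: {"zinc binding"} for p in cluster}
annotations.update({r: {"term_" + r} for r in background})

for fraction in (1.0, 0.5, 0.0):
    diluted = dilute(cluster, background, fraction)
    print(fraction, toy_coherence(diluted, annotations))
```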
Figure 3
Two distinct clusters for copper binding. (a) Clust33-Sub49 consists of copper-binding environments from blue copper proteins involved in electron transport. (b) Clust1-Sub13 consists of copper-binding environments from multicopper oxidase proteins, so named because they contain multiple copper centers. The mode of binding for both types of proteins is similar. All microenvironment images were generated using PyMol [59].
Figure 4
Different types of zinc binding sites. Our cluster selection approach divides several clusters into smaller groups of zinc binding site environments. Many of these represent different types of zinc binding sites: (from left to right) coordination by four CYS residues, coordination by three CYS and one HIS residue, coordination by two CYS and two HIS residues (C2H2 type), and coordination of multiple zinc ions by many diverse residues, including CYS, HIS, ASP, GLU, and water.
Figure 5
Potentially novel zinc binding sites in Clust1-Sub53. We predict zinc binding sites for (from left to right) structures 1GY8:A (no Swiss-Prot accession number) at CYS274, 1UC2:A [Swiss-Prot:O59245] at CYS98, and 1NYQ:A [Swiss-Prot:Q8NW68] at CYS181 based on zinc binding for other microenvironments in this cluster. Features supporting this prediction include the presence of multiple HIS residues and occasionally ASP or GLU, all known to coordinate zinc.
Figure 6
Clust8-Sub25 - Novel microenvironment motif with a potential structural role. Five representative microenvironments from a total of 11 are shown. This set of microenvironments is characterized by the central CYS based in a helix with the sidechain surrounded by an abundance of aliphatic, hydrophobic sidechains (ILE, LEU, VAL). Cysteines are often important for stabilizing protein structures, and the absence of reactive sidechains combined with the striking similarity between members of this cluster suggest a potential structural role for this microenvironment.
Figure 7
Clust5-Sub70 - potential TYR autophosphorylation site. This cluster contains 12 microenvironments, eight of which belong to tyrosine kinases. In the eight kinase microenvironments, the CYS is on a loop next to a helix containing a TYR residue; the environment as a whole is surface-exposed and contains additional sulfur-containing residues. From left to right, we show 1K9A:A [Swiss-Prot:P32577], in which TYR416 is annotated as a putative autophosphorylation site (by similarity), 1LUF:A [Swiss-Prot:Q62838], in which TYR831 is not annotated as a potential phosphorylation site, and 1Z45:A [Swiss-Prot:P04397], a yeast aldose 1-epimerase, which is not a TYR kinase. There is, however, a surface-exposed TYR in a loop environment with an additional sulfur-containing residue.
Figure 8
Clust36-Sub127 - Novel microenvironment motif with a potential functional role. This microenvironment motif is surface exposed and contains an ASP (red) and a LYS (blue) around the central CYS (yellow) in a potentially functional triad in four out of five cases. As these are all residues known to participate in chemical reactions, it is possible there is an active role for this recurring microenvironment.
Figure 9
Example annotation output for Clust21-Sub27, TYR phosphatase active sites. The HTML output for the cluster annotation method is shown for a tyrosine phosphatase active site cluster. A summary page showing general cluster information and top significant terms for each annotation type contains links to more detailed information for each type of annotation, including lists of proteins mapped to each annotation term. Detailed literature output shows the proteins and PMIDs contributing to each annotation term and abstract text for each PMID.

References

    1. Hendrickson WA. Impact of structures from the Protein Structure Initiative. Structure. 2007;15(12):1528–1529. doi: 10.1016/j.str.2007.11.006.
    2. Lattman E. The state of the Protein Structure Initiative. Proteins. 2004;54(4):611–615. doi: 10.1002/prot.20000.
    3. Brenner SE. A tour of structural genomics. Nat Rev Genet. 2001;2(10):801–809. doi: 10.1038/35093574.
    4. Marsden RL, Lewis TA, Orengo CA. Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics. 2007;8:86.
    5. Sonnhammer E, Eddy S, Birney E, Bateman A, Durbin R. Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res. 1998;26:320–322. doi: 10.1093/nar/26.1.320.
