1998 Nov 10;95(23):13597-602.
doi: 10.1073/pnas.95.23.13597.

Large-scale protein structure modeling of the Saccharomyces cerevisiae genome

R Sánchez et al. Proc Natl Acad Sci U S A.

Abstract

The function of a protein generally is determined by its three-dimensional (3D) structure. Thus, it would be useful to know the 3D structure of the thousands of protein sequences that are emerging from the many genome projects. To this end, fold assignment, comparative protein structure modeling, and model evaluation were automated completely. As an illustration, the method was applied to the proteins in the Saccharomyces cerevisiae (baker's yeast) genome. It resulted in all-atom 3D models for substantial segments of 1,071 (17%) of the yeast proteins, only 40 of which have had their 3D structure determined experimentally. Of the 1,071 modeled yeast proteins, 236 were related clearly to a protein of known structure for the first time; 41 of these previously have not been characterized at all.

Figures

Figure 1
Predicting the overall accuracy of comparative models. The good and bad models for proteins of known structure are used to tune the prediction of reliability of a model when the actual structure is not known (Fig. 2). See Materials and Methods for details. (A) A rule for assigning a comparative model into either the good or bad class, based on its Q_SCORE. Inset shows the distributions of Q_SCORE for the good and bad models with 100 to 150 residues. Such distributions are used with the Bayes theorem to calculate the posterior probability that a model is good, given that it has a certain Q_SCORE value, p(GOOD/Q_SCORE). The main plot shows the percentages of false positives (bad models classified as good) and false negatives (good models classified as bad) as a function of sequence length. The curves were obtained by the jack-knife procedure. (B) A rule for estimating the accuracy of a reliable model (as predicted by its Q_SCORE), based on the percentage sequence identity to the template. The overlaps of an experimentally determined protein structure with its model (red continuous line) and with a template on which the model was based (green dashed line) are shown as a function of the target–template sequence identity. This identity was calculated from the modeling alignment. The structure overlap is defined as the fraction of the equivalent Cα atoms. For comparison of the model with the actual structure (filled circles), two Cα atoms were considered equivalent if they were within 3.5 Å of each other and belonged to the same residue. For comparison of the template structure with the actual target structure (open circles), two Cα atoms were considered equivalent if they were within 3.5 Å after alignment and rigid-body superposition by the align3d command in modeller (15). The points correspond to the median values, and the error bars in the positive and negative directions correspond to the average positive and negative differences from the median, respectively. 
Points labeled α, β, and γ correspond to the models in (C). The empty circle at 25% sequence identity corresponds to the unusually accurate model in Fig. 3B. (C) The range of accuracy for reliable comparative models is illustrated by a difficult, medium, and easy case. The Cα backbones of the models (red) for YKR066C and YDR226W and all mainchain atoms for YER148W are superposed with those of the actual structures (blue). The PDB codes of the target and template structures also are shown (target/template). The three target–template sequence identities are indicated in B (black filled circles). The number of yeast ORF models at each accuracy level can be determined from the red curve in B, or the sample comparisons in C, combined with Fig. 2A.
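The two quantities described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the histogram bin width, the use of class frequencies in the calibration set as priors, and the choice of the actual structure's residue count as the denominator of the overlap are all assumptions made for illustration. The overlap function also assumes both coordinate sets are already superposed in a common frame.

```python
from math import dist, floor


def posterior_good(q, good_scores, bad_scores, bin_width=0.5):
    """Estimate p(GOOD | Q_SCORE = q) with Bayes' theorem.

    Likelihoods come from histograms of Q_SCORE for calibration models of
    known quality. The bin width and the priors (class frequencies in the
    calibration set) are illustrative assumptions.
    """
    target_bin = floor(q / bin_width)
    n_good, n_bad = len(good_scores), len(bad_scores)
    # Likelihoods: fraction of each class whose score falls in the bin of q.
    like_good = sum(floor(s / bin_width) == target_bin for s in good_scores) / n_good
    like_bad = sum(floor(s / bin_width) == target_bin for s in bad_scores) / n_bad
    prior_good = n_good / (n_good + n_bad)
    prior_bad = 1.0 - prior_good
    evidence = like_good * prior_good + like_bad * prior_bad
    if evidence == 0.0:
        return None  # no calibration models fall in this bin
    return like_good * prior_good / evidence


def structure_overlap(model_ca, actual_ca, cutoff=3.5):
    """Fraction of residues whose model and actual C-alpha atoms are close.

    Both arguments map a residue identifier to an (x, y, z) tuple in
    angstroms. Atoms are compared residue by residue, as in the
    model-versus-structure comparison; the denominator (residues in the
    actual structure) is an assumption.
    """
    shared = model_ca.keys() & actual_ca.keys()
    close = sum(dist(model_ca[r], actual_ca[r]) <= cutoff for r in shared)
    return close / len(actual_ca)
```

The template-versus-target comparison in the caption differs in that equivalences come from a structural alignment rather than from residue identity, so it would need an alignment step before the distance test.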
Figure 2
Protein structure models for yeast ORFs. (A) Distribution of the sequence identity between the models and the corresponding templates as a function of model sequence length. The 3,992 reliable models for substantial segments of 1,071 different ORFs that are predicted to be based on a correct template and approximately correct alignment are represented by the green bars, and the 4,588 unreliable models that are predicted to be based on a mostly incorrect alignment or an incorrect template are represented by the red bars. The last histogram at label “All/6” is the sum of the other six histograms divided by six. (B) The corresponding distribution of the alignment significance score calculated by the program align (13).
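The "All/6" histogram is a bin-wise average of the six length-class histograms. A minimal sketch of that arithmetic follows; the function name is hypothetical, and any counts passed to it are the caller's own, not values from the paper.

```python
def average_histogram(histograms):
    """Return the bin-wise sum of equally binned histograms divided by their number."""
    if not histograms:
        raise ValueError("at least one histogram is required")
    if len({len(h) for h in histograms}) != 1:
        raise ValueError("all histograms must have the same number of bins")
    n = len(histograms)
    # zip(*histograms) groups the counts that belong to the same bin.
    return [sum(bin_counts) / n for bin_counts in zip(*histograms)]
```

Dividing by six keeps the combined histogram on the same vertical scale as the individual panels, which makes it directly comparable to them.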
Figure 3
Sample models calculated before the crystallographic structures have been deposited to PDB. (A) A model for the yeast prohormone-processing carboxypeptidase (YGL203C, violet) is compared with its actual crystallographic structure (1ac5, green) (38). The model was constructed based on the crystal structure of the yeast serine carboxypeptidase (1cpy) with which it shares only 25% sequence identity. Although the overall structural overlap of the model and the actual structure is only 63%, the active site (Inset) and the neighboring residues have been modeled with useful accuracy; for example, it is possible to use the model to plan site-directed mutagenesis experiments for assessing residues critical for catalysis and binding specificity. The model also illustrates that the functionally important regions of the molecule tend to be modeled more accurately than the rest of the protein (Fig. 1B) because they are frequently more conserved in evolution than the rest of the fold. (B) A model for the yeast multi-catalytic protease (YJL001W, red) is compared with its actual crystallographic structure (1rypH) (30). Despite a low sequence identity of 24% to the template structure (1pmaB), the model overlaps with the actual x-ray structure in 92% of the residues (point δ in Fig. 1B). It was possible to predict that this particular model was unusually accurate given its sequence similarity to the template because it had a favorable Z-score of −8.3 and an energy profile with only one positive peak (19). The YJL001W subunit is part of the 20S proteasome, a highly ordered ring-shaped structure consisting of 14 similar subunits, all of which have been modeled in this study. The models are sufficiently accurate for use with protein–protein docking programs, which in turn are likely to predict correctly at least some of the interface residues between the subunits (17).
Figure 4
Modeling a putative interaction of a predicted YDL117W SH3 domain with a proline-rich peptide. A segment in the yeast ORF YDL117W sequence (Top) was predicted to be remotely related to the SH3 domains, many of which have known 3D structure (Table 1). The automated prediction was possible because of the sensitivity afforded by evaluating a 3D model implied by the match. The 3D model of the SH3 domain in turn allowed us to address the biochemical function of YDL117W by calculating a 3D model of a complex between the predicted SH3 domain and a putative ligand, a proline-rich peptide (Middle). Inspection of the YDL117W sequence revealed that there is a proline-rich segment downstream from the putative SH3 domain (PLPPLPPLP, positions 212–220). Because this peptide contains the signature PXXP sequence typical of the SH3 binding peptides (39), it was the ligand chosen for the modeling of the complex; both inter- and intramolecular interactions between SH3 domains and Pro-rich peptides already have been documented (39). A model of the complex was obtained by the same comparative method as the model of the SH3 domain (15), relying on the crystallographic structure of the complex between the FYN SH3 domain and its peptide ligand (PPAYPPPPVP) (40). The predicted SH3 domain is shown in the surface representation (41), with the ball-and-stick model of the peptide (red) lying in the binding site. The SH3 residues making hydrophobic contacts and hydrogen bonds to the ligand peptide are colored in green and blue, respectively. The bottom panel shows a schematic representation of the SH3-peptide interaction (42). The peptide atoms that interact with the SH3 residues are shown as filled spheres, hydrogen bonds are represented by dashed lines, and hydrophobic interactions are indicated by the spiked semicircles. This model facilitates designing experiments such as site-directed mutagenesis for mapping of functionally important residues on the SH3 domain and its ligand.
This should be compared to the starting point at which no functional information about this ORF or about the proteins related to it was known. More generally, the wealth of information in the bottom two panels relative to the top, sequence-only panel is a case in point for the utility of structural models in planning biological experiments (see also text). For the many proteins whose structures have not been determined by experiment, maximal structural information is obtained by both (i) establishing a match to a known protein structure and (ii) calculating an all-atom 3D model based on that match by using the methods described in this paper.
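The PXXP signature mentioned in this caption can be located with a simple pattern search. The sketch below is only an illustration of that check, not the procedure used in the paper. The function name and the `offset` parameter are hypothetical, and X is taken to mean any single residue.

```python
import re

# Lookahead so that overlapping PXXP occurrences are all reported.
PXXP = re.compile(r"(?=(P..P))")


def find_pxxp(sequence, offset=1):
    """Return (position, motif) pairs for every PXXP match in `sequence`.

    `offset` is the sequence position of the first residue, so that hits in
    a fragment can be reported in full-length numbering.
    """
    return [(m.start() + offset, m.group(1)) for m in PXXP.finditer(sequence)]


# The proline-rich YDL117W segment quoted in the caption (positions 212-220).
hits = find_pxxp("PLPPLPPLP", offset=212)
# hits == [(212, 'PLPP'), (214, 'PPLP'), (215, 'PLPP'), (217, 'PPLP')]
```

The same call applied to the FYN peptide ligand quoted in the caption, `find_pxxp("PPAYPPPPVP")`, also returns matches, which is consistent with using that complex as the template.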

References

    1. Oliver S G. Nature (London) 1996;379:597–600.
    2. Koonin E V, Mushegian A R. Curr Opin Gen Dev. 1996;6:757–762.
    3. Dujon B. Trends Genet. 1996;12:263–270.
    4. Orengo C A, Jones D T, Thornton J M. Nature (London) 1994;372:631–634.
    5. Miklos G L G, Rubin G M. Cell. 1996;86:521–529.
