Proc Natl Acad Sci U S A. 2000 Mar 14;97(6):2450-5.
doi: 10.1073/pnas.050589297.

Selecting protein targets for structural genomics of Pyrobaculum aerophilum: validating automated fold assignment methods by using binary hypothesis testing

P Mallick et al. Proc Natl Acad Sci U S A. 2000.

Abstract

Three-dimensional protein folds were assigned to all ORFs of the recently sequenced genome of the hyperthermophilic archaeon Pyrobaculum aerophilum. Binary hypothesis testing was used to estimate a confidence level for each assignment. A separate test was conducted to assign a probability for whether each sequence has a novel fold, i.e., one that is not yet represented in the experimental database of known structures. Of the 2,130 predicted nontransmembrane proteins in this organism, 916 matched a fold at a cumulative 90% confidence level, and 245 could be assigned at a 99% confidence level. Likewise, 286 proteins were predicted to have a previously unobserved fold with a 90% confidence level, and 14 at a 99% confidence level. These statistically based tools are combined with homology searches against the Online Mendelian Inheritance in Man (OMIM) human genetics database and other protein databases for the selection of attractive targets for crystallographic or NMR structure determination. Results of these studies have been collated and placed at http://www.doe-mbi.ucla.edu/people/parag/P A_HOME/, the University of California, Los Angeles-Department of Energy Pyrobaculum aerophilum web site.

Figures

Figure 1
(A) Distributions of fold assignment scores for correct (dashed line) and incorrect (solid line) matches. A test set of 3,285 experimentally determined domain-folds was used to generate an exhaustive set of 10,784,656 (3,285 × 3,284) sequence–structure assignment pairs, excluding the assignment of any sequence to its own structure. Each pair was assigned a sequence–structure compatibility Z-score by the SDP method (6). Structures were compared with the dali algorithm and are designated structurally similar if their dali Z-score is greater than or equal to 2. We assert the binary hypothesis that an assigned structure for sequence A matches the true structure of A (dashed line) or does not match the true structure of A (solid line). The distributions of scores for the two cases show that similar pairs have higher sequence–structure match scores than do nonsimilar pairs. (Inset) Fold assignment probability curve. These distributions give the likelihood that an assigned fold for a protein A matches the actual structure of protein A as a function of the sequence–structure Z-score, as explained in the text. (B) Probability of correct fold assignment for fraction of genome proteins assigned. Folds were assigned to each of the 2,130 predicted soluble ORFs within the PA genome. Each sequence within the genome was assigned to the structure with the highest sequence–structure compatibility Z-score. Z-scores map to probability values via the Inset of A. Each bar shows the number of ORFs assigned as a function of probability value. Summing the bar chart (dark line) shows the fraction of the genome assigned a fold as a function of probability value. Summing the bar chart weighted by probability values shows the cumulative number of assignments predicted as a function of probability value (dashed line).
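The mapping from Z-score to assignment probability described in the Inset of Figure 1A can be illustrated with a small sketch. This is not the authors' procedure, which the caption defers to the main text; it assumes the probability in each Z-score bin is simply the fraction of test pairs in that bin that were correct matches. The function name `posterior_curve`, the bin edges, and the toy scores are all hypothetical.

```python
from bisect import bisect_right

def posterior_curve(correct_scores, incorrect_scores, bin_edges):
    """Estimate P(correct | Z in bin) per bin from two labeled score sets.

    Assumption: the estimate is the raw fraction of correct pairs in each
    half-open bin [edge_i, edge_{i+1}); empty bins yield None.
    """
    n_bins = len(bin_edges) - 1
    correct_counts = [0] * n_bins
    incorrect_counts = [0] * n_bins
    for scores, counts in ((correct_scores, correct_counts),
                           (incorrect_scores, incorrect_counts)):
        for z in scores:
            i = bisect_right(bin_edges, z) - 1  # index of the bin containing z
            if 0 <= i < n_bins:                 # ignore out-of-range scores
                counts[i] += 1
    curve = []
    for c, w in zip(correct_counts, incorrect_counts):
        total = c + w
        curve.append(c / total if total else None)
    return curve

# Toy Z-scores (invented for illustration, not data from the paper).
correct = [3.0, 4.5, 5.2, 6.1, 7.0]
incorrect = [0.5, 1.2, 2.1, 3.4, 1.8, 4.1]
edges = [0, 2, 4, 6, 8]
curve = posterior_curve(correct, incorrect, edges)
```

In this toy case the estimated probability rises with Z-score, which is the qualitative behavior the caption attributes to the two separated distributions.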
Figure 2
(A) Distribution of Z-max scores for similar folds included (solid line) and excluded (dashed line) from the fold library. Two distributions of maximum nonself Z-scores were obtained: one where a similar fold exists in the training set, and a second where similar structures have been excluded from the library. The separation between these two distributions shows that the Z-max score is a good indicator of the presence of similar folds in the library. (B) Probability of correct novel fold assignment for fraction of genome proteins assigned. The probability of a novel fold was determined for each soluble ORF product of PA. The bar chart shows the number of ORFs predicted to have novel folds as a function of probability value. The fraction of the genome predicted to be novel as a function of probability value is given by the solid curve obtained by summing the bar chart. A sum of the bar chart, weighted by probability value, shows the cumulative number of accurate predictions as a function of probability value (dashed line).
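The two running sums described for panel B of Figures 1 and 2 (a plain sum of the bar chart, and a sum weighted by probability value) can be sketched as follows. The helper name `cumulative_assignments`, the direction of accumulation (all ORFs at or above a probability threshold), and the toy probabilities are assumptions made for illustration only.

```python
def cumulative_assignments(probabilities, threshold):
    """Return (count, expected_correct) for assignments with p >= threshold.

    count            : plain sum of the bar chart above the threshold
    expected_correct : the same sum weighted by each probability value
    """
    selected = [p for p in probabilities if p >= threshold]
    return len(selected), sum(selected)

# Toy per-ORF probabilities (invented, not taken from the paper).
probs = [0.99, 0.95, 0.90, 0.60, 0.30]
count, expected_correct = cumulative_assignments(probs, 0.90)
```

Here three assignments pass the 0.90 threshold, and the weighted sum (about 2.84) estimates how many of those three are expected to be correct.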
Figure 3
The 2,681 ORFs of the genome of PA partitioned into homologs of human disease proteins (208, 8%, white region), membrane-spanning proteins (320, 12%, horizontal line region), and proteins having >4 homologs in other organisms (482, 18%, vertical line region). Attractive initial targets for structural genomics are proteins without transmembrane regions, with human disease relevance, and having many homologs in other genomes (422, 16%, star region). Additional ORFs had both >4 homologs in other organisms and transmembrane helix regions (102, 4%, crosshatch region), or both human disease homologs and transmembrane helix regions (60, 2%, light gray region). A few proteins had >4 homologs in other organisms, human disease homologs, and transmembrane helix regions (69, 3%, darker gray region). There are 1,018 ORFs belonging to none of these categories (37%, black region).
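The region counts in the Figure 3 caption can be cross-checked with a few lines of arithmetic. The sketch below assumes the first three counts (208, 320, 482) refer to ORFs in exactly one category, which is the reading under which the eight regions are disjoint; the dictionary layout and variable names are illustrative.

```python
# Keys are (disease_homolog, transmembrane, many_homologs) flags;
# values are the ORF counts quoted in the Figure 3 caption.
regions = {
    (True, False, False): 208,    # disease homolog only (assumed exclusive)
    (False, True, False): 320,    # membrane-spanning only (assumed exclusive)
    (False, False, True): 482,    # >4 homologs only (assumed exclusive)
    (True, False, True): 422,     # disease homolog and >4 homologs, no TM
    (False, True, True): 102,     # >4 homologs and TM
    (True, True, False): 60,      # disease homolog and TM
    (True, True, True): 69,       # all three
    (False, False, False): 1018,  # none of the categories
}

total = sum(regions.values())
transmembrane = sum(n for (_, tm, _), n in regions.items() if tm)
non_transmembrane = total - transmembrane
```

Under that reading the regions sum to the 2,681 ORFs in the caption, and removing the 551 ORFs with transmembrane regions leaves 2,130, the number of nontransmembrane proteins given in the abstract.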
Figure 4
A test of psi-blast for automated fold assignment. Using psi-blast with the SwissProt sequence database, we used 3,285 sequences from our training set to generate an exhaustive set of 10,784,656 (3,285 × 3,284) nonself sequence–sequence assignment pairs. Using binary hypothesis testing, we divided the resulting set of scores into two cases. For the set of correct matches (dashed line), the actual structures for the two sequences were similar as determined by a dali Z-score greater than or equal to 2. The set of incorrect matches is also shown (solid line). The reduced separation of these cases compared with the results for SDP shown in Fig. 1A implies that confidence intervals may be more difficult to generate by using psi-blast.

References

    1. Fitz-Gibbon S, Choi A J, Miller J H, Stetter K O, Simon M I, Swanson R, Kim U J. Extremophiles. 1997;1:36–51.
    2. Holm L, Sander C. Trends Biochem Sci. 1995;20:478–480.
    3. Holm L, Sander C. Nucleic Acids Res. 1996;24:206–209.
    4. Holm L, Sander C. Nucleic Acids Res. 1997;25:231–234.
    5. Holm L, Sander C. Nucleic Acids Res. 1998;26:316–319.
