J Mol Biol. 2007 Apr 13;367(5):1511-22.
doi: 10.1016/j.jmb.2007.01.063. Epub 2007 Jan 30.

Towards fully automated structure-based function prediction in structural genomics: a case study

James D Watson et al. J Mol Biol. 2007.

Abstract

As the global Structural Genomics projects have picked up pace, the number of structures annotated in the Protein Data Bank as hypothetical protein or unknown function has grown significantly. A major challenge now involves the development of computational methods to assign functions to these proteins accurately and automatically. As part of the Midwest Center for Structural Genomics (MCSG) we have developed a fully automated functional analysis server, ProFunc, which performs a battery of analyses on a submitted structure. The analyses combine a number of sequence-based and structure-based methods to identify functional clues. After the first stage of the Protein Structure Initiative (PSI), we review the success of the pipeline and the importance of structure-based function prediction. As a dataset, we have chosen all structures solved by the MCSG during the 5 years of the first PSI. Our analysis suggests that two of the structure-based methods are particularly successful and provide examples of local similarity that is difficult to identify using current sequence-based methods. No one method is successful in all cases, so, through the use of a number of complementary sequence and structural approaches, the ProFunc server increases the chances that at least one method will find a significant hit that can help elucidate function. Manual assessment of the results is a time-consuming process and subject to individual interpretation and human error. We present a method based on the Gene Ontology (GO) schema using GO-slims that can allow the automated assessment of hits with a success rate approaching that of expert manual assessment.

Figures

Figure 1. Breakdown of prior information for the 282 MCSG structures
The pie chart illustrates the proportion of the 282 non-redundant structures classed as “known function”, “putative function” or “unknown function”.
Figure 2
Figure 2a: EC wheel for 92 proteins of known function
The EC wheel illustrates the proportion of known-function proteins with different Enzyme Commission (E.C.) numbers. The central core corresponds to the top level of the E.C. schema and is the source of the colouring:
  1. Red = E.C. 1.-.-.- (Oxidoreductases)
  2. Blue = E.C. 2.-.-.- (Transferases)
  3. Green = E.C. 3.-.-.- (Hydrolases)
  4. Yellow = E.C. 4.-.-.- (Lyases)
  5. Purple = E.C. 5.-.-.- (Isomerases)
  6. Orange = E.C. 6.-.-.- (Ligases)
Each shell then corresponds to the next level down the E.C. schema: the second, third and finally the fourth level.
Figure 2b: Pie chart showing distribution of EC classes in the entire PDB
The proportions illustrated are taken from the number of PDB entries with each top-level E.C. number. This information is extracted from the Enzyme Structures Database at the EBI (http://www.ebi.ac.uk/thornton-srv/databases/enzymes/).
Figure 2c: Map showing the coverage of the generic GO-slim by the MCSG dataset
Any MCSG structures from the full dataset annotated with GO terms had all their GO terms extracted and the associated GO-slim terms derived from the GOA-GOslim mapping file. All GO-slims from the “Molecular Function” branch of the Gene Ontology were mapped. GO-slim terms found in the annotations of the MCSG structures are coloured green, whereas those coloured red are not covered by the MCSG dataset. The numbers in brackets correspond to the number of terms added at that point in the hierarchy by the extended GO-slim and show the spread of the additional information.
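The coverage map described for Figure 2c can be pictured as a set operation: collect the GO-slim terms reachable from each structure's GO annotations, then split the full slim into covered and uncovered terms. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the identifiers such as `GO:term_a` and `SLIM:catalytic`, and the toy mapping are all hypothetical placeholders; a real run would load the mapping from the GOA-GOslim file, whose format is not described here.

```python
from typing import Dict, Set, Tuple


def to_slim(go_terms: Set[str], slim_map: Dict[str, Set[str]]) -> Set[str]:
    """Collect the GO-slim terms associated with a set of GO terms."""
    slim: Set[str] = set()
    for term in go_terms:
        # GO terms missing from the mapping contribute nothing.
        slim |= slim_map.get(term, set())
    return slim


def slim_coverage(
    annotations: Dict[str, Set[str]],
    slim_map: Dict[str, Set[str]],
    slim_terms: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Split the slim into terms covered / not covered by the dataset."""
    seen: Set[str] = set()
    for go_terms in annotations.values():
        seen |= to_slim(go_terms, slim_map)
    covered = seen & slim_terms
    return covered, slim_terms - covered


# Hypothetical toy data: placeholder identifiers, not real GO accessions.
slim_map = {
    "GO:term_a": {"SLIM:catalytic"},
    "GO:term_b": {"SLIM:catalytic", "SLIM:hydrolase"},
    "GO:term_c": {"SLIM:binding"},
}
slim_terms = {"SLIM:catalytic", "SLIM:hydrolase", "SLIM:binding", "SLIM:transporter"}
annotations = {
    "structure_1": {"GO:term_a"},
    "structure_2": {"GO:term_b"},
}

covered, uncovered = slim_coverage(annotations, slim_map, slim_terms)
print("covered (green):", sorted(covered))
print("uncovered (red):", sorted(uncovered))
```

In this toy case the covered set corresponds to the terms that would be coloured green in the map and the uncovered set to those coloured red.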
Figure 3. ProFunc results for proteins of known function
The 92 proteins classed as having “known function” in the MCSG dataset were analysed using ProFunc. The top hit (after parsing for release dates) was classified by success and strength of hit. Those hits to hypothetical proteins or members of families/domains of unknown function are classified as “unknown”. The structure-based methods used by ProFunc are as follows:
  SSM – Secondary Structure Matching (MSDfold): fold comparison service
  ENZ – Enzyme template search (Catalytic Site Atlas data)
  LIG – Ligand binding template search (automatically generated templates)
  DNA – DNA binding template search (automatically generated templates)
  SIT – SiteSeer (“reverse template” method)
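The top-hit selection described for Figure 3 could look roughly like the sketch below. It assumes that "parsing for release dates" means discarding hits released on or after a cutoff date for the query, and that a hit counts as "unknown" when its description mentions a hypothetical protein or unknown function. Both are assumptions made for illustration; the caption does not spell out either rule. The `Hit` class, its fields, and the example hits are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Hit:
    description: str  # free-text annotation of the matched entry
    score: float      # higher means a stronger hit (assumed convention)
    released: date    # release date of the matched entry


def top_hit(hits: List[Hit], cutoff: date) -> Optional[Hit]:
    """Return the strongest hit released before the cutoff, if any."""
    eligible = [h for h in hits if h.released < cutoff]
    return max(eligible, key=lambda h: h.score, default=None)


def is_unknown(hit: Hit) -> bool:
    """Assumed rule: hits to uncharacterised entries count as 'unknown'."""
    text = hit.description.lower()
    return "hypothetical" in text or "unknown function" in text


# Hypothetical example hits, invented for illustration only.
hits = [
    Hit("hypothetical protein", 9.1, date(2002, 5, 1)),
    Hit("example hydrolase", 7.4, date(2001, 3, 15)),
    Hit("example transferase", 12.0, date(2004, 8, 20)),  # released too late
]

best = top_hit(hits, cutoff=date(2003, 1, 1))
print(best.description if best else "no eligible hit")
print("unknown" if best and is_unknown(best) else "annotated")
```

Filtering by date before ranking matters here: the strongest hit overall is excluded because it was released after the cutoff, so the reported top hit is the best of the entries that were available at the time.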
Figure 4. ROC curves for SSM and SIT based on manual function assignment
The ROC curves are plotted for SSM results and for SiteSeer (“reverse template”) results. The cut-off used for SSM is the Z-score of the hit, whereas for SiteSeer it is the E-value. The ideal curve would rise vertically from the origin and then run horizontally to the right, giving an area under the curve of 1. The plot shows that the SSM Z-score appears to be a better measure for distinguishing between true and false positives than the SiteSeer E-value.
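To make the ROC description concrete, the sketch below sweeps a cut-off over a list of scored hits and integrates the resulting curve with the trapezoidal rule. It is a generic illustration, not the procedure used in the paper. The function names and the example scores and labels are invented; the `higher_is_better` flag is an assumption that captures the difference between a Z-score (larger is stronger) and an E-value (smaller is stronger).

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float],
    labels: Sequence[int],
    higher_is_better: bool = True,
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points for a cut-off sweep."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one true and one false hit")
    # Rank hits from strongest to weakest.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=higher_is_better)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Hits with equal scores enter together so ties give one point.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented example: 1 marks a hit judged correct, 0 a hit judged incorrect.
z_scores = [9.0, 8.0, 7.0, 6.0]
z_labels = [1, 1, 0, 0]
e_values = [1e-5, 1e-3, 0.1, 1.0]
e_labels = [1, 0, 1, 0]

z_auc = auc(roc_points(z_scores, z_labels, higher_is_better=True))
e_auc = auc(roc_points(e_values, e_labels, higher_is_better=False))
print(z_auc, e_auc)
```

In the invented data the Z-scores separate correct from incorrect hits perfectly, which gives the ideal curve with area 1, while the E-values interleave them and give a smaller area.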
