Towards fully automated structure-based function prediction in structural genomics: a case study

James D Watson¹, Steve Sanderson, Alexandra Ezersky, Alexei Savchenko, Aled Edwards, Christine Orengo, Andrzej Joachimiak, Roman A Laskowski, Janet M Thornton

PMID: 17316683
PMCID: PMC2566530
DOI: 10.1016/j.jmb.2007.01.063

James D Watson et al. J Mol Biol. 2007 Apr 13;367(5):1511-22. doi: 10.1016/j.jmb.2007.01.063. Epub 2007 Jan 30.

Affiliation

¹ EMBL--European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. watson@ebi.ac.uk

Abstract

As the global Structural Genomics projects have picked up pace, the number of structures annotated in the Protein Data Bank as hypothetical protein or unknown function has grown significantly. A major challenge now involves the development of computational methods to assign functions to these proteins accurately and automatically. As part of the Midwest Center for Structural Genomics (MCSG) we have developed a fully automated functional analysis server, ProFunc, which performs a battery of analyses on a submitted structure. The analyses combine a number of sequence-based and structure-based methods to identify functional clues. After the first stage of the Protein Structure Initiative (PSI), we review the success of the pipeline and the importance of structure-based function prediction. As a dataset, we have chosen all structures solved by the MCSG during the 5 years of the first PSI. Our analysis suggests that two of the structure-based methods are particularly successful and provide examples of local similarity that is difficult to identify using current sequence-based methods. No one method is successful in all cases, so, through the use of a number of complementary sequence and structural approaches, the ProFunc server increases the chances that at least one method will find a significant hit that can help elucidate function. Manual assessment of the results is a time-consuming process and subject to individual interpretation and human error. We present a method based on the Gene Ontology (GO) schema using GO-slims that can allow the automated assessment of hits with a success rate approaching that of expert manual assessment.
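The abstract does not spell out how the GO-slim comparison is scored, so the following is only a minimal sketch of the general idea: project the GO terms of the query and of a hit onto a smaller GO-slim vocabulary, then count the hit as functionally informative if the two projections overlap. The mapping table `TOY_SLIM_MAP` and the function names are hypothetical and for illustration only; a real run would read the GOA GO-slim mapping file instead.

```python
from typing import Dict, Iterable, Set

# Toy GO term -> GO-slim term mapping. This is an illustrative assumption,
# NOT the content of the real GOA-GOslim mapping file.
TOY_SLIM_MAP: Dict[str, Set[str]] = {
    "GO:0004252": {"GO:0008233"},  # serine-type endopeptidase activity -> peptidase activity
    "GO:0004175": {"GO:0008233"},  # endopeptidase activity -> peptidase activity
    "GO:0003677": {"GO:0003676"},  # DNA binding -> nucleic acid binding
    "GO:0016301": {"GO:0016740"},  # kinase activity -> transferase activity
}


def to_slim(go_terms: Iterable[str], slim_map: Dict[str, Set[str]]) -> Set[str]:
    """Project specific GO terms onto their GO-slim terms (unmapped terms are dropped)."""
    slim: Set[str] = set()
    for term in go_terms:
        slim |= slim_map.get(term, set())
    return slim


def hit_supports_function(
    known_terms: Iterable[str],
    hit_terms: Iterable[str],
    slim_map: Dict[str, Set[str]] = TOY_SLIM_MAP,
) -> bool:
    """Return True if query and hit share at least one GO-slim term."""
    return bool(to_slim(known_terms, slim_map) & to_slim(hit_terms, slim_map))
```

Comparing at the GO-slim level rather than on exact GO terms tolerates hits that are annotated at a different depth of the ontology, which is one plausible reason for using slims in an automated assessment.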

Figures

**Figure 1. Breakdown of prior information for the 282 MCSG structures**
The pie chart illustrates the proportion of the 282 non-redundant structures classed as “known function”, “putative function” or “unknown function”.

**Figure 2**
Figure 2a: EC wheel for 92 proteins of known function. The EC wheel illustrates the proportion of known-function proteins with different Enzyme Commission numbers. The central core corresponds to the top level of the E.C. schema and is the source of the colouring:
Red = E.C. 1.-.-.- (Oxidoreductases)
Blue = E.C. 2.-.-.- (Transferases)
Green = E.C. 3.-.-.- (Hydrolases)
Yellow = E.C. 4.-.-.- (Lyases)
Purple = E.C. 5.-.-.- (Isomerases)
Orange = E.C. 6.-.-.- (Ligases)
Each shell then corresponds to the next stage down the E.C. schema through the second, third and finally the fourth level.

Figure 2b: Pie chart showing distribution of EC classes in the entire PDB. The proportions illustrated are taken from the numbers of PDB entries with each top level E.C. number. This information is extracted from the Enzyme Structures Database at the EBI (http://www.ebi.ac.uk/thornton-srv/databases/enzymes/).

Figure 2c: Map showing the coverage of the generic GO-slim by the MCSG dataset. Any MCSG structures from the full dataset annotated with GO terms had all their GO terms extracted and the associated GO-slim terms derived from the GOA-GOslim mapping file. All GO-slims from the “Molecular Function” branch of the Gene Ontology were mapped. GO-slim terms found in the annotations of the MCSG structures are coloured green, whereas those coloured red are not covered by the MCSG dataset. The numbers in brackets correspond to the number of terms added at that point in the hierarchy by the extended GO-slim and show the spread of the additional information.
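The shell structure of the EC wheel can be illustrated with a short sketch that tallies EC numbers by prefix at each of the four levels. The function name `ec_shell_counts` and the example EC numbers are assumptions made for illustration; they are not taken from the MCSG dataset.

```python
from collections import Counter
from typing import Iterable, List


def ec_shell_counts(ec_numbers: Iterable[str]) -> List[Counter]:
    """Count EC numbers by prefix at each of the four E.C. levels.

    shells[0] holds top-level classes (e.g. "3"), shells[3] holds full
    four-level numbers (e.g. "3.4.21.4"). A "-" marks an unassigned
    level, so counting stops there for that entry.
    """
    shells: List[Counter] = [Counter() for _ in range(4)]
    for ec in ec_numbers:
        parts = ec.split(".")
        for level in range(4):
            if level >= len(parts) or parts[level] == "-":
                break
            shells[level][".".join(parts[: level + 1])] += 1
    return shells


# Hypothetical input: three hydrolases (one only partly classified) and one transferase.
example = ["3.4.21.4", "3.4.21.5", "3.1.-.-", "2.7.1.1"]
shells = ec_shell_counts(example)
```

Here `shells[0]` corresponds to the central core of the wheel and each later entry to the next shell outwards.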

**Figure 3. ProFunc results for proteins of known function**
The 92 proteins classed as having “known function” in the MCSG dataset were analysed using ProFunc. The top hit (after parsing for release dates) was classified by success and strength of hit. Those hits to hypothetical proteins or members of families/domains of unknown function are classified as “unknown”. The structure-based methods used by ProFunc are as follows:
SSM – Secondary Structure Matching (MSDfold): fold comparison service
ENZ – Enzyme template search (Catalytic Site Atlas data)
LIG – Ligand binding template search (automatically generated templates)
DNA – DNA binding template search (automatically generated templates)
SIT – SiteSeer (“reverse template” method)

**Figure 4. ROC curves for SSM and SIT based on manual function assignment**
The ROC curves are plotted for SSM results and for SiteSeer (“reverse template”) results. The cut-off used by SSM is the Z-score of the hit, whereas it is the E-value that is of interest in SiteSeer (reverse templates). The ideal curve would rise vertically from the origin and then horizontally out to the right and would give an area under the curve of 1. The plot shows that the SSM Z-score appears to be a better measure for distinguishing between true and false positives than the SiteSeer (“reverse template”) measures.
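As a minimal sketch of how such a curve and its area can be computed, the functions below sweep a cut-off over scored hits that have been manually labelled as true or false positives. The function names and the input layout are assumptions for illustration, not the authors' code. Scores are treated as "higher is better", which suits a Z-score; for an E-value, where lower is better, one option is to pass the negated values.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points for a score sweep.

    Higher scores are assumed to indicate better hits. Hits with equal
    scores are added together so that ties yield a single point.
    """
    pos = sum(1 for label in labels if label)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one true and one false hit")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A score that separates true from false hits perfectly gives the ideal curve described in the caption, with an area of 1; a score that ranks at random gives an area near 0.5.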
