Target space for structural genomics revisited
- PMID: 12117789
- DOI: 10.1093/bioinformatics/18.7.922
Abstract
Motivation: Structural genomics ultimately aims to determine structures for all proteins. Initially, however, experimentalists are likely to focus on globular proteins to achieve rapid basic coverage of protein sequence space. How many proteins will structural genomics have to target? How many proteins will be excluded because structural information is already available for them or because they are not globular? We have to answer these questions in the context of our target selection for the North-East Structural Genomics Consortium (NESG).
Results: We estimated that structural information is available for about 6-38% of all proteins: 6% if we require high accuracy in comparative modelling, 38% if we are satisfied with a rough idea of the fold. Excluding all regions that are not globular, we found that structural genomics may have to target about 48% of all proteins. This corresponded to a similar percentage of the residues in the entire proteomes (52%). We explored a number of different strategies for clustering protein space in order to find the number of families representing these 48% of structurally unknown proteins. For the subset of all entirely sequenced eukaryotes, we found over 18 000 fragment clusters, each of which may be a suitable target for structural genomics.
Availability: All data are available from the authors; most results are summarized at: http://cubic.bioc.columbia.edu/genomes/RES/2002_bioinformatics/
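The filtering logic described in the Results can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Protein` record, its two flags, and the toy entries are hypothetical, and the sketch works per protein, whereas the abstract excludes non-globular regions within proteins. It only shows how a target fraction follows from excluding proteins that already have structural information or are not globular.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    # Hypothetical record; the field names are assumptions for illustration.
    name: str
    has_structural_info: bool  # structural information already available
    is_globular: bool          # False if excluded as non-globular


def target_fraction(proteins):
    """Return the fraction of proteins that are structurally unknown and globular."""
    if not proteins:
        return 0.0
    targets = [
        p for p in proteins
        if not p.has_structural_info and p.is_globular
    ]
    return len(targets) / len(proteins)


# Toy data, invented for this example only (not taken from the paper).
toy = [
    Protein("p1", has_structural_info=True, is_globular=True),
    Protein("p2", has_structural_info=False, is_globular=True),
    Protein("p3", has_structural_info=False, is_globular=False),
    Protein("p4", has_structural_info=False, is_globular=True),
]

print(f"target fraction: {target_fraction(toy):.2f}")
```

How strictly `has_structural_info` is defined drives the result: per the abstract, requiring high-accuracy comparative modelling leaves far fewer proteins counted as structurally known than accepting a rough idea of the fold.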