Comparing genomes in terms of protein structure: surveys of a finite parts list
- PMID: 10357579
- DOI: 10.1111/j.1574-6976.1998.tb00371.x
Abstract
We give an overview of the emerging field of structural genomics, describing how genomes can be compared in terms of protein structure. As the number of genes in a genome and the total number of protein folds are both quite limited, these comparisons take the form of surveys of a finite parts list, similar in some respects to demographic censuses. Fold surveys have many similarities with other whole-genome characterizations, e.g., analyses of motifs or pathways. However, structure has a number of aspects that make it particularly suitable for comparing genomes, namely the way it allows for the precise definition of a basic protein module and the fact that it has a better defined relationship to sequence similarity than does protein function. An essential requirement for a structure survey is a library of folds, which groups the known structures into 'fold families.' This library can be built up automatically using a structure comparison program, and we describe how important objective statistical measures are for assessing similarities within the library and between the library and genome sequences. After building the library, one can use it to count the number of folds in genomes, expressing the results in the form of Venn diagrams and 'top-10' statistics for shared and common folds. Depending on the counting methodology employed, these statistics can reflect different aspects of the genome, such as the amount of internal duplication or gene expression. Previous analyses have shown that the common folds shared between very different microorganisms, i.e., in different kingdoms, have a remarkably similar structure, being composed of repeated strand-helix-strand super-secondary structure units. A major difficulty with this sort of 'fold-counting' is that only a small subset of the structures in a complete genome is currently known and this subset is prone to sampling bias.
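The fold-counting step described above can be illustrated with a minimal sketch. It assumes each genome has already been reduced to a list of fold assignments, one per matched gene. The genome names, fold labels, and function names are hypothetical and the data are invented for illustration only; this is not the procedure or the results of the review itself.

```python
from collections import Counter


def fold_survey(genome_folds):
    """Summarise fold usage across genomes.

    genome_folds maps a genome name to a list of fold labels,
    one label per gene that matched a fold in the library.
    Returns (per-genome counts, folds shared by all genomes,
    folds unique to each genome): the regions of a Venn diagram.
    """
    counts = {g: Counter(folds) for g, folds in genome_folds.items()}
    fold_sets = {g: set(c) for g, c in counts.items()}
    shared = set.intersection(*fold_sets.values())
    unique = {}
    for g, folds in fold_sets.items():
        others = [s for h, s in fold_sets.items() if h != g]
        unique[g] = folds - set().union(*others)
    return counts, shared, unique


def top_shared(counts, shared, n=10):
    """Rank the shared folds by total occurrences over all genomes."""
    totals = {f: sum(c[f] for c in counts.values()) for f in shared}
    # Sort by descending count, then by label so ties are deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Invented toy data: three hypothetical genomes, folds labelled A to E.
toy = {
    "genome_1": ["A", "A", "B", "C"],
    "genome_2": ["A", "B", "B", "B", "D"],
    "genome_3": ["A", "B", "E"],
}
counts, shared, unique = fold_survey(toy)
ranking = top_shared(counts, shared)
```

Counting every occurrence of a fold, as done here, is sensitive to internal duplication within a genome; counting each fold once per genome (using only the sets) would give a different ranking, which is one example of how the counting methodology changes what the statistics reflect.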
One way of overcoming biases is through structure prediction, which can be applied uniformly and comprehensively to a whole genome. Various investigators have, in fact, already applied many of the existing techniques for predicting secondary structure and transmembrane (TM) helices to the recently sequenced genomes. The results have been consistent: microbial genomes have similar fractions of strands and helices even though they have significantly different amino acid compositions. The fraction of membrane proteins with a given number of TM helices falls off rapidly as the number of TM elements increases, approximately according to a Zipf law. This latter finding indicates that there is no preference for the highly studied 7-TM proteins in microbial genomes. Continuously updated tables and further information pertinent to this review are available over the web at http://bioinfo.mbb.yale.edu/genome.
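One way to check for the Zipf-like falloff mentioned above is a straight-line fit on log-log axes. The sketch below assumes the relationship takes the power-law form count(n) ≈ a · n^(−b), where n is the number of TM helices; the abstract does not give the exact form or exponent, so both are assumptions. The function name is hypothetical and the input counts are synthetic, generated from an exact power law rather than taken from any genome.

```python
import math


def fit_power_law(ns, counts):
    """Least-squares fit of log(count) = log(a) - b * log(n).

    ns     : numbers of TM helices (positive integers)
    counts : number of proteins observed with that many helices (> 0)
    Returns (a, b) for the assumed model count = a * n ** (-b).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(c) for c in counts]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic counts drawn from an exact power law (a = 1000, b = 2),
# used only to show that the fit recovers the parameters.
helix_numbers = list(range(1, 11))
synthetic_counts = [1000.0 / n ** 2 for n in helix_numbers]
a, b = fit_power_law(helix_numbers, synthetic_counts)
```

Under a smooth falloff of this kind, no particular helix number stands out; an enrichment of 7-TM proteins would appear as a point lying well above the fitted line at n = 7.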