Comparing function and structure between entire proteomes

J Liu¹, B Rost

PMID: 11567088
PMCID: PMC2374214
DOI: 10.1110/ps.10101

Protein Sci. 2001 Oct;10(10):1970-9.

Affiliation

¹ CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA.


Abstract

More than 30 organisms have been sequenced entirely. Here, we applied a variety of simple bioinformatics tools to analyze 29 proteomes for representatives from all three kingdoms: eukaryotes, prokaryotes, and archaebacteria. We confirmed that eukaryotes have relatively more long proteins than prokaryotes and archaea, and that the overall amino acid composition is similar among the three. We predicted that approximately 15%-30% of all proteins contained transmembrane helices. We could not find a correlation between the content of membrane proteins and the complexity of the organism. In particular, we did not find significantly higher percentages of helical membrane proteins in eukaryotes than in prokaryotes or archaea. However, we found more proteins with seven transmembrane helices in eukaryotes and more with six and 12 transmembrane helices in prokaryotes. We found twice as many coiled-coil proteins in eukaryotes (10%) as in prokaryotes and archaea (4%-5%), and we predicted approximately 15%-25% of all proteins to be secreted by most eukaryotes and prokaryotes. Every tenth protein had no known homolog in current databases, and 30%-40% of the proteins fell into structural families with >100 members. A classification by cellular function verified that eukaryotes have a higher proportion of proteins for communication with the environment. Finally, we found at least one homolog of experimentally known structure for approximately 20%-45% of all proteins; the regions with structural homology covered 20%-30% of all residues. These numbers may or may not suggest that there are 1200-2600 folds in the universe of protein structures. All predictions are available at http://cubic.bioc.columbia.edu/genomes.

Figures

**Fig. 1.**
(A) The distribution of length of ORFs (in bins of 10 residues) in 29 genomes (cumulative values in the insets). The extreme value distribution fit is shown in bold. The abbreviations for the organisms are given in Table 1. (B) Amino acid composition for six representative genomes: The letter height is proportional to the observed composition of the respective amino acid (one-letter code). (C) Percentage of membrane proteins, coiled-coil proteins, and proteins with signal peptides in 29 genomes. (D) Less than half of the predicted membrane proteins could have been identified through homology with known membrane proteins.

**Fig. 2.**
(A) Fraction of membrane proteins with different numbers of predicted transmembrane segments. White bars: proteins with topology "in." Black bars: proteins with topology "out." (B) Contour plot showing the relation between ORF length (in bins of 10 residues) and the number of predicted membrane helices for two representative organisms.

**Fig. 3.**
Functional classification of genomes. (A) Super-class distribution for the genomes: We grouped the 14 EUCLID classes into Energy, Communication, and Information super-classes. (B) Distribution of 13-category classification for selected genomes, including those without functional classification.

**Fig. 4.**
For each protein in all proteomes, we counted the number of proteins found in the respective family at a PSI-BLAST E value < 10⁻³. The graphs show the cumulative percentages of proteins found in families of particular sizes. For example, ∼5%–10% of all ORFs were orphans, i.e., had no homolog in current databases; 30%–40% were in families with > 100 members.
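The cumulative percentages described for Fig. 4 can be illustrated with a minimal sketch. It assumes the per-protein family sizes (the number of homologs found below the E-value threshold) have already been computed by a separate search step; the function name `cumulative_family_size_percentages` and the sample numbers are hypothetical and are not taken from the paper.

```python
def cumulative_family_size_percentages(family_sizes, thresholds):
    """Return {threshold: percent of proteins whose family size is <= threshold}.

    family_sizes: one integer per protein, the number of homologs found for it
                  (0 means an orphan). Assumed to be precomputed elsewhere.
    thresholds:   family sizes at which to report the cumulative percentage.
    """
    if not family_sizes:
        raise ValueError("family_sizes must not be empty")
    total = len(family_sizes)
    return {
        t: 100.0 * sum(1 for size in family_sizes if size <= t) / total
        for t in thresholds
    }


# Made-up family sizes for ten proteins (illustration only, not real data).
sizes = [0, 0, 3, 12, 150, 40, 7, 250, 1, 99]
result = cumulative_family_size_percentages(sizes, [0, 10, 100])

orphan_percent = result[0]                # proteins with no homolog
large_family_percent = 100.0 - result[100]  # proteins in families with > 100 members
```

In this toy input, the value at threshold 0 is the orphan share, and the complement of the value at threshold 100 is the share of proteins in families with more than 100 members, which are the two quantities the caption highlights.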

**Fig. 5.**
Structural annotation of genomes. (A) 25%–40% of all ORFs were sequence similar to at least one PDB protein. (B) The total percentage of residues that could thus be homology modeled amounted to ∼20%–30% of all residues.
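The residue-level coverage in panel (B) differs from the per-protein count in panel (A) because only the aligned regions contribute. The sketch below shows one way such a percentage could be computed, assuming each protein comes with a list of 1-based inclusive aligned regions; the data layout and the name `covered_residue_percentage` are assumptions for illustration, not the authors' procedure.

```python
def covered_residue_percentage(proteins):
    """Percent of all residues that fall inside at least one aligned region.

    proteins: list of (length, regions) pairs, where regions is a list of
              (start, end) tuples, 1-based and inclusive. Overlapping regions
              are counted once.
    """
    total = 0
    covered = 0
    for length, regions in proteins:
        total += length
        last_end = 0  # last residue already counted for this protein
        for start, end in sorted(regions):
            start = max(start, last_end + 1)  # skip residues counted before
            end = min(end, length)            # clip to the protein length
            if end >= start:
                covered += end - start + 1
                last_end = end
    if total == 0:
        raise ValueError("no residues to evaluate")
    return 100.0 * covered / total


# Made-up example: three proteins, one without any structural homolog.
example = [
    (100, [(1, 30), (21, 50)]),  # overlapping hits cover residues 1-50
    (200, []),                   # no hit
    (100, [(90, 120)]),          # hit clipped to residues 90-100
]
coverage = covered_residue_percentage(example)
```

Here two of three proteins have a hit, yet far fewer than two thirds of the residues are covered, which mirrors the gap between the two panels.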


