A question of size: the eukaryotic proteome and the problems in defining it

Paul M Harrison¹, Anuj Kumar, Ning Lang, Michael Snyder, Mark Gerstein

PMID: 11861898
PMCID: PMC101239
DOI: 10.1093/nar/30.5.1083

Paul M Harrison et al. Nucleic Acids Res. 2002 Mar 1;30(5):1083-90. doi: 10.1093/nar/30.5.1083.

Affiliation

¹ Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, PO Box 208114, New Haven, CT 06520-8114, USA.

Abstract

We discuss the problems in defining the extent of the proteomes for completely sequenced eukaryotic organisms (i.e. the total number of protein-coding sequences), focusing on yeast, worm, fly and human.

(i) Six years after completion of its genome sequence, the true size of the yeast proteome is still not defined. New small genes are still being discovered, and a large number of existing annotations are being called into question, with these questionable ORFs (qORFs) comprising up to one-fifth of the 'current' proteome. We discuss these in the context of an ideal genome-annotation strategy that considers the proteome as a rigorously defined subset of all possible coding sequences ('the orfome').

(ii) Despite the greater apparent complexity of the fly (more cells, more complex physiology, longer lifespan), the nematode worm appears to have more genes. To explain this, we compare the annotated proteomes of worm and fly, relating to both genome-annotation and genome evolution issues.

(iii) The unexpectedly small size of the gene complement estimated for the complete human genome provoked much public debate about the nature of biological complexity. However, in the first instance, for the human genome, the relationship between gene number and proteome size is far from simple. We survey the current estimates for the numbers of human genes and, from this, we estimate a range for the size of the human proteome. The determination of this is substantially hampered by the unknown extent of the cohort of pseudogenes ('dead' genes), in combination with the prevalence of alternative splicing.

(Further information relating to yeast is available at http://genecensus.org/yeast/orfome)

Figures

**Figure 1**
Number of yeast ORFs as a function of the minimum allowed ORF length. The total number of annotated ORFs in the yeast proteome is plotted against minimum ORF length (continuous blue line). A curve for known proteins or proteins confirmed by homology to a known protein is shown (red line), along with a green curve for the remaining ORFs that have no homology to a known protein (or are not otherwise characterized). Also displayed (dotted pink line) is the total number of additional ‘acceptable’ ORFs from the yeast genome, that is, ORFs with good codon adaptation (CAI ≥ 0.11) that do not overlap an annotated gene or other genomic feature. The plots are reverse-cumulative: each point counts the ORFs at least as long as the given minimum length, evaluated at intervals of 10 residues.
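The reverse-cumulative counting behind a plot like Figure 1 can be sketched as follows. This is a minimal illustration, assuming that each point is the number of ORFs whose length is at least the given minimum; the function name `orfs_at_least` and the example lengths are hypothetical and are not taken from the paper.

```python
from bisect import bisect_left


def orfs_at_least(lengths, max_len=None, step=10):
    """Map each minimum length (0, step, 2*step, ...) to the number of
    ORFs whose length in residues is at least that minimum."""
    ordered = sorted(lengths)
    if max_len is None:
        max_len = ordered[-1] if ordered else 0
    counts = {}
    for min_len in range(0, max_len + 1, step):
        # bisect_left returns how many ORFs are strictly shorter than min_len,
        # so the remainder are the ORFs that pass the length cutoff.
        counts[min_len] = len(ordered) - bisect_left(ordered, min_len)
    return counts


# Hypothetical ORF lengths in residues (illustrative values only).
example_lengths = [25, 40, 99, 100, 150, 320]
curve = orfs_at_least(example_lengths, max_len=110)
```

Applying the same function separately to each class of ORF (for example, homology-confirmed versus uncharacterized) would give one curve per class, in the spirit of the colored lines described in the caption.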

**Figure 2**
The variation in the size of the WormPep database over time. The size of the WormPep database is plotted against time for the period just before and after publication of the genome sequence. The dotted line indicates the approximate time of genome sequence completion.

**Figure 3**
Human gene numbers and proteome size. The bar chart shows the number of human genes (blue bars) from various estimates and the corresponding estimates of proteome size (orange bars). Gene numbers and proteome sizes are shown for the other sequenced eukaryotes with the same coloring. The size of the human proteome (N_CDS) can be estimated as N_CDS = f₁ × f₂ × N_genes, where f₁ is the proportion of gene structures that are not pseudogenic, and f₂ is the ratio of the total number of distinct protein-coding transcripts to the total number of genes (arising from alternative splicing). We assume 0.76 ≤ f₁ ≤ 0.91 (see text); a minimum value of f₂ = 1.16 can be derived from the alternative splicing survey of Mironov *et al.* (72), and a maximum value of f₂ = 2.22 can be calculated from the alternative splicing analysis in Lander *et al.* (56). Using these values and the wider range for N_genes given by Venter *et al.* (57), N_CDS ranges from approximately 20 300 to approximately 83 800. This range is clearly rather large, and is reminiscent of the range of estimates for N_genes that arose in the months and years prior to publication of the human genome.
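The arithmetic in the Figure 3 caption can be sketched in a few lines. This is an illustrative calculation only: the f₁ and f₂ bounds are the ones quoted in the caption, while the function name `proteome_size_range` and the gene-count range passed to it are hypothetical placeholders, not the values from Venter *et al.*

```python
def proteome_size_range(n_genes, f1=(0.76, 0.91), f2=(1.16, 2.22)):
    """Return (low, high) bounds on N_CDS = f1 * f2 * N_genes.

    n_genes: (low, high) range for the number of genes.
    f1: (low, high) fraction of gene structures that are not pseudogenic.
    f2: (low, high) ratio of distinct protein-coding transcripts to genes.
    """
    n_low, n_high = n_genes
    # The smallest estimate combines all lower bounds, the largest all upper bounds.
    low = f1[0] * f2[0] * n_low
    high = f1[1] * f2[1] * n_high
    return low, high


# Hypothetical gene-count range, chosen only to illustrate the calculation.
low, high = proteome_size_range((30_000, 40_000))
```

With the caption's bounds, the combined factor f₁ × f₂ runs from about 0.88 to about 2.02, so the proteome estimate can fall slightly below the gene count or reach roughly twice it, depending on which assumptions hold.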

References

1. Petrov,D.A. (2001) Evolution of genome size: new approaches to an old problem. Trends Genet., 17, 23–28.
2. Claverie,J.M. (2001) What if there are only 30,000 human genes? Science, 291, 1255–1257.
3. Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B., Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M. et al. (1996) Life with 6000 genes. Science, 274, 546, 563–567.
4. Davis,C.A., Grate,L., Spingola,M. and Ares,M.,Jr (2000) Test of intron predictions reveals novel splice sites, alternatively spliced mRNAs and new introns in meiotically regulated genes of yeast. Nucleic Acids Res., 28, 1700–1706.
5. Dujon,B. (1996) The yeast genome project: what did we learn? Trends Genet., 12, 263–270.
