Review
Proteomics. 2004 Jun;4(6):1712-26.
doi: 10.1002/pmic.200300700.

Has the yo-yo stopped? An assessment of human protein-coding gene number

Christopher Southan. Proteomics. 2004 Jun.

Abstract

Since the identification of approximately 25,000 proteins from the draft human genome assembly in 2001, estimates of the total have oscillated between 30,000 and 70,000. The recently announced genome closure has not generated a consensus gene count, despite this being a key parameter for many areas of biology, including drug target discovery and characterization of the human proteome. Contrary to earlier predictions of constitutive under-detection for eukaryotic genes, the latest model organism updates have produced minor increases in the worm, but fly and yeast gene numbers have decreased. The postdraft, precompletion interval has produced large increases in human transcript coverage, continuous improvements in genome assembly, and refinements in automated genomic annotation. Notably, these enhancements have resulted in an Ensembl human protein-coding gene number of 22,184, a decrease of 1,862 since the first release. Longitudinal database surveys indicate that redundancy-reduced human mRNA and protein collections are flattening out at approximately 28,000, although Ensembl maps approximately 20,000 known sequences. Observations suggest that high-throughput cloning projects are predominantly extending known genes or sampling new splice forms, and that novel protein discovery has slowed to a trickle. The hypothesis that substantial numbers of short proteins remain experimentally and computationally undetected in mammalian genomes is supported neither by sequence data nor by the extensive homology between mouse and human proteins. Aggregating the independent annotations for complete transcripts from seven completed human chromosomes extrapolates to approximately 25,000 genes. The inclusion of partial putative genes would increase this to above 30,000, but recent data suggest these represent predominantly nonprotein-coding transcripts. Mass spectrometry-based proteomics has already verified more than 10% of human genes but has not identified significant numbers of unpredicted proteins. The available data are thus converging to a basal protein-coding gene number well below 30,000, which could even be as low as 25,000.
