Review
Nature. 2023 Oct;622(7981):41-47.
doi: 10.1038/s41586-023-06490-x. Epub 2023 Oct 4.

The status of the human gene catalogue

Paulo Amaral et al. Nature. 2023 Oct.

Abstract

Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Besides the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.

Figures

Figure 1:
A major challenge for gene annotation is how to capture the diversity of gene products and functions. For example, although the vast majority of protein-coding genes occur on distinct transcripts, a small number of bi-cistronic transcripts encode two distinct open reading frames on the same transcript. Similarly, introns within protein-coding genes may host noncoding RNAs, including miRNAs, snoRNAs or lncRNAs, which may regulate the transcriptional activity of the locus, or may have catalytic roles unrelated to the main protein product. Alternative splicing of transcripts may give rise to proteins that enhance or inhibit each other. Transcripts that are truncated and cannot produce functional proteins are targeted for nonsense-mediated decay (NMD). These products, together with ubiquitinated proteins (Ub) or unwanted intronic material, are rapidly recycled by cellular lysosomes. Other seemingly nonproductive transcripts may be repurposed as functional ncRNAs.
Figure 2:
Predicted and observed human gene counts over time. Counts of protein-coding, pseudogene, and non-coding genes are shown. Time points before 2003 and after 2023 (dashed lines) represent an average of predictions from the literature and extrapolations from this perspective article, respectively. Time points from 2003 to 2023 are based on 20 iterations of the NCBI RefSeq annotation of the human reference genome, including both curated and predicted genes.
