Nucleic Acids Res. 2023 Jan 6;51(D1):D942-D949.
doi: 10.1093/nar/gkac1071.

GENCODE: reference annotation for the human and mouse genomes in 2023

Adam Frankish et al. Nucleic Acids Res.

Abstract

GENCODE produces high quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

Figures

Figure 1.
Upstream open reading frames in MID1. GENCODE 41 annotation includes five distinct transcript start site regions within midline 1 (MID1), a TRIM-family protein-coding gene. Five representative transcripts are shown here; additional transcriptional complexity has been omitted for clarity. Three replicated Ribo-seq ORFs are located on three transcripts: cXriboseqorf7, cXriboseqorf6 and cXriboseqorf5 found respectively on ENST00000380787, ENST00000317552 (the MANE Select transcript) and ENST00000317552. Each is a translated uORF, with three distinct first exon ORF portions being linked to a shared second exon ORF portion. The shared ORF portion has a positive PhyloCSF score, indicating that it has evolved as a protein-coding sequence. However, PhyloCSF only supports the protein-coding potential of one of the three alternative first exons, cXriboseqorf5 on ENST00000317552. A multispecies protein alignment (inset) finds that cXriboseqorf5 has intact orthologs across tetrapods with accompanying transcript support; beyond the five representative species shown, the ORF appears potentially conserved across vertebrates. This ORF has thus been annotated as the new protein-coding gene ENSG00000291314. In contrast, the first exon ORF portions of cXriboseqorf7 and cXriboseqorf6 present equivocal evolutionary signatures, lacking PhyloCSF support to indicate protein-level function. Nonetheless, cXriboseqorf7 at least is conserved as an ORF in mammals as well as reptiles and avians, and if this ORF is not protein-coding it may turn out to have a regulatory function that is evolving under a different mode of selection. This may also be true of cXriboseqorf6, and in fact, we do not rule out the possibility that both cXriboseqorf7 and cXriboseqorf6 encode functional proteins in spite of the lack of PhyloCSF support.
Figure 2.
Screenshot from the Ensembl genome browser of the transcript view page for the gene CASP12 which contains transcripts annotated as Protein coding LoF. The status of the gene as having both functional and non-functional alleles is indicated by the dark blue box. The annotation of nonsense-mediated decay transcripts with fixed premature stop codons is indicated by the light blue box and the locations of the Protein coding LoF biotype flag are highlighted by the red box.

