PLoS Comput Biol. 2024 Nov 20;20(11):e1012543.
doi: 10.1371/journal.pcbi.1012543. eCollection 2024 Nov.

Upstream open reading frames may contain hundreds of novel human exons

Hyun Joo Ji et al. PLoS Comput Biol. 2024.

Abstract

Several recent studies have presented evidence that the human gene catalogue should be expanded to include thousands of short open reading frames (ORFs) appearing upstream or downstream of existing protein-coding genes, each of which might create an additional bicistronic transcript in humans. Here we explore an alternative hypothesis that would explain the translational and evolutionary evidence for these upstream ORFs without the need to create novel genes or bicistronic transcripts. We examined 2,199 upstream ORFs that have been proposed as high-quality candidates for novel genes, to determine if they could instead represent protein-coding exons that can be added to existing genes. We checked for the conservation of these ORFs in four recently sequenced, high-quality human genomes, and found a large majority (87.8%) to be conserved in all four, as expected. We then looked for splicing evidence that would connect each upstream ORF to the downstream protein-coding gene at the same locus, thus creating a novel splicing variant using the upstream ORF as its first exon. These protein-coding exon candidates were further evaluated using protein structure predictions of the protein sequences that included the proposed new exons. We determined that 541 out of 2,199 upstream ORFs have strong evidence that they can form protein-coding exons that are part of an existing gene, and that the resulting protein is predicted to have similar or better structural quality than the currently annotated isoform.

Conflict of interest statement

The authors have declared that no competing interest exists.

Figures

Fig 1. Each upstream ORF (uORF) was aligned to multiple human genomes, using both the genomic sequence and the annotated transcripts.
The transcriptome alignment handled cases where a uORF spanned two exons in the 5’UTR of an annotated transcript.
Fig 2. Construction of uORF-connected transcripts from a uORF and a downstream protein-coding transcript.
The original protein-coding sequence is shown in green rectangles. For uORF-connected transcript #1, a splice junction (red bars) found in the GTEx collection of RNA-seq data is used to link the uORF to the second exon of the downstream transcript. For uORF-connected transcript #2, a splice donor (SD) site predicted by Splam (blue bars) is paired with an annotated splice acceptor (SA) site in a MANE transcript. The novel protein sequences are shown in red rectangles.
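The construction described in Fig 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `connect_uorf`, the half-open forward-strand coordinates, and the toy exon positions are all hypothetical and chosen only for the example.

```python
from typing import List, Tuple

# Assumption: exons are half-open intervals [start, end) on the forward strand.
Exon = Tuple[int, int]


def connect_uorf(uorf_start: int, donor: int, acceptor: int,
                 exons: List[Exon]) -> List[Exon]:
    """Build the exon chain of a uORF-connected transcript.

    uorf_start: first base of the uORF start codon (hypothetical input).
    donor:      exclusive end of the new first exon (splice donor site).
    acceptor:   start of an annotated downstream exon (splice acceptor site).
    exons:      exons of the downstream protein-coding transcript, in order.
    """
    if not (uorf_start < donor < acceptor):
        raise ValueError("expected uorf_start < donor < acceptor")
    for i, (start, _end) in enumerate(exons):
        if start == acceptor:
            # New first exon from the uORF, followed by the annotated exons
            # from the acceptor exon onward.
            return [(uorf_start, donor)] + exons[i:]
    raise ValueError("acceptor does not match an annotated exon start")


# Toy downstream transcript with three exons (made-up coordinates).
downstream = [(1000, 1100), (2000, 2150), (3000, 3200)]

# Link the uORF to the second exon, as in uORF-connected transcript #1.
linked = connect_uorf(uorf_start=500, donor=590, acceptor=2000,
                      exons=downstream)
print(linked)
```

The same function covers both cases in the figure, since it does not care whether the donor and acceptor pair came from an observed junction or from a prediction. A fuller version would also need to confirm that the uORF reading frame continues into the downstream coding sequence, which this sketch does not check.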
Fig 3. Novel isoforms were constructed using direct and predicted splicing evidence.
Splice junctions seen in ~10,000 GTEx RNA-seq samples, together with Splam predictions, yielded a total of 4,185 uORF-connected transcripts from 1,035 uORFs. Of these, 2,282 were supported by GTEx data and 1,903 had Splam support but no GTEx evidence.
Fig 4. Conserved uORFs shared between GRCh38 and all subsets of four different genomes.
The innermost region shows that there were 1,931 uORFs conserved in all five genomes.
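The conservation count in Fig 4 amounts to a set intersection across genomes. The sketch below is only an illustration of that idea: the genome labels, the uORF identifiers, and the helper name `conserved_in_all` are hypothetical, and only the totals 1,931 and 2,199 come from the text.

```python
from typing import Dict, Set


def conserved_in_all(hits: Dict[str, Set[str]]) -> Set[str]:
    """Return the uORF IDs that were found in every genome in `hits`."""
    return set.intersection(*hits.values())


# Toy data: hypothetical genome labels and uORF IDs.
hits = {
    "GRCh38":   {"uorf_1", "uorf_2", "uorf_3", "uorf_4"},
    "genome_A": {"uorf_1", "uorf_2", "uorf_3"},
    "genome_B": {"uorf_1", "uorf_2", "uorf_4"},
}
shared = conserved_in_all(hits)
print(sorted(shared))

# Fraction reported in the text: 1,931 of 2,199 uORFs conserved everywhere.
percent = round(100 * 1931 / 2199, 1)
print(percent)
```

With the reported counts, the fraction works out to the 87.8% quoted in the abstract.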
Fig 5. Examples of structure changes in novel protein variants identified in this study.
(A) and (B): alpha helix elongation at the SLC28A1 gene locus, where (A) shows the reference protein, ENST00000398637.10, and (B) shows the novel isoform, uorft_2119. The average pLDDT increase from A to B was 2.96. (C) and (D): straightening at the TRAK2 gene locus, where (C) shows the reference protein, ENST00000430254.1, and (D) shows the novel isoform, uorft_441. The average pLDDT increase from C to D was 3.41. (E) and (F): tightening of the structure at the TBRG4 gene locus, where (E) shows the reference protein, ENST00000395655.8, and (F) shows the novel isoform, uorft_1435. The average pLDDT increase from E to F was 4.29. The main structural changes are highlighted by black boxes for each pair of structures. Red spheres represent the N-terminus of each protein.
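The pLDDT differences quoted in Fig 5 compare the mean per-residue confidence of two predicted structures. A minimal sketch of that comparison follows, assuming the per-residue scores have already been extracted into plain lists; the function names and the toy scores are hypothetical and do not reproduce the values in the figure.

```python
from statistics import mean
from typing import Sequence


def mean_plddt(scores: Sequence[float]) -> float:
    """Average per-residue pLDDT (0-100 scale) over a whole protein."""
    return mean(scores)


def plddt_gain(reference: Sequence[float], novel: Sequence[float]) -> float:
    """Mean pLDDT of the novel isoform minus that of the reference protein.

    The two proteins may differ in length, since the novel isoform carries
    extra N-terminal residues; each mean is taken over its own residues.
    """
    return mean_plddt(novel) - mean_plddt(reference)


# Toy per-residue scores (made up for illustration).
reference_scores = [80.0, 90.0]
novel_scores = [85.0, 90.0, 92.0]

gain = plddt_gain(reference_scores, novel_scores)
print(gain)
```

A positive value means the predicted structure of the novel isoform has higher average confidence than the reference, which is the sense in which the text describes "similar or better structural quality".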
