Mol Cell Proteomics. 2023 Sep;22(9):100631.
doi: 10.1016/j.mcpro.2023.100631. Epub 2023 Aug 11.

What Can Ribo-Seq, Immunopeptidomics, and Proteomics Tell Us About the Noncanonical Proteome?

John R Prensner et al. Mol Cell Proteomics. 2023 Sep.

Abstract

Ribosome profiling (Ribo-Seq) has proven transformative for our understanding of the human genome and proteome by illuminating thousands of noncanonical sites of ribosome translation outside the currently annotated coding sequences (CDSs). A conservative estimate suggests that at least 7000 noncanonical ORFs are translated, which, at first glance, has the potential to expand the number of human protein CDSs by 30%, from ∼19,500 annotated CDSs to over 26,000 annotated CDSs. Yet, additional scrutiny of these ORFs has raised numerous questions about what fraction of them truly produce a protein product and what fraction of those can be understood as proteins according to the conventional understanding of the term. A further complication is that published estimates of noncanonical ORFs vary widely, by around 30-fold, from several thousand to several hundred thousand. The summation of this research has left the genomics and proteomics communities excited by the prospect of new coding regions in the human genome but searching for guidance on how to proceed. Here, we discuss the current state of noncanonical ORF research, databases, and interpretation, focusing on how to assess whether a given ORF can be said to be "protein coding."

Keywords: Ribo-Seq; immunopeptidomics; mass spectrometry; microprotein; noncanonical ORF.

Conflict of interest statement

The authors declare no competing interests.

Figures

Graphical abstract
Fig. 1
An overview of noncanonical ORF types and detection methods. A, a schematic illustrating the standardized names of noncanonical ORF types, their relationship to known mRNAs, and current estimations of their abundance. B, generalized workflows for ribosome profiling (Ribo-Seq), tryptic proteome mass spectrometry, and human leukocyte antigen (HLA) immunopeptidomics. The schematic indicates general properties of sample preparation for these data types. CDS, coding sequence; dORF, downstream ORF; doORF, downstream overlapping ORF; intORF, internal ORF; lncRNA-ORF, ORF residing within an annotated lncRNA; uORF, upstream ORF; uoORF, upstream overlapping ORF.
Fig. 2
Quality metrics of Ribo-Seq and stringency of ORF calling. A, an illustration showing codon periodicity as a central metric of Ribo-Seq library generation. Three illustrations indicate high-quality, borderline, and poor-quality Ribo-Seq libraries. B, an illustration representing high-stringency and low-stringency ORF calling. In the top case, a small number of reads map to the 3′UTR of an annotated mRNA, and only two-thirds of those 3′UTR reads support the same reading frame of a potential dORF nomination. In the middle and bottom cases, a potential intORF has varying read support evidence. The middle case shows clear evidence of an intORF by a large increase in reads mapping to the +2 reading frame midway through the CDS. In the bottom case, there is a smaller change in the reads mapping to the +2 reading frame. C, use of ribosome-stalling drug treatments to clarify translational start sites. Cultured cells are treated with homoharringtonine or lactimidomycin to stall ribosomes at the main translational start site of a given ORF, leading to a clearer resolution of the specific start codon. CDS, coding sequence; dORF, downstream ORF; intORF, internal ORF.
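The codon periodicity metric described for panel A can be illustrated with a minimal Python sketch. It assumes that P-site positions have already been expressed as nucleotide offsets from the first base of an annotated start codon; that input format and the function name `frame_fractions` are hypothetical and are not taken from the paper or from any ORF caller.

```python
from collections import Counter


def frame_fractions(psite_offsets):
    """Return the fraction of P-sites in reading frames 0, +1, and +2.

    psite_offsets: iterable of integer P-site positions, measured in
    nucleotides from the first base of the start codon (assumed input).
    """
    counts = Counter(offset % 3 for offset in psite_offsets)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no P-site positions supplied")
    return tuple(counts.get(frame, 0) / total for frame in range(3))


# Toy data (invented): 8 of 10 footprints fall in frame 0.
toy_offsets = [0, 3, 6, 9, 12, 15, 18, 21, 4, 8]
fractions = frame_fractions(toy_offsets)
print(fractions)
```

A library with strong periodicity concentrates most P-sites in a single frame, whereas a poor-quality library spreads them more evenly across the three frames. No cutoff is implied here, since the caption does not state one.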
Fig. 3
ORF callers have different specialties and variable performance. A, stacked bar plot displaying all detected ORF categories per ORF caller. For each, the percentage of unique ORFs shared between at least one, three, or six replicates is shown. Please note that these are relative contributions to the total number of ORFs. The absolute numbers of ORF identifications can be inferred from C. B, density plots displaying the distribution of ORF lengths in nucleotides (excluding the stop codon) for unique ORFs shared between at least one, three, or six replicates. C, line graphs showing the numbers of unique ORFs detected by each tool shared between at least one, three, or six replicates. The x-axis denotes the percentage of overlap used to consider two ORFs being similar or not, with 100% overlap meaning that the detected ORF was fully identical between [x] number of replicates. Please note that the total numbers of ORFs detected per algorithm (y-axis) can differ by an order of magnitude. These numbers are given for each line, with numbers reflecting the total ORFs with 100% similarity between replicates (i.e., the end of each curve). D, genomic view of a short upstream ORF (uORF) in the SPTBN1 gene indicating that ORF callers have variable affinity for certain types of ORFs. The top two tracks show the ribosomal P-site positions derived from the sequenced ribosome footprints, as processed independently from the sequencing data by the deterministic ORF caller ORFquant (top; red shading) and the probabilistic ORF caller PRICE (bottom; blue shading). The differently colored P-site bars indicate different reading frames (0, +1, and +2) on the same transcript, with bars in the same color indicating a shared in-frame codon movement by the ribosome. For this visualization, newly found ORF variations of the annotated CDS that could be assigned to predicted noncoding RNA isoforms (e.g., transcript biotype: “processed_transcript”) but matched the CDS of SPTBN1 are not displayed. E, genomic view of a near-cognate start codon ORF in TUG1. Image and track details as in (D) above. CDS, coding sequence.
Fig. 4
An analysis of major noncanonical ORF databases. A, each dot reflects a dataset, and the y-axis uses a log10 scale to show the number of included ORFs that are ≥16 amino acids long and contain an AUG start codon. The GENCODE catalog reflects the summation of the datasets from Ji et al. (19), Calviello et al. (61), Raj et al. (20), van Heesch et al. (9), Martinez et al. (21), Chen et al. (18), and Gaertner et al. (11), as described (16). B, the number of ORFs per dataset compared with the number of samples profiled by Ribo-Seq. C, the number of ORFs per dataset compared with the number of unique cell types profiled by Ribo-Seq. D, the ratio of the number of ORFs per cell type compared with the number of ORFs per number of samples for each dataset. E, a bubble plot integrating the number of samples, number of different cell or tissue types, and the number of noncanonical ORFs found in each dataset.
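The inclusion criteria named for panel A (ORFs of at least 16 amino acids with an AUG start codon) can be sketched as a simple filter. This is an illustrative Python sketch only: the function name `passes_fig4_filter` is hypothetical, and it assumes each ORF is supplied as a nucleotide sequence that includes the start codon, excludes the stop codon, and has a length counted with the initiator methionine included. The caption does not specify these conventions.

```python
def passes_fig4_filter(orf_nt_seq, min_aa=16):
    """Return True if an ORF starts with AUG and encodes >= min_aa residues.

    orf_nt_seq: ORF nucleotide sequence (DNA or RNA alphabet), assumed to
    include the start codon and exclude the stop codon.
    """
    seq = orf_nt_seq.upper().replace("T", "U")  # normalize DNA to RNA
    if len(seq) % 3 != 0:
        return False  # not a whole number of codons
    n_codons = len(seq) // 3  # initiator Met counted (assumption)
    return seq.startswith("AUG") and n_codons >= min_aa


# Invented toy sequences, not taken from any dataset.
aug_16aa = "ATG" + "GCT" * 15   # AUG start, 16 codons
cug_16aa = "CTG" + "GCT" * 15   # near-cognate start, 16 codons
aug_15aa = "ATG" + "GCT" * 14   # AUG start, 15 codons

print(passes_fig4_filter(aug_16aa))
print(passes_fig4_filter(cug_16aa))
print(passes_fig4_filter(aug_15aa))
```

Applying one filter to every dataset makes the counts comparable, because catalogs differ in whether they report near-cognate start codons or very short ORFs.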

References

    1. Aebersold R., Agar J.N., Amster I.J., Baker M.S., Bertozzi C.R., Boja E.S., et al. How many human proteoforms are there? Nat. Chem. Biol. 2018;14:206–214.
    2. Tress M.L., Abascal F., Valencia A. Alternative splicing may not be the key to proteome complexity. Trends Biochem. Sci. 2017;42:98–110.
    3. Blencowe B.J. The relationship between alternative splicing and proteomic complexity. Trends Biochem. Sci. 2017;42:407–408.
    4. Sinitcyn P., Richards A.L., Weatheritt R.J., Brademan D.R., Marx H., Shishkova E., et al. Global detection of human variants and isoforms by deep proteome sequencing. Nat. Biotechnol. 2023. doi: 10.1038/s41587-023-01714-x.
    5. Frankish A., Carbonell-Sala S., Diekhans M., Jungreis I., Loveland J.E., Mudge J.M., et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 2023;51:D942–D949.