[Preprint]. 2023 May 18:2023.05.16.541049.
doi: 10.1101/2023.05.16.541049.

What can Ribo-seq and proteomics tell us about the non-canonical proteome?

John R Prensner et al. bioRxiv. 2023.

Abstract

Ribosome profiling (Ribo-seq) has proven transformative for our understanding of the human genome and proteome by illuminating thousands of non-canonical sites of ribosome translation outside of the currently annotated coding sequences (CDSs). A conservative estimate suggests that at least 7,000 non-canonical open reading frames (ORFs) are translated, which, at first glance, has the potential to expand the number of human protein-coding sequences by 30%, from ∼19,500 annotated CDSs to over 26,000. Yet, additional scrutiny of these ORFs has raised numerous questions about what fraction of them truly produce a protein product and what fraction of those can be understood as proteins according to the conventional understanding of the term. Adding further complication is the fact that published estimates of non-canonical ORFs vary widely, by around 30-fold, from several thousand to several hundred thousand. The summation of this research has left the genomics and proteomics communities excited by the prospect of new coding regions in the human genome but searching for guidance on how to proceed. Here, we discuss the current state of non-canonical ORF research, databases, and interpretation, focusing on how to assess whether a given ORF can be said to be "protein-coding".

In brief: The human genome encodes thousands of non-canonical open reading frames (ORFs) in addition to protein-coding genes. Because the field is nascent, many questions remain regarding non-canonical ORFs. How many exist? Do they encode proteins? What level of evidence is needed for their verification? Central to these debates has been the advent of ribosome profiling (Ribo-seq) as a method to discern genome-wide ribosome occupancy, and immunopeptidomics as a method to detect peptides that are processed and presented by MHC molecules and not observed in traditional proteomics experiments. This article provides a synthesis of the current state of non-canonical ORF research and proposes standards for their future investigation and reporting.

Highlights: Combined use of Ribo-seq and proteomics-based methods enables optimal confidence in detecting non-canonical ORFs and their protein products. Ribo-seq can provide more sensitive detection of non-canonical ORFs, but data quality and analytical pipelines will impact results. Non-canonical ORF catalogs are diverse and span both high-stringency and low-stringency ORF nominations. A framework for standardized non-canonical ORF evidence will advance the research field.

Conflict of interest statement

Declaration of interests

The authors declare no competing interests.

Figures

Figure 1: An overview of non-canonical ORF types and detection methods
A) A schematic illustrating the standardized names of non-canonical ORF types, their relationship to known mRNAs, and current estimations of their abundance. CDS, protein-coding sequence; uORF, upstream open reading frame (ORF); uoORF, upstream overlapping ORF; intORF, internal ORF; doORF, downstream overlapping ORF; dORF, downstream ORF; lncRNA-ORF, ORF residing within an annotated lncRNA. B) Generalized workflows for ribosome profiling (Ribo-seq), tryptic whole cell mass spectrometry, and HLA immunopeptidomics. The schematic indicates general properties of sample preparation for these data types.
Figure 2: Quality metrics of Ribo-seq and stringency of ORF calling
A) An illustration showing codon periodicity as a central metric of Ribo-seq library generation. Three illustrations indicate high-quality, borderline, and poor-quality Ribo-seq libraries. B) An illustration representing high-stringency and low-stringency ORF calling. In the top case, a small number of reads map to the 3’UTR of an annotated mRNA, and only two-thirds of those 3’UTR reads support the same reading frame of a potential dORF nomination. In the middle and bottom cases, a potential intORF has varying read support evidence. The middle case shows clear evidence of an intORF by a large increase in reads mapping to the +2 reading frame midway through the CDS. In the bottom case, there is a smaller change in the reads mapping to the +2 reading frame. C) Use of ribosome-stalling drug treatments to clarify translational start sites. Cultured cells are treated with homoharringtonine or lactimidomycin to stall ribosomes at the main translational start site of a given ORF, leading to a clearer resolution of the specific start codon.
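As a rough illustration of the codon periodicity metric in panel A, the following minimal sketch computes the fraction of ribosomal P-sites falling in each reading frame relative to a start codon. The function name `frame_fractions`, the coordinates, and the input layout are hypothetical and chosen only for illustration; they are not taken from the article or from any specific ORF caller. A strongly periodic library would be expected to place most P-sites in frame 0.

```python
from collections import Counter


def frame_fractions(psite_positions, cds_start):
    """Return the fraction of P-sites in frames 0, +1 and +2.

    psite_positions: transcript coordinates of inferred P-sites (hypothetical input).
    cds_start: transcript coordinate of the first nucleotide of the start codon.
    """
    # The reading frame of a P-site is its offset from the start codon modulo 3.
    counts = Counter((pos - cds_start) % 3 for pos in psite_positions)
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts[frame] / total for frame in range(3)]


# Made-up example: 8 of 10 P-sites fall in frame 0.
psites = [100, 103, 106, 109, 110, 112, 115, 118, 121, 123]
fractions = frame_fractions(psites, cds_start=100)
print(fractions)
```

The same per-frame tally, restricted to a 3’UTR or to a window inside a CDS, is one simple way to picture the read-support differences between the high-stringency and low-stringency cases in panel B; real ORF callers use more elaborate statistical models.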
Figure 3: ORF callers have different specialties and variable performance
A) Stacked bar plot displaying all detected ORF categories per ORF caller. For each, the percentage of unique ORFs shared between at least one, three, or six replicates is shown. Please note that these are relative contributions to the total number of ORFs. The absolute numbers of ORF identifications can be inferred from Figure 3C. B) Density plots displaying the distribution of ORF lengths in nucleotides (excluding the stop codon) for unique ORFs shared between at least one, three, or six replicates. C) Line graphs showing the numbers of unique ORFs detected by each tool shared between at least one, three or six replicates. The x-axis denotes the percentage of overlap used to consider two ORFs being similar or not, with 100% overlap meaning that the detected ORF was fully identical between [x] number of replicates. Please note that the total numbers of ORFs detected per algorithm (y-axis) can differ by an order of magnitude. These numbers are given for each line, with numbers reflecting the total ORFs with 100% similarity between replicates (i.e., the end of each curve). D) Genomic view of a short upstream ORF (uORF) in the SPTBN1 gene indicating that ORF callers have variable affinity for certain types of ORFs. The top two tracks show the ribosomal P-site positions derived from the sequenced ribosome footprints, as processed independently from the sequencing data by the deterministic ORF caller ORFquant (top; red shading) and the probabilistic ORF caller PRICE (bottom; blue shading). The differently colored P-site bars indicate different reading frames (0, +1, +2) on the same transcript, with bars in the same color indicating a shared in-frame codon movement by the ribosome. For this visualization, newly found ORF variations of the annotated CDS that could be assigned to predicted non-coding RNA isoforms (e.g., transcript biotype: “processed_transcript”) but matched the CDS of SPTBN1 are not displayed. E) Genomic view of a near-cognate start codon ORF in TUG1. Image and track details as in (D) above.
Figure 4: An analysis of major non-canonical ORF databases
A) Each dot reflects a dataset, and the y-axis uses a log10 scale to show the number of ORFs included that are >=16 amino acids long and contain an AUG start codon. The GENCODE catalog reflects the summation of the Ji et al. (19), Calviello et al. (61), Raj et al. (20), van Heesch et al. (9), Martinez et al. (21), Chen et al. (18) and Gaertner et al. (11) datasets as described in (16). B) The number of ORFs per dataset compared to the number of samples profiled by Ribo-seq. C) The number of ORFs per dataset compared to the number of unique cell types profiled by Ribo-seq. D) The ratio of the number of ORFs per cell type compared to the number of ORFs per number of samples for each dataset. E) A bubble plot integrating the number of samples, number of different cell or tissue types, and the number of non-canonical ORFs found in each dataset.
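The inclusion criteria named in panel A (at least 16 amino acids and an AUG start codon) can be sketched as a simple filter. This is only an illustration under stated assumptions: the function name `passes_filter` is hypothetical, the length is assumed to count the initiator methionine and to exclude the stop codon, and the article does not specify how each database applied these criteria.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def passes_filter(orf_seq, min_aa=16):
    """Check whether an ORF starts with AUG and encodes at least min_aa residues.

    orf_seq: nucleotide sequence of the ORF (RNA or DNA alphabet), optionally
    ending in a stop codon. The length convention is an assumption (see lead-in).
    """
    seq = orf_seq.upper().replace("T", "U")
    # Reject sequences that are not a whole number of codons.
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Drop a trailing stop codon so it is not counted as an amino acid.
    if codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    return bool(codons) and codons[0] == "AUG" and len(codons) >= min_aa


# Made-up sequences for illustration only.
long_aug = "AUG" + "GCU" * 15 + "UAA"   # 16 codons before the stop
short_aug = "AUG" + "GCU" * 14 + "UAA"  # 15 codons before the stop
long_cug = "CUG" + "GCU" * 15 + "UAA"   # near-cognate start codon
print(passes_filter(long_aug), passes_filter(short_aug), passes_filter(long_cug))
```

Applying one shared filter like this before comparing catalogs is what makes the per-dataset counts in the panel comparable, since the databases differ in whether they report shorter ORFs or near-cognate start codons.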

References

    1. Aebersold R., Agar J. N., Amster I. J., Baker M. S., Bertozzi C. R., Boja E. S., Costello C. E., Cravatt B. F., Fenselau C., Garcia B. A., Ge Y., Gunawardena J., Hendrickson R. C., Hergenrother P. J., Huber C. G., Ivanov A. R., Jensen O. N., Jewett M. C., Kelleher N. L., Kiessling L. L., Krogan N. J., Larsen M. R., Loo J. A., Ogorzalek Loo R. R., Lundberg E., MacCoss M. J., Mallick P., Mootha V. K., Mrksich M., Muir T. W., Patrie S. M., Pesavento J. J., Pitteri S. J., Rodriguez H., Saghatelian A., Sandoval W., Schlüter H., Sechi S., Slavoff S. A., Smith L. M., Snyder M. P., Thomas P. M., Uhlén M., Van Eyk J. E., Vidal M., Walt D. R., White F. M., Williams E. R., Wohlschlager T., Wysocki V. H., Yates N. A., Young N. L., and Zhang B. (2018) How many human proteoforms are there? Nat. Chem. Biol. 14, 206–214
    2. Tress M. L., Abascal F., and Valencia A. (2017) Alternative Splicing May Not Be the Key to Proteome Complexity. Trends Biochem. Sci. 42, 98–110
    3. Blencowe B. J. (2017) The Relationship between Alternative Splicing and Proteomic Complexity. Trends Biochem. Sci. 42, 407–408
    4. Sinitcyn P., Richards A. L., Weatheritt R. J., Brademan D. R., Marx H., Shishkova E., Meyer J. G., Hebert A. S., Westphall M. S., Blencowe B. J., Cox J., and Coon J. J. (2023) Global detection of human variants and isoforms by deep proteome sequencing. Nat. Biotechnol. 10.1038/s41587-023-01714-x
    5. Frankish A., Carbonell-Sala S., Diekhans M., Jungreis I., Loveland J. E., Mudge J. M., Sisu C., Wright J. C., Arnan C., Barnes I., Banerjee A., Bennett R., Berry A., Bignell A., Boix C., Calvet F., Cerdán-Vélez D., Cunningham F., Davidson C., Donaldson S., Dursun C., Fatima R., Giorgetti S., Giron C. G., Gonzalez J. M., Hardy M., Harrison P. W., Hourlier T., Hollis Z., Hunt T., James B., Jiang Y., Johnson R., Kay M., Lagarde J., Martin F. J., Gómez L. M., Nair S., Ni P., Pozo F., Ramalingam V., Ruffier M., Schmitt B. M., Schreiber J. M., Steed E., Suner M.-M., Sumathipala D., Sycheva I., Uszczynska-Ratajczak B., Wass E., Yang Y. T., Yates A., Zafrulla Z., Choudhary J. S., Gerstein M., Guigo R., Hubbard T. J. P., Kellis M., Kundaje A., Paten B., Tress M. L., and Flicek P. (2023) GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 51, D942–D949
