Review
DNA Res. 2021 Sep 13;28(5):dsab007.
doi: 10.1093/dnares/dsab007.

Understanding small ORF diversity through a comprehensive transcription feature classification

Diego Guerra-Almeida et al. DNA Res. 2021.

Abstract

Small open reading frames (small ORFs/sORFs/smORFs) are potentially coding sequences smaller than 100 codons that have historically been considered junk DNA by gene prediction software and in annotation screening; however, the advent of next-generation sequencing has contributed to the deeper investigation of junk DNA regions and their transcription products, resulting in the emergence of smORFs as a new focus of interest in systems biology. Several smORF peptides were recently reported in non-canonical mRNAs as new players in numerous biological contexts; however, their relevance is still overlooked in coding potential analysis. Hence, this review proposes a smORF classification based on transcriptional features, discussing the most promising approaches to investigate smORFs based on their different characteristics. First, smORFs were divided into non-expressed (intergenic) and expressed (genic) smORFs. Second, genic smORFs were classified as smORFs located in non-coding RNAs (ncRNAs) or canonical mRNAs. Finally, smORFs in ncRNAs were further subdivided into sequences located in small or long RNAs, whereas smORFs located in canonical mRNAs were subdivided into several specific classes depending on their localization along the gene. We hope that this review provides new insights into large-scale annotations and reinforces the role of smORFs as essential components of a hidden coding DNA world.

Keywords: alternative ORFs; dual functional RNA; genome annotation; long non-coding RNA; smORF peptides.

Figures

Figure 1
smORF peptide biosynthesis. (A) smORF transcription, translation and cellular/extracellular trafficking. smORF peptide biosynthesis occurs directly via ribosome translation after smORF gene transcription. smORF peptides can play several roles inside and outside the cell. RNA polymerase in the nucleus is shown in green; ribosomes in the cytoplasm are shown in red; and smORF peptides in the cytoplasm are shown as blue winding lines. (B) Schematic representation of a hypothetical ORF. The illustrated ORF is a smORF within the first of the three RNA frames (Frame 1). The smORF is highlighted in bold font; the start codon is shown in green; the stop codon is shown in red; and the remaining codons are shown in blue. Above the smORF codons are their corresponding one-letter-code amino acids, encoding a hypothetical 11 amino acid smORF peptide.
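
The frame-based ORF layout described in panel (B) can be illustrated with a minimal sketch that scans the three forward reading frames of a sequence for ORFs shorter than 100 codons. The function name `find_smorfs`, the ATG-only start codon, and the choice to count the start codon but not the stop codon are assumptions made for illustration; the review does not prescribe a specific scanning procedure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"   # assumption: only the canonical start codon is considered
MAX_CODONS = 100      # the abstract defines smORFs as smaller than 100 codons


def find_smorfs(seq, max_codons=MAX_CODONS):
    """Return (frame, start, end, codons) for each start..stop ORF below max_codons.

    frame is 1, 2 or 3; start and end are 0-based, end is exclusive and
    includes the stop codon. codons counts the start codon and excludes the
    stop codon (an assumption about how length is measured).
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    hits = []
    for offset in range(3):
        open_start = None  # position of the first start codon of the current ORF
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if open_start is None and codon == START_CODON:
                open_start = i
            elif open_start is not None and codon in STOP_CODONS:
                codons = (i - open_start) // 3
                if codons < max_codons:
                    hits.append((offset + 1, open_start, i + 3, codons))
                open_start = None
    return hits


if __name__ == "__main__":
    # Hypothetical 11-codon smORF in Frame 1, mirroring the size used in panel (B).
    example = "ATG" + "GCT" * 10 + "TGA"
    for frame, start, end, codons in find_smorfs(example):
        print(f"Frame {frame}: {start}-{end}, {codons} codons")
```

Only the forward strand is scanned here; a fuller search would also cover the reverse complement and, depending on the organism, non-ATG start codons.
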
Figure 2
Proposed smORF classification. (A) smORF classes and their representative locations. Hundreds of thousands of smORFs in the genome are non-expressed and are therefore classified as intergenic smORFs (green box). Expressed smORFs are classified as genic smORFs (green box) and are subdivided into smORFs located in non-coding RNAs (ncRNAs) (red box) and smORFs located in canonical mRNAs (blue box). Different types of ncRNAs and canonical mRNAs with their respective classes of smORFs are represented in red and blue boxes. Red tracks represent intergenic smORFs; blue tracks represent genic smORFs; yellow tracks represent large ORFs. (B) smORF classification chain. SmORF classes can be organized into groups and subgroups defined by transcriptional features.
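
The classification chain in panel (B) can be read as a small decision tree. The sketch below encodes that tree under stated assumptions: the function name `classify_smorf`, its arguments, and the string labels are hypothetical, and the mRNA subclass is passed through as free text because the abstract and captions shown here do not enumerate every mRNA class.

```python
from typing import Optional, Tuple


def classify_smorf(
    expressed: bool,
    transcript: Optional[str] = None,
    detail: Optional[str] = None,
) -> Tuple[str, ...]:
    """Return the path through the classification chain as a tuple of labels.

    expressed  -- False for non-expressed smORFs, True for expressed ones
    transcript -- "ncRNA" or "mRNA" (only used when expressed is True)
    detail     -- "small RNA" or "long RNA" for ncRNAs; for mRNAs, a
                  caller-supplied location label (hypothetical free text)
    """
    if not expressed:
        # Non-expressed smORFs are classified as intergenic.
        return ("intergenic",)
    if transcript == "ncRNA":
        if detail not in ("small RNA", "long RNA"):
            raise ValueError("ncRNA smORFs need detail 'small RNA' or 'long RNA'")
        return ("genic", "ncRNA", detail)
    if transcript == "mRNA":
        # mRNA smORFs are subdivided by their localization along the gene.
        return ("genic", "canonical mRNA", detail or "unspecified location")
    raise ValueError("expressed smORFs need transcript 'ncRNA' or 'mRNA'")


if __name__ == "__main__":
    print(classify_smorf(False))
    print(classify_smorf(True, "ncRNA", "long RNA"))
    print(classify_smorf(True, "mRNA", "upstream"))
```

Returning the whole path rather than only the leaf label keeps the group and subgroup structure visible, which is the point of the classification chain.
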
Figure 3
General scheme of the smORF classification of misannotated non-coding transcripts. Strictly coding lncRNAs are reclassified as smORF located in mRNAs, while ncRNAs showing both coding and regulatory roles are reannotated as dual-function transcripts. Three smORF classes can be identified in ncRNAs: small refCDSs, smORFs in lncRNAs (dual functional) and smORFs in small RNAs (generally dual functional). Red tracks represent coding smORFs.
Figure 4
Comparison between large ORF mRNAs and smORF transcripts, which can occur in polycistronic arrangements. The yellow track represents large reference CDS; blue tracks represent small reference CDSs (coding smORFs). The lower panel represents a polycistronic mRNA containing three smORFs (blue tracks).
Figure 5
Mechanisms of isoformic smORF biosynthesis. Isoformic smORFs are generated via large ORF transcriptional editing, which fragments large CDSs into smaller variants that can fall within the smORF length limits. (A) Alternative splicing (AS) is the best described mechanism of isoformic smORF biosynthesis; however, other molecular processes, such as (B) alternative transcription initiation via alternative promoters, (C) alternative polyadenylation cleavage, (D) alternative refCDS translation mediated by downstream start codons and (E) the fragmentation of homologous pseudogenes, could theoretically generate isoformic smORFs. Black tracks represent exons; black lines represent introns; green lines represent ORF variants; red, yellow and blue circles indicate the respective processes on the left.
Figure 6
Location and distribution of alternative smORFs. (A) Canonical monocistronic mRNA paradigm comprising a unique large CDS between untranslated regions (UTRs). (B) Alternative smORF division and distribution within an mRNA. Upstream smORFs are located in the 5′UTR, and their stop codons may extend across the reference CDS region. The start codons of overlapping smORFs strictly overlap with the reference CDS region, but their sequences may extend to the 3′UTR. Downstream smORFs are located totally within the 3′UTR. (C) Examples of well-known representative mRNAs exhibiting several alternative smORFs (sequence analysis by the authors). Alternative smORFs are commonly encountered in mRNAs, and their coding potential is still underappreciated. Yellow tracks represent reference CDSs (large ORFs); blue tracks represent alternative smORFs.
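
The positional rules in panel (B) depend on where the smORF start codon lies relative to the reference CDS, so they can be sketched as a comparison of coordinates. The function name `classify_alternative_smorf` and the 0-based, end-exclusive coordinate convention are assumptions for illustration, not part of the review.

```python
def classify_alternative_smorf(smorf_start: int, cds_start: int, cds_end: int) -> str:
    """Classify an alternative smORF by the position of its start codon.

    All coordinates are 0-based positions on the mRNA; cds_end is exclusive
    (assumed convention). Only the start position is needed because, per the
    caption, upstream smORFs may run into the reference CDS and overlapping
    smORFs may run into the 3'UTR.
    """
    if cds_start >= cds_end:
        raise ValueError("cds_start must be smaller than cds_end")
    if smorf_start < cds_start:
        return "upstream"      # start codon in the 5'UTR
    if smorf_start < cds_end:
        return "overlapping"   # start codon within the reference CDS region
    return "downstream"        # start codon (and whole smORF) in the 3'UTR


if __name__ == "__main__":
    # Hypothetical mRNA with a reference CDS spanning positions 150-1350.
    for start in (30, 400, 1500):
        print(start, classify_alternative_smorf(start, 150, 1350))
```
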
