Review

J Biomed Sci. 2022 Mar 17;29(1):19.

doi: 10.1186/s12929-022-00802-5.

Short open reading frames (sORFs) and microproteins: an update on their identification and validation measures

Alyssa Zi-Xin Leong¹, Pey Yee Lee¹, M Aiman Mohtar¹, Saiful Effendi Syafruddin¹, Yuh-Fen Pung², Teck Yew Low³

Affiliations

¹ UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000, Kuala Lumpur, Malaysia.
² Division of Biomedical Science, School of Pharmacy, University of Nottingham Malaysia, Semenyih, 43500, Selangor, Malaysia.
³ UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000, Kuala Lumpur, Malaysia. lowteckyew@ppukm.ukm.edu.my.

PMID: 35300685
PMCID: PMC8928697
DOI: 10.1186/s12929-022-00802-5

Alyssa Zi-Xin Leong et al. J Biomed Sci. 2022.


Abstract

A short open reading frame (sORF) comprises ≤ 300 bases and encodes a microprotein, or sORF-encoded protein (SEP), of ≤ 100 amino acids. Traditionally dismissed by genome annotation pipelines as meaningless noise, sORFs were found to possess coding potential through ribosome profiling (RIBO-Seq), which unveiled sORF-based transcripts at various genome locations. Nonetheless, the existence of corresponding microproteins that are stable and functional was initially supported by little experimental evidence. With recent advancements in multi-omics, the identification, validation, and functional characterisation of sORFs and microproteins have become feasible. In this review, we discuss the history and development of the emerging research field of sORFs and microproteins. In particular, we focus on an array of bioinformatics and OMICS approaches used for predicting, sequencing, validating, and characterising these recently discovered entities. These strategies include RIBO-Seq, which detects sORF transcripts via ribosome footprints, and mass spectrometry (MS)-based proteomics for sequencing the resultant microproteins. Our discussion then extends to the functional characterisation of microproteins through CRISPR/Cas9 screens and protein-protein interaction (PPI) studies. We discuss not only detection methodologies but also the challenges and potential solutions in identifying and validating sORFs and their microproteins. The novelty of this review lies in its emphasis on validating the functional role of microproteins, which could contribute towards the future landscape of microproteomics.

Keywords: Mass spectrometry; Microproteins; Proteogenomics; Ribosome profiling (RIBO-Seq); Short open reading frame (sORF); Small open reading frame (smORF).
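The length thresholds stated in the abstract (≤ 300 bases, ≤ 100 amino acids) can be illustrated with a minimal Python sketch that scans the three forward reading frames of a DNA sequence for candidate sORFs. This is an illustration only, not a method taken from the review: the function name `find_sorfs`, the restriction to AUG (ATG) start codons, the forward-strand-only scan, and the choice to count amino acids excluding the stop codon are all assumptions made for the example.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)
STOPS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, max_aa=100):
    """Return (start, end, peptide) for ATG-initiated ORFs of <= max_aa residues.

    Hypothetical helper: forward strand only, first in-frame ATG per stop,
    ORFs lacking a stop codon are ignored. Coordinates are 0-based, end-exclusive
    and include the stop codon.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOPS:
                # translate from the start codon up to (not including) the stop
                peptide = "".join(
                    CODON_TABLE.get(seq[j:j + 3], "X") for j in range(start, i, 3)
                )
                if len(peptide) <= max_aa:
                    hits.append((start, i + 3, peptide))
                start = None
    return hits


print(find_sorfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'MKF')]
```

A real annotation pipeline would also need to handle the reverse strand, non-AUG initiation codons, and supporting evidence such as ribosome footprints; a purely sequence-based scan like this one yields many candidates that are never translated.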


Conflict of interest statement

The authors have declared no conflict of interest.

Figures

**Fig. 1**
A comparison between sORF and altORF transcripts in terms of length and initiation codons. (A) sORF transcript structure with AUG or non-AUG initiation codons, characterised by a short length of up to 100 codons after post-transcriptional modifications. (B) altORF transcript structure with an AUG initiation codon, longer than 30 codons and with no upper limit on length, in contrast to sORFs

**Fig. 2**
Localities of sORFs in the genome and transcripts. Genomic locations of sORFs include the 5’ UTR (uORF), the 3’ UTR (dORF), regions overlapping the main ORF, intergenic regions and pseudogenes. sORF-containing long intergenic non-coding RNAs (lincRNAs) are also localised in the nucleus. In the mitochondria, sORFs are found in the mitochondrial DNA (mtDNA). In the cytoplasm, sORFs are scattered across different RNA transcripts, i.e., circular RNA (circRNA), long non-coding RNA (lncRNA), and pri-microRNA

**Fig. 3**
Ribosome profiling process in which ribosome footprints are obtained for deep sequencing. Ribosome-bound mRNAs are isolated by treatment with non-specific nucleases (such as RNase I or micrococcal nuclease). Ribosome footprints (showing positioning between the start and stop codons of a gene) are then used for library generation and deep sequencing. Identification of novel small peptides is made possible by isolating actively translated regions of the transcript, which are mapped directly back to genomic coding regions

**Fig. 4**
Mass spectrometry-based approaches to isolate microproteins. Sample preparation prior to LC–MS/MS analysis to isolate microprotein species < 30 kDa in size includes size exclusion approaches. Molecular weight cut-off filters (MWCOs) can sieve for microproteins depending on the type of filter used, i.e., 10 kDa or 30 kDa. Acid precipitation is a common enrichment step used to precipitate larger proteins. Solid phase extraction (SPE) enrichment occurs via reverse-phase C8 cartridges and elutes microproteins of interest. Further methods for reducing sample complexity include electrostatic repulsion-hydrophilic interaction chromatography (ERLIC) and high-resolution isoelectric focusing (Hi-RIEF). ERLIC separates analytes based on charge and utilises SAX resin for strong anion exchange, whereas Hi-RIEF separates peptides based on their isoelectric points (pI) on a pH gradient gel. Post-fractionation accuracy is dependent on high sequence coverage and low background noise in mass spectra. This can be achieved using High-energy Collision Induced Dissociation (HCD) on Fusion Tribrid MS or Q-Exactive MS
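As a rough in-silico counterpart to the MWCO step described in the Fig. 4 caption, the following Python sketch estimates the average molecular mass of a peptide sequence and checks it against a 10 kDa or 30 kDa cut-off. It is a simplified illustration under stated assumptions: the function names `average_mass_da` and `passes_mwco` are hypothetical, post-translational modifications are ignored, and real filter retention also depends on protein shape and charge, not mass alone.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free N- and C-termini


def average_mass_da(peptide):
    """Approximate average mass (Da) of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide.upper()) + WATER


def passes_mwco(peptide, cutoff_kda=30.0):
    """True if the peptide mass is below the nominal filter cut-off.

    Assumption for illustration: species below the cut-off pass through
    the filter, and mass is the only criterion.
    """
    return average_mass_da(peptide) / 1000.0 < cutoff_kda


candidate = "M" + "A" * 99  # hypothetical 100-residue microprotein
print(round(average_mass_da(candidate) / 1000.0, 2), passes_mwco(candidate, 10.0))
```

Because a 100-residue microprotein averages roughly 11 kDa for typical amino acid compositions, the choice between a 10 kDa and a 30 kDa filter can decide whether the longest microproteins are retained or lost.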



Grants and funding

GUP-2020-078/Universiti Kebangsaan Malaysia
