BMC Genomics. 2023 May 1;24(1):226.
doi: 10.1186/s12864-023-09311-7.

Small open reading frames: a comparative genetics approach to validation

Niyati Jain et al. BMC Genomics. 2023.

Abstract

Open reading frames (ORFs) with fewer than 100 codons are generally not annotated in genomes, although bona fide genes of that size are known. Newer biochemical studies have suggested that thousands of small protein-coding ORFs (smORFs) may exist in the human genome, but the true number and the biological significance of the micropeptides they encode remain uncertain. Here, we used a comparative genomics approach to identify high-confidence smORFs that are likely protein-coding. We identified 3,326 high-confidence smORFs using constraint within human populations and evolutionary conservation as additional lines of evidence. Next, we validated that, as a group, our high-confidence smORFs are conserved at the amino-acid level rather than merely residing in highly conserved non-coding regions. Finally, we found that high-confidence smORFs are enriched among disease-associated variants from GWAS. Overall, our results highlight that smORF-encoded peptides likely have important functional roles in human disease.

Keywords: Comparative genetics; Evolutionary conservation; Human genetic variation; Micropeptides; Small open reading frames.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Overview of smORF datasets. A Study workflow. B Venn diagram showing the overlap between the predictions from Chen et al. and Martinez et al. in the filtered smORFs dataset. C Venn diagram showing the overlap between the predictions from Chen et al. and Martinez et al. in the high-confidence smORFs dataset. Abbreviations: MOEUF, missense observed/expected upper bound fraction; GERP, Genome Evolutionary Rate Profiling; GWAS, genome-wide association studies
Fig. 2
Selection of high-confidence smORFs. For selected RefSeq genes with varying pLI scores and amino acid length < 150 (left facets) and smORFs (right facets), violin and box plots showing A N/S ratios, B LoF/S ratios, C MOEUF scores, and D GERP scores. The MOEUF and GERP thresholds used to filter putative smORFs are shown by a dashed red line. Selected RefSeq genes are grouped by pLI score into low (n = 400), moderate (n = 400) and high (n = 400) categories, plus genes with fewer than 150 amino acids (n = 400). smORF subsets include known smORFs (n = 28), putative smORFs unique to the Chen et al. (n = 4,030) and Martinez et al. (n = 1,244) datasets, putative smORFs with exact matches reported by both Chen et al. and Martinez et al. (n = 515), and smORFs in both datasets with imperfect overlap (n = 739). Box plots display the first quartile, median and third quartile. Abbreviations: N, nonsynonymous; S, synonymous; LoF, loss-of-function; pLI, probability of loss-of-function intolerant
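
The threshold-based selection shown in panels C and D can be illustrated with a minimal Python sketch. The cutoff values, the record fields, and the option to require both criteria are assumptions made for illustration; the caption does not state them. The sketch relies only on the usual direction of the two scores: a lower MOEUF indicates stronger constraint, and a higher GERP score indicates stronger conservation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SmORF:
    """Hypothetical record for one putative smORF."""
    name: str
    moeuf: float  # missense observed/expected upper bound fraction
    gerp: float   # Genome Evolutionary Rate Profiling score


# Placeholder cutoffs: NOT the thresholds used in the study.
MOEUF_MAX = 1.0
GERP_MIN = 2.0


def select_high_confidence(
    smorfs: List[SmORF],
    moeuf_max: float = MOEUF_MAX,
    gerp_min: float = GERP_MIN,
    require_both: bool = True,
) -> List[SmORF]:
    """Keep smORFs that pass the constraint and/or conservation cutoffs."""
    selected = []
    for s in smorfs:
        constrained = s.moeuf <= moeuf_max  # low MOEUF = constrained
        conserved = s.gerp >= gerp_min      # high GERP = conserved
        if require_both:
            keep = constrained and conserved
        else:
            keep = constrained or conserved
        if keep:
            selected.append(s)
    return selected


# Invented example values, for demonstration only.
candidates = [
    SmORF("orf_a", moeuf=0.6, gerp=3.1),
    SmORF("orf_b", moeuf=1.4, gerp=3.5),
    SmORF("orf_c", moeuf=0.7, gerp=0.2),
]
high_confidence = select_high_confidence(candidates)
```

Whether the two lines of evidence are combined with a logical AND or OR changes the size of the resulting set considerably, which is why the sketch exposes that choice as a parameter rather than fixing it.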
Fig. 3
Conservation of high-confidence smORFs at the protein-coding level. A Violin and box plots showing the distribution of N/S ratios of the correct (blue) and “incorrect” (grey) reading frames of RefSeq genes (n = 1,200) and high-confidence smORFs (n = 2,891), with a lower N/S ratio observed in the correct reading frame. B Violin and box plots showing that the correct reading frame (blue) had lower MOEUF scores than all five “incorrect” reading frames (grey) for both RefSeq genes (n = 1,200) and high-confidence smORFs (n = 2,891). Box plots display the first quartile, median and third quartile. All P values are based on the paired Wilcoxon signed rank test
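
The paired comparison named in this caption can be sketched in Python as follows. This is a simplified illustration, not the authors' code: it uses the large-sample normal approximation without continuity or tie corrections, and the function name and example values are hypothetical. For real analyses an established implementation such as `scipy.stats.wilcoxon` would be the safer choice.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Paired Wilcoxon signed rank test (normal approximation, two-sided).

    Returns (W_plus, p_value), where W_plus is the sum of ranks of the
    positive differences x[i] - y[i]. Zero differences are discarded.
    """
    if len(x) != len(y):
        raise ValueError("x and y must be paired (same length)")
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value


# Invented N/S ratios for the same ORFs read in the correct frame
# and in one "incorrect" frame (illustration only).
correct_frame = [1.8, 2.0, 1.5, 2.2, 1.9, 1.7, 2.1, 1.6]
incorrect_frame = [2.6, 2.4, 2.9, 2.5, 3.0, 2.3, 2.8, 2.7]
w_plus, p_value = wilcoxon_signed_rank(correct_frame, incorrect_frame)
```

The test is paired because each ORF contributes one value per reading frame, so the comparison is made within each ORF rather than between two independent groups.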
Fig. 4
Enrichment of GWAS variants within smORFs. Permutation analysis testing enrichment of SNVs associated with disease/other traits within smORFs, segregated by MAF bins (facets), revealed a statistically significant enrichment among smORFs (Fisher’s method meta-analysis P = 4.96 × 10⁻⁵). P values were calculated by comparing the number of observed overlapping SNVs (red) to the number expected based on 10,000 permutations (grey histograms). The meta-analysis used Fisher’s method with H0: no enrichment of GWAS SNVs in smORFs. Abbreviation: MAF, minor allele frequency
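
The two statistical steps named in this caption, a permutation-based P value per MAF bin and a Fisher's method combination across bins, can be sketched in Python as below. The function names are hypothetical, the permuted counts are taken as given (the caption does not describe how the permutations were drawn), and the add-one correction in the empirical P value is a common convention assumed here, not something stated in the source.

```python
import math
from typing import Sequence


def empirical_p(observed: int, permuted: Sequence[int]) -> float:
    """One-sided permutation P value for enrichment.

    Fraction of permuted overlap counts at least as large as the observed
    count, with an add-one correction so the result is never exactly zero.
    """
    exceed = sum(1 for count in permuted if count >= observed)
    return (exceed + 1) / (len(permuted) + 1)


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Combine independent P values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under H0. For even degrees of freedom the
    upper tail has a closed form: exp(-X/2) * sum_{j<k} (X/2)^j / j!.
    """
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one P value, each in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_x / j  # builds (X/2)^j / j! incrementally
        total += term
    return math.exp(-half_x) * total


# Invented per-bin P values, for demonstration only.
per_bin_p = [0.03, 0.20, 0.01, 0.08]
combined = fisher_combined_p(per_bin_p)
```

Fisher's method assumes the per-bin tests are independent; here that is plausible only to the extent that each SNV falls into exactly one MAF bin.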

References

    1. Martinez TF, Chu Q, Donaldson C, Tan D, Shokhirev MN, Saghatelian A. Accurate annotation of human protein-coding small open reading frames. Nat Chem Biol. 2020;16(4):458–468. doi: 10.1038/s41589-019-0425-0.
    2. Couso JP. Finding smORFs: getting closer. Genome Biol. 2015;16:189. doi: 10.1186/s13059-015-0765-3.
    3. Basrai MA, Hieter P, Boeke JD. Small open reading frames: beautiful needles in the haystack. Genome Res. 1997;7(8):768–771. doi: 10.1101/gr.7.8.768.
    4. Chen J, Brunner AD, Cogan JZ, Nunez JK, Fields AP, Adamson B, Itzhak DN, Li JY, Mann M, Leonetti MD, et al. Pervasive functional translation of noncanonical human open reading frames. Science. 2020;367(6482):1140–1146. doi: 10.1126/science.aay0262.
    5. Stein CS, Jadiya P, Zhang X, McLendon JM, Abouassaly GM, Witmer NH, Anderson EJ, Elrod JW, Boudreau RL. Mitoregulin: A lncRNA-Encoded Microprotein that Supports Mitochondrial Supercomplexes and Respiratory Efficiency. Cell Rep. 2018;23(13):3710–3720e3718. doi: 10.1016/j.celrep.2018.06.002.