Nucleic Acids Res. 2021 Sep 7;49(15):e89.
doi: 10.1093/nar/gkab477.

smORFer: a modular algorithm to detect small ORFs in prokaryotes

Alexander Bartholomäus et al. Nucleic Acids Res. 2021.

Abstract

Emerging evidence places small proteins (≤50 amino acids) more centrally in physiological processes. Yet their functional identification and the systematic genome annotation of their cognate small open-reading frames (smORFs) remain challenging, both experimentally and computationally. Ribosome profiling, or Ribo-Seq (deep sequencing of ribosome-protected fragments), enables detection of actively translated open-reading frames (ORFs) and empirical annotation of coding sequences (CDSs) using the in-register translation pattern that is characteristic of genuinely translating ribosomes. Multiple ORF identifiers that use the 3-nt periodicity in Ribo-Seq data sets have been successful in eukaryotic smORF annotation, but they have difficulty evaluating prokaryotic genomes because of their unique architecture (e.g. polycistronic messages, overlapping ORFs, leaderless translation, non-canonical initiation). Here, we present a new algorithm, smORFer, which detects putative smORFs in prokaryotic organisms with high accuracy. The unique feature of smORFer is its integrated approach: it considers structural features of the genetic sequence along with in-frame translation and uses a Fourier transform to convert these parameters into a measurable score to faithfully select smORFs. The algorithm is executed in a modular way: depending on the data available for a particular organism, different modules can be selected for the smORF search.
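To make the search space concrete, the following is a minimal sketch of how candidate smORFs could be enumerated from a genomic sequence before any Ribo-Seq evidence is applied. It is not smORFer's implementation. The function names, the start codon set (ATG, GTG, TTG), the stop codon set and the minimum length of three codons are assumptions chosen for illustration; only the upper limit of 50 codons follows the definition of small proteins given above.

```python
# Illustrative sketch only; names and codon sets are assumptions, not smORFer's code.
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed set of start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}    # standard stop codons
MAX_CODONS = 50                        # small proteins: <=50 amino acids

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_smorfs(seq, max_codons=MAX_CODONS, min_codons=3):
    """Enumerate candidate smORFs on both strands.

    Returns tuples (start, end, strand, orf_sequence). Coordinates are
    0-based, half-open, and always given on the forward strand. The ORF
    sequence includes the stop codon and is written 5'->3' on its own strand.
    Every in-frame start codon is reported, so nested candidates that share
    a stop codon appear as separate entries.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(n - 2):
            if s[i:i + 3] not in START_CODONS:
                continue
            # Walk downstream codon by codon until the first in-frame stop.
            for j in range(i + 3, n - 2, 3):
                n_codons = (j - i) // 3  # coding codons, stop excluded
                if n_codons > max_codons:
                    break  # too long to be a smORF
                if s[j:j + 3] in STOP_CODONS:
                    if n_codons >= min_codons:
                        if strand == "+":
                            start, end = i, j + 3
                        else:
                            start, end = n - (j + 3), n - i
                        hits.append((start, end, strand, s[i:j + 3]))
                    break
    return hits
```

For example, `find_smorfs("TTGTTGAAACCCGGGTAA")` reports two candidates that differ only in which of the two adjacent TTG codons is taken as the start. A list like this would then need to be filtered by sequence periodicity, translation and initiation-site evidence.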


Figures

Figure 1.
General scheme of the smORFer algorithm with its three modules, which evaluate genomic information (module A, green), translation and 3-nt periodicity in the ribosome-protected fragments (RPFs) from Ribo-Seq data (module B, blue), and translation initiation sites (TIS) from TIS-Ribo-Seq (module C, orange).
Figure 2.
Metagene analysis of the genomic sequence periodicity across the 5′UTRs, CDSs and 3′UTRs of all protein-coding transcripts in E. coli (black), B. subtilis (blue) and S. aureus (red). ORFs are aligned at the start or stop codon, respectively. Note that the GC content differs among organisms and is 51% for E. coli, 44% for B. subtilis and 33% for S. aureus. Only non-overlapping protein-coding ORFs are considered. The horizontal dashed line denotes the average structure of a hypothetical genome with 50% GC content.
Figure 3.
Fourier transform (FT) of the genomic 3-nt sequence periodicity in ORFs. (A) Averaged 3-nt sequence periodicity of protein-coding ORFs (upper panel), an intergenic region (middle panel) and a non-protein-coding gene (e.g. 5S rRNA, lower panel). (B) The FT signal at 3-nt sequence periodicity normalized by the ORF length (left) and by the arithmetic mean of the signal between periodicities of 1.5 and 3 (right) for protein-coding ORFs. (C) Cumulative distributions of FT values (from B) for protein-coding ORFs. Vertical red line, cutoff of 3.
Figure 4.
Detected smORFs for Ribo-Seq and TIS-Ribo-Seq data for E. coli and FT analysis of calibrated RPF counts. (A) Fraction of putative smORFs (dashed lines, left axis) and known smORFs (solid lines, right axis) (22) that are detected by smORFer as translated and with a genuine TIS. Note the linear dependence of known smORFs, which is caused by their different expression levels, while the putative smORFs show a non-linear dependence. Red, translated (i.e. RPF counts); blue, with genuine TIS (i.e. RPF counts at TIS); black, both translated and with TIS counts. Vertical line denotes the cutoff of ≥5 RPFs. (B) RPF reads plotted in full length (grey, left axis) for the first 1000 nt of the RNase I transcript compared to the calibrated RPF counts (black, right axis). (C) 3-nt periodicity FT signal of the calibrated RPFs for the RNase I transcript. (D) Cumulative distributions of FT values for protein-coding ORFs. Vertical red line, cutoff of 2.
Figure 5.
Examples of known and newly detected smORFs with smORFer in E. coli. (A) Examples of known and already experimentally verified smORFs (22) that are also detected with smORFer. (B) Examples of newly identified smORFs from each category, translated (upper panel) and 3nt-translated (lower panel). Lower panel: smORFer predicted two smORFs that differ only by their adjacent start codons. Since TIS-Seq counts are spread ± one codon around the start codon (Supplementary Figure S2), there is no clear-cut indication of a preferred start. smORF 25 has two consecutive start codons (both TTG) and is longer than smORF 24 by one start codon; otherwise both smORFs are identical. (C) Complex example of smORFs overlapping with known ORFs, illustrating the strand-specificity of RPF and TIS-Seq counts and the precise identification of the smORF translational start site. All three smORFs, including the short yibX-S version of yibX, are detected by smORFer and were experimentally verified in (22). Counts displayed as positive values on the y-axes represent counts of ORFs located on the forward DNA strand, and negatively displayed counts represent ORFs on the reverse strand. (A–C) Blue, RPF counts from the Ribo-Seq (left axis); red, counts from the TIS-Seq (right axis). ORF architecture is shown at the bottom: blue arrow, ORFs located on the forward strand; gray, ORFs located on the reverse strand; nt denotes the distance to the next ORF; two black dashes designate truncated, incompletely displayed adjacent ORFs.
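The 3-nt periodicity scoring described in the captions of Figures 3 and 4 can be illustrated with a short sketch. This is a hedged reading of the captions, not the published code. The function names are hypothetical. Taking the "signal between periodicity at 1.5 and 3" to mean the Fourier bins strictly between those two periods is an assumption, as is applying the cutoff of 2 to the ratio score rather than to the length-normalized score. The input is any per-nucleotide numeric signal, for example calibrated RPF counts along a candidate ORF; how a nucleotide sequence is converted to numbers for module A is not specified in this text and is left out.

```python
import cmath
import math


def dft_amplitude(signal, k):
    """Amplitude |X[k]| of the discrete Fourier transform at bin k."""
    n = len(signal)
    return abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                   for t, x in enumerate(signal)))


def periodicity_scores(signal):
    """Return (peak / length, peak / mean background) for the 3-nt period.

    The signal is trimmed to whole codons so that period 3 falls exactly
    on bin n/3. Background = mean amplitude of the bins whose period lies
    strictly between 3 nt and 1.5 nt (an assumed reading of the caption).
    """
    n = len(signal) - len(signal) % 3
    if n < 6:
        raise ValueError("need at least two codons of signal")
    sig = signal[:n]
    mean = sum(sig) / n
    centered = [x - mean for x in sig]      # remove the constant offset
    k3 = n // 3                             # bin of the 3-nt period
    peak = dft_amplitude(centered, k3)
    bins = range(k3 + 1, 2 * k3)            # periods between 3 and 1.5 nt
    background = sum(dft_amplitude(centered, k) for k in bins) / len(bins)
    if background > 0:
        ratio = peak / background
    else:
        ratio = math.inf if peak > 0 else 0.0
    return peak / n, ratio


def passes_translation_filter(rpf_counts, min_rpf=5, ft_cutoff=2.0):
    """Keep a candidate with >=5 RPFs and a periodicity ratio >= cutoff.

    Both thresholds are the cutoffs quoted in the Figure 4 caption; which
    score the FT cutoff applies to is an assumption of this sketch.
    """
    if sum(rpf_counts) < min_rpf:
        return False
    _, ratio = periodicity_scores(rpf_counts)
    return ratio >= ft_cutoff
```

A count profile that peaks at the same position in every codon, such as `[9, 1, 0] * 10`, passes this filter, whereas a flat profile or a profile with fewer than five reads does not. For sequence-based scoring the same functions could be reused with the cutoff of 3 quoted in the Figure 3 caption, given a suitable numeric encoding.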

References

    1. Fickett J.W. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 1982; 10:5303–5318.
    2. Basrai M.A., Hieter P., Boeke J.D. Small open reading frames: beautiful needles in the haystack. Genome Res. 1997; 7:768–771.
    3. Maeda N., Kasukawa T., Oyama R., Gough J., Frith M., Engstrom P.G., Lenhard B., Aturaliya R.N., Batalov S., Beisel K.W. et al. Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genet. 2006; 2:e62.
    4. Angiuoli S.V., Gussman A., Klimke W., Cochrane G., Field D., Garrity G., Kodira C.D., Kyrpides N., Madupu R., Markowitz V. et al. Toward an online repository of Standard Operating Procedures (SOPs) for (meta)genomic annotation. OMICS. 2008; 12:137–141.
    5. Ramamurthi K.S., Storz G. The small protein floodgates are opening; now the functional analysis begins. BMC Biol. 2014; 12:96.
