Nucleic Acids Res. 2023 Nov 27;51(21):11504-11517.
doi: 10.1093/nar/gkad814.

StORF-Reporter: finding genes between genes

Nicholas J Dimonaco et al. Nucleic Acids Res.

Abstract

Large regions of prokaryotic genomes are currently without any annotation, in part due to well-established limitations of annotation tools. For example, it is routine for genes using alternative start codons to be misreported or completely omitted. Therefore, we present StORF-Reporter, a tool that takes an annotated genome and returns regions that may contain missing CDS genes from unannotated regions. StORF-Reporter consists of two parts. The first begins with the extraction of unannotated regions from an annotated genome. Next, Stop-ORFs (StORFs) are identified in these unannotated regions. StORFs are open reading frames that are delimited by stop codons and thus can capture those genes most often missing in genome annotations. We show this methodology recovers genes missing from canonical genome annotations. We inspect the results of the genomes of model organisms, the pangenome of Escherichia coli, and a set of 5109 prokaryotic genomes of 247 genera from the Ensembl Bacteria database. StORF-Reporter extended the core, soft-core and accessory gene collections, identified novel gene families and extended families into additional genera. The high levels of sequence conservation observed between genera suggest that many of these StORFs are likely to be functional genes that should now be considered for inclusion in canonical annotations.

Figures

Graphical Abstract
Figure 1.
Visual representation of how unannotated regions (URs) are selected for extraction. URs that are less than 30 nt are not extracted. URs are extracted with an additional 50 nt on their 5’ and 3’ ends to allow for overlapping genes and the upstream untranslated region between the first stop codon and the true start codon.
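The extraction rule in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the StORF-Reporter implementation: the function name, the 0-based half-open coordinates, and the treatment of the genome as a single linear sequence are all assumptions made for the example. Only the 30 nt minimum and the 50 nt extension come from the caption.

```python
def extract_unannotated_regions(genome_length, genes, min_len=30, pad=50):
    """Return padded unannotated regions as (start, end) half-open intervals.

    genes: iterable of (start, end) annotated intervals, 0-based and half-open
    (an assumed convention for this sketch). Strand is ignored and the genome
    is treated as linear, so a region spanning the origin of a circular
    chromosome is not joined.
    """
    regions = []
    cursor = 0  # end of the annotated territory seen so far
    for start, end in sorted(genes):
        # Gaps shorter than min_len are skipped; the rest are extended by pad.
        if start - cursor >= min_len:
            regions.append((max(0, cursor - pad), min(genome_length, start + pad)))
        cursor = max(cursor, end)  # handles overlapping or nested genes
    # Trailing gap after the last annotated gene.
    if genome_length - cursor >= min_len:
        regions.append((max(0, cursor - pad), genome_length))
    return regions


if __name__ == "__main__":
    genes = [(100, 200), (210, 400), (500, 900)]
    print(extract_unannotated_regions(1000, genes))
```

In this toy input the 10 nt gap between the first two genes is discarded, while the other gaps are returned with up to 50 nt of flanking annotated sequence on each side.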
Figure 2.
Visual representation of a StORF and how it can capture multiple potential start codons in an unannotated region. Part A depicts a StORF capturing two possible start positions (GTG and ATG) for a CDS gene, which could produce two distinct CDS sequences (CDS1 and CDS2). The dotted segment of the StORF represents the untranslated part of the sequence. Part B shows how a StORF can comprise only a partial segment of a gene if that gene either recodes a canonical stop codon (stop to Trp) or has had an in-frame stop codon mutation, resulting in one complete and one truncated transcript (CDS3 and CDS4). StORF-Reporter can be used to find these consecutive StORFs.
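The stop-to-stop idea in Part A can be illustrated with a short Python sketch. It is not the tool's own code: the function names, the `min_len` threshold, the forward-strand-only scan, and the coordinate convention are assumptions chosen for the example. The stop codons (TAA, TAG, TGA) and the start codons ATG, GTG and TTG are those of the standard bacterial genetic code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}


def find_storfs(seq, min_len=30):
    """Return (start, end, frame) for stop-delimited ORFs on the forward strand.

    start is the index of the opening stop codon and end is one past the
    closing stop codon. min_len (hypothetical default) is the minimum number
    of nucleotides between the two stops. The reverse strand is not scanned.
    """
    seq = seq.upper()
    storfs = []
    for frame in range(3):
        prev_stop = None
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if prev_stop is not None and i - (prev_stop + 3) >= min_len:
                    storfs.append((prev_stop, i + 3, frame))
                prev_stop = i  # consecutive StORFs share this stop codon
    return storfs


def candidate_starts(seq, storf):
    """List (position, codon) for in-frame start codons inside a StORF."""
    start, end, _frame = storf
    seq = seq.upper()
    return [(i, seq[i:i + 3]) for i in range(start + 3, end - 3, 3)
            if seq[i:i + 3] in START_CODONS]


if __name__ == "__main__":
    region = "TAA" + "GTG" + "ATG" + "GCT" * 10 + "TGA"
    for storf in find_storfs(region):
        print(storf, candidate_starts(region, storf))
```

Because each closing stop also serves as the opening stop of the next candidate, a gene interrupted by an in-frame stop, as in Part B, would appear as two adjacent entries in the returned list.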
Figure 3.
This two-panel plot reports the analysis of the six model organisms used during the parameterisation of StORF-Reporter. Panel A reports the distributions of the Ensembl-annotated CDS gene overlap lengths for each model organism, with a dotted red line representing the overall median value of 4 nt and the x-axis truncated at 100 nt. Panel B reports the distance in nucleotides between an Ensembl gene’s start codon and the first in-frame upstream stop codon for each model organism, with the x-axis truncated at 500 nt. The dotted red line represents the overall median value of 39 nt. This analysis indicates that an extension of 50 nt at each end of the extracted unannotated regions is large enough to capture both the true overlap between an annotated gene and the putative gene identified by a StORF, and the small amount of upstream non-coding DNA that the StORF will contain.
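The second measurement described in this caption, the distance from a start codon back to the first in-frame upstream stop, could be computed along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, only forward-strand genes are handled, and the distance is defined here as the number of nucleotides between the end of the stop codon and the first base of the start codon, which may differ from the definition used for the figure.

```python
from statistics import median

STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_stop_distance(seq, gene_start):
    """Nucleotides between the nearest in-frame upstream stop and the start codon.

    gene_start is the 0-based index of the first base of the start codon.
    Returns None if no in-frame stop codon is found before the sequence start.
    """
    seq = seq.upper()
    # Step backwards one codon at a time, staying in the gene's reading frame.
    for i in range(gene_start - 3, -1, -3):
        if seq[i:i + 3] in STOP_CODONS:
            return gene_start - (i + 3)
    return None


if __name__ == "__main__":
    toy = "TAA" + "GCT" * 13 + "ATG" + "GCT" * 5
    distances = [upstream_stop_distance(toy, 42)]
    print(distances, median(distances))
```

Collecting this value for every annotated gene and taking the median is the kind of summary that a padding size can be checked against: the extension needs to exceed the typical distance for most StORFs to reach their opening stop codon.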
Figure 4.
Clustal Omega multiple sequence alignment of the two Ensembl representative sequences, VED12192 and OKB89195, which were combined in E. coli pangenome cluster #13 287 by the additional StORF sequence. The two were combined because the StORF sequence extends past where VED12192 begins and on to where OKB89195 ends. On their own, the two sequences do not align over a long enough stretch to satisfy the clustering parameters, which is why they originally formed independent clusters. They were also annotated as different proteins by Ensembl.
