Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Feb 7;38(5):1198-1207.
doi: 10.1093/bioinformatics/btab827.

No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study

Affiliations

No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study

Nicholas J Dimonaco et al. Bioinformatics. .

Abstract

Motivation: The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any CDS prediction tool and allow them to choose the right tool for their analysis.

Results: We present an evaluation framework (ORForise) based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of CDS prediction tools. This makes it possible to identify which performs better for specific use-cases. We use this to assess 15 ab initio- and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections, which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations.

Availability and implementation: Code and datasets for reproduction and customisation are available at https://github.com/NickJD/ORForise.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Illustration of how predicted CDSs are classified as having detected or not detected the CEA genes. Predicted CDSs are compared to the genes held in Ensembl. (A) The predicted CDS covers at least 75% and is in-frame with Ensembl gene and therefore it is recorded as detected. (B) The predicted CDS covers <75% of the Ensembl gene and therefore is recorded as not detected. (C) The predicted CDS covers part of an Ensembl gene but is out of frame (dotted outline) and therefore is recorded as missed. (D) The use of alternative stop codons causes the predicted CDS to be truncated or divided into two CDSs that span the Ensembl genes and therefore is recorded as missed
Fig. 2.
Fig. 2.
The result of all 15 gene prediction tools (21 with chosen models) on the 6 MO genomes, ordered by the summed ranks across the 12 metrics. The Y axis represents the Percentage of Genes Detected (M1) by each tool in black and the Percentage of Perfect Matches (M5) in white. M5, which represents the ability for a tool to detect the correct start codon, has more variance between the tools than M1. Each column on the X axis represents a different tool (some model-based tools were run multiple times). There is considerable variation in how well each tool performs across the different genomes, while all tools perform relatively poorly on the M.genitalium genome

References

    1. Al-Turaiki, Israa M., et al. "Computational approaches for gene prediction: a comparative survey." International Conference on Informatics Engineering and Information Science. Springer, Berlin, Heidelberg, 2011.
    1. Andrews S.J., Rothnagel J.A. (2014) Emerging evidence for functional peptides encoded by short open reading frames. Nat. Rev. Genet., 15, 193–204. - PubMed
    1. Badger J.H., Olsen G.J. (1999) CRITICA: coding region identification tool invoking comparative analysis. Mol. Biol. Evol., 16, 512–524. - PubMed
    1. Baranov P.V. et al. (2015) Augmented genetic decoding: global, local and temporal alterations of decoding processes and codon meaning. Nat. Rev. Genet., 16, 517–529. - PubMed
    1. Bartholomäus A. et al. (2021) smORFer: a modular algorithm to detect small ORFs in prokaryotes. Nucleic Acids Res., 49, e89. - PMC - PubMed

Publication types