Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes
- PMID: 33429904
- PMCID: PMC7827183
- DOI: 10.3390/microorganisms9010129
Abstract
One of the main steps in gene-finding in prokaryotes is determining which open reading frames encode a protein and which occur by chance alone. There are many different methods to differentiate the two; the most prevalent approach relies on homology with a database of known genes. This method presents many pitfalls, most notably that it can only find genes resembling those that have been seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, PHANOTATE) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within a set of all possible open reading frames (ORFs). Our workflow takes the amino acid frequencies of each ORF, calculates an entropy density profile (EDP) from them, clusters the EDPs with KMeans, and then selects the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS had the highest accuracy (0.94) and the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).
Keywords: annotation; clustering; gene; genome; machine learning; phage; prediction.
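The workflow described in the abstract (amino acid frequencies, entropy density profile, KMeans clustering, selection of the lowest-variation cluster) can be illustrated with a minimal Python sketch. This is not the GOODORFS implementation. Several details are assumptions that the abstract does not specify: the EDP is taken to be each amino acid's share of the Shannon entropy of the frequency vector, the number of clusters `k` is a free parameter, "lowest variation" is read as the smallest mean squared distance to the cluster centroid, and clusters with fewer than two members are skipped. The function names are hypothetical, and a small hand-written k-means stands in for a library implementation.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def entropy_density_profile(protein):
    """Return a 20-dimensional EDP for a translated ORF.

    Assumed definition: s_i = -p_i * log(p_i) / H, where p_i is the
    frequency of amino acid i and H is the Shannon entropy of p.
    """
    counts = [protein.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    freqs = [c / total for c in counts]
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:
        # Only one residue type present: the profile is undefined,
        # so return all zeros in this sketch.
        return [0.0] * len(AMINO_ACIDS)
    return [t / entropy for t in terms]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_variation(members):
    """Mean squared distance of the members to their centroid."""
    n = len(members)
    center = [sum(col) / n for col in zip(*members)]
    return sum(sq_dist(p, center) for p in members) / n


def select_coding_orfs(proteins, k=3, seed=0):
    """Return indices of the ORFs in the lowest-variation cluster."""
    profiles = [entropy_density_profile(p) for p in proteins]
    labels = kmeans(profiles, k, seed=seed)
    best_label, best_var = None, float("inf")
    for j in sorted(set(labels)):
        members = [p for p, lab in zip(profiles, labels) if lab == j]
        if len(members) < 2:
            continue  # assumption: singleton clusters are not informative
        variation = cluster_variation(members)
        if variation < best_var:
            best_label, best_var = j, variation
    return [i for i, lab in enumerate(labels) if lab == best_label]


if __name__ == "__main__":
    # Toy sequences for illustration only; real input would be the
    # translations of every potential ORF in a genome.
    toy = ["MKLAVTGEEL", "MKIAVSGDEL", "MRLAVTGEDL", "PPPPWWCCHH", "GGGGSSSSNN"]
    print(select_coding_orfs(toy, k=2))
```

Because each EDP is normalized by the total entropy, its components sum to 1, so ORFs of different lengths are compared on the same scale before clustering.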
Conflict of interest statement
The authors declare no conflict of interest.
Similar articles
- Multivariate entropy distance method for prokaryotic gene identification. J Bioinform Comput Biol. 2004 Jun;2(2):353-73. doi: 10.1142/s0219720004000624. PMID: 15297987
- PHANOTATE: a novel approach to gene identification in phage genomes. Bioinformatics. 2019 Nov 1;35(22):4537-4542. doi: 10.1093/bioinformatics/btz265. PMID: 31329826. Free PMC article.
- [Comprehensive re-annotation of protein-coding genes for prokaryotic genomes by Z-curve and similarity-based methods]. Yi Chuan. 2020 Jul 20;42(7):691-702. doi: 10.16288/j.yczz.20-022. PMID: 32694108. Chinese.
- GeneLook: a novel ab initio gene identification system suitable for automated annotation of prokaryotic sequences. Gene. 2005 Feb 14;346:115-25. doi: 10.1016/j.gene.2004.10.018. Epub 2005 Jan 26. PMID: 15716020
- Reconsidering proteomic diversity with functional investigation of small ORFs and alternative ORFs. Exp Cell Res. 2020 Aug 1;393(1):112057. doi: 10.1016/j.yexcr.2020.112057. Epub 2020 May 6. PMID: 32387289. Review.
Cited by
- Analysis of RNA translation with a deep learning architecture provides new insight into translation control. bioRxiv [Preprint]. 2024 Jul 2:2023.07.08.548206. doi: 10.1101/2023.07.08.548206. Update in: Nucleic Acids Res. 2025 Apr 10;53(7):gkaf277. doi: 10.1093/nar/gkaf277. PMID: 39005319. Free PMC article. Preprint.
- Multiomic Analysis of Environmental Effects and Nitrogen Use Efficiency of Two Potato Varieties Under High Nitrogen Conditions. Plants (Basel). 2025 Feb 20;14(5):633. doi: 10.3390/plants14050633. PMID: 40094559. Free PMC article.
- Special Issue "Bacteriophage Genomics": Editorial. Microorganisms. 2023 Mar 8;11(3):693. doi: 10.3390/microorganisms11030693. PMID: 36985265. Free PMC article.
- MultiPhATE2: code for functional annotation and comparison of phage genomes. G3 (Bethesda). 2021 May 7;11(5):jkab074. doi: 10.1093/g3journal/jkab074. PMID: 33734357. Free PMC article.
- Analysis of RNA translation with a deep learning architecture provides new insight into translation control. Nucleic Acids Res. 2025 Apr 10;53(7):gkaf277. doi: 10.1093/nar/gkaf277. PMID: 40219965. Free PMC article.