Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

Katelyn McNair¹, Carol L Ecale Zhou², Brian Souza³, Stephanie Malfatti³, Robert A Edwards¹,⁴

Affiliations

¹ Computational Sciences Research Center, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182, USA.
² Lawrence Livermore National Laboratory, Global Security Computing Applications, Livermore, CA 94550, USA.
³ Lawrence Livermore National Laboratory, Biological Sciences Research Division, Livermore, CA 94550, USA.
⁴ College of Science and Engineering, Flinders University, Bedford Park, SA 5042, Australia.

PMID: 33429904
PMCID: PMC7827183
DOI: 10.3390/microorganisms9010129

Katelyn McNair et al. Microorganisms. 2021 Jan 8;9(1):129. doi: 10.3390/microorganisms9010129.

Abstract

One of the main steps in prokaryotic gene-finding is determining which open reading frames encode a protein and which occur by chance alone. There are many methods for differentiating the two; the most prevalent relies on homology to a database of known genes. This approach has many pitfalls, most notably that it only finds genes that have been seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, PHANOTATE) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within a set of all possible open reading frames (ORFs). Our workflow takes the amino acid frequencies of each ORF, calculates an entropy density profile (EDP), uses KMeans to cluster the EDPs, and then selects the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).

Keywords: annotation; clustering; gene; genome; machine learning; phage; prediction.
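The workflow summarized in the abstract (amino acid frequencies, EDP, clustering, cluster selection) can be sketched roughly as follows. This is an illustrative sketch, not the GOODORFS implementation. It assumes the EDP definition commonly used by the MED family of gene finders, s_i = -(1/H) p_i log p_i with H = -Σ p_i log p_i, and it measures "variation" as the mean squared distance to the cluster centroid; the abstract specifies neither. The function names `edp`, `kmeans`, and `lowest_variation_cluster` are hypothetical, and a small pure-Python Lloyd's k-means stands in for a library KMeans.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def edp(protein):
    """Entropy density profile of one translated ORF (assumed MED-style definition)."""
    counts = Counter(aa for aa in protein if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Per-residue entropy terms -p*log(p); absent residues contribute 0.
    terms = []
    for aa in AMINO_ACIDS:
        p = counts[aa] / total
        terms.append(-p * math.log(p) if p > 0 else 0.0)
    entropy = sum(terms)
    if entropy == 0:
        # Only one residue type present: the profile is undefined, return zeros.
        return [0.0] * len(AMINO_ACIDS)
    return [t / entropy for t in terms]

def kmeans(points, k, iterations=50, seed=0):
    """Minimal Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def lowest_variation_cluster(points, labels):
    """Label of the cluster with the smallest mean squared distance to its centroid."""
    best_label, best_var = None, float("inf")
    for label in sorted(set(labels)):
        members = [p for p, lab in zip(points, labels) if lab == label]
        center = [sum(col) / len(members) for col in zip(*members)]
        var = sum(math.dist(p, center) ** 2 for p in members) / len(members)
        if var < best_var:
            best_label, best_var = label, var
    return best_label
```

In use, one would compute `edp` for every translated candidate ORF, pass the resulting 20-dimensional vectors to `kmeans`, and keep the ORFs whose label matches `lowest_variation_cluster`. The number of clusters is left as a parameter because the abstract does not state it, and a real pipeline would need to guard against trivially small clusters, which have artificially low variation.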

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Figure 1**
Visualizing the amino acid composition of open reading frames. (A) Comparison of average amino acid occurrence across 14,179 phage genomes. Points correspond to the 20 different amino acids and are labeled according to their International Union of Pure and Applied Chemistry (IUPAC) single-letter abbreviations. The observed frequencies come from the annotated genome consensus gene calls, while the expected frequencies come from the overall codon probabilities calculated from the GC content. Amino acids above the diagonal identity line occur more frequently than expected in coding open reading frames (ORFs), and those below it occur less frequently than expected, which points to a coding bias signal. (B) The averaged amino acid frequencies of coding ORFs change with GC content. The previous consensus calls were averaged for each genome, plotted using principal component analysis (PCA), and colored by the GC content of the genome. The red, low-GC genomes tend to favor the amino acids (FYNKI) with AT-rich codons, while the yellow, high-GC genomes tend to favor the amino acids (PRAGW) with GC-rich codons.
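One plausible way to obtain the "expected" amino acid frequencies in panel A from GC content alone is sketched below. It assumes independent nucleotides with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 − GC)/2, the standard genetic code, and renormalization over the 61 sense codons; the caption does not spell out these details, so they are assumptions, and the function name `expected_aa_frequencies` is hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (third base varies fastest).
BASES = "TCAG"
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AA_STRING)}

def expected_aa_frequencies(gc):
    """Expected amino acid frequencies for a genome with GC fraction `gc` (0..1)."""
    base_p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    freqs = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # stop codons do not contribute to amino acid frequencies
        p = base_p[codon[0]] * base_p[codon[1]] * base_p[codon[2]]
        freqs[aa] = freqs.get(aa, 0.0) + p
    total = sum(freqs.values())  # renormalize over sense codons only
    return {aa: p / total for aa, p in freqs.items()}
```

Under this model, lysine (AAA, AAG) is expected more often in low-GC genomes and glycine (GGN) more often in high-GC genomes, which matches the direction of the trend described in panel B.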

**Figure 2**
Flowchart of the GOODORFS workflow. After supplying GOODORFS with a fasta file that contains the genome in question, the four major steps are finding the ORFs, calculating the EDPs of the ORFs, clustering the EDPs, and choosing the cluster that contains the good (coding) ORFs.
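The first step of the flowchart, finding the ORFs, might look like the sketch below. It scans all six reading frames and, for each stop codon, reports the ORF that begins at the outermost (first) start codon since the previous stop. The start codon set (ATG, GTG, TTG), the `min_codons` length cutoff, and the function names are assumptions made for illustration and are not taken from the GOODORFS source.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed prokaryotic start codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def find_orfs(dna, min_codons=30):
    """Yield (strand, frame, start, end) for the longest ORF ending at each stop.

    Coordinates are 0-based, end-exclusive, and relative to the strand being
    scanned; `end` includes the stop codon.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            first_start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon in STOPS:
                    # Length is counted in codons, excluding the stop codon.
                    if first_start is not None and (i - first_start) // 3 >= min_codons:
                        yield strand, frame, first_start, i + 3
                    first_start = None
                elif codon in STARTS and first_start is None:
                    first_start = i  # keep only the outermost start
```

For example, `list(find_orfs("ATGGCCGCCGCCTAA", min_codons=3))` yields a single forward-strand ORF spanning the whole sequence. Each reported ORF would then be translated and passed on to the EDP calculation.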

**Figure 3**
(A) For the representative genome *Caulobacter* phage, the amino acid frequency EDPs of coding and noncoding ORFs cluster separately. All potential ORFs were found, taking only the longest truncation (i.e., the first outermost available start codon); their amino acid frequencies were computed, colored according to whether they are in the consensus annotations (coding), and plotted in a PCA. (B) The same potential ORFs as in panel A, except with the noncoding ORFs colored according to their offset in relation to the coding frame (0−, 1, 2, 1−, 2−), or intergenic (IG) if they do not overlap with a coding ORF. The projections for the amino acids are labeled according to their single-letter abbreviations, while the three stop codons amber (+), ochre (#), umber (*) are labeled according to their symbols.

**Figure 4**
Two examples of phage genomes where the amino acid frequencies of coding and noncoding ORFs do not follow the general trend of clustering separately. Shown here are unique ORFs for (A) the filamentous phage *Ralstonia* RSM1 and (B) *Escherichia* phage fp01. Other filamentous phages show the same lack of observable coding bias, which could be due to the small genome size; however, it is clear that the *Escherichia* phage does not use the Standard genetic code. The projections for the amino acids are labeled according to their single-letter abbreviations, while the three stop codons amber (+), ochre (#), umber (*) are labeled according to their symbols.

**Figure 5**
Comparison of the F1 scores for all 14,179 genomes between GOODORFS and the training-set creation steps of (A) LONGORFS, (B) MED2, (C) PHANOTATE, and (D) Prodigal. In each panel the dotted line represents x = y, so points above and to the left of the line represent more accurate protein-encoding gene identification by GOODORFS, while points below and to the right of the line indicate less accurate identification. Points on the line indicate agreement between the algorithms.
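For reference, the per-genome scores compared here (and the accuracy, precision, and recall quoted in the abstract) follow the standard definitions and could be computed as in the sketch below. Treating ORFs as hashable identifiers and requiring the full candidate set `all_orfs` are choices made for this illustration, and the function name `classification_scores` is hypothetical.

```python
def classification_scores(predicted, actual, all_orfs):
    """Precision, recall, F1, and accuracy for one genome's coding-ORF calls.

    predicted: ORFs the method labeled as coding
    actual:    ORFs annotated as coding
    all_orfs:  every candidate ORF considered (needed for true negatives)
    """
    predicted, actual, all_orfs = set(predicted), set(actual), set(all_orfs)
    tp = len(predicted & actual)             # coding, called coding
    fp = len(predicted - actual)             # noncoding, called coding
    fn = len(actual - predicted)             # coding, missed
    tn = len(all_orfs - predicted - actual)  # noncoding, correctly rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Because F1 is the harmonic mean of precision and recall, a method can lead on one of the two and still trail on F1, which is consistent with Glimmer having the highest precision and PHANOTATE the highest recall while GOODORFS has the best F1-score.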
