BMC Bioinformatics. 2007 Feb 8;8:47.
doi: 10.1186/1471-2105-8-47.

Hon-yaku: a biology-driven Bayesian methodology for identifying translation initiation sites in prokaryotes

Yuko Makita et al. BMC Bioinformatics. 2007.

Abstract

Background: Computational prediction methods are currently used to identify genes in prokaryote genomes. However, identifying the correct translation initiation sites remains a difficult task. Accurate translation initiation sites (TISs) are important not only for the annotation of unknown proteins but also for the prediction of operons, promoters, and small non-coding RNA genes, since these predictions typically rely on intergenic distances. A further problem is that most existing methods are optimized for Escherichia coli data sets; applying them to newly sequenced bacterial genomes may not yield an equivalent level of accuracy.

Results: Based on a biological representation of the translation process, we applied Bayesian statistics to create a score function for predicting translation initiation sites. In contrast to existing programs, our combination of methods uses supervised learning to make full use of the set of known translation initiation sites. We combined five features in a Bayesian scoring function: the Ribosome Binding Site (RBS) sequence, the distance between the translation initiation site and the RBS sequence, the base composition of the start codon, the nucleotide composition (A-rich sequences) following start codons, and the expected distribution of the protein length. To further increase the prediction accuracy, we also took into account the operon orientation. The procedure achieved a prediction accuracy of 93.2% on 858 E. coli genes from the EcoGene data set and 92.7% on a data set of 1243 Bacillus subtilis 'non-y' genes. We confirmed the performance in the GC-rich Gamma-Proteobacteria Herminiimonas arsenicoxydans, Pseudomonas aeruginosa, and Burkholderia pseudomallei K96243.

Conclusion: Hon-yaku, being based on a careful choice of elements important in translation, improved the prediction accuracy in B. subtilis data sets and in other bacteria, but not in E. coli. We believe that most remaining mispredictions are due to atypical ribosomal binding sequences used in specific translation control processes, or to likely errors in the training data sets.
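
To make the idea of a supervised Bayesian score function more concrete, the following is a minimal sketch, not the Hon-yaku implementation. It assumes a naive-Bayes-style log-odds score over only two of the features named in the Results (start codon identity and RBS-to-start distance), treats the features as independent, and uses add-one smoothing. The function names, the dictionary layout of a candidate site, and the range of distances are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

START_CODONS = ("ATG", "GTG", "TTG")   # assumed candidate start codons
SPACERS = tuple(range(0, 21))          # assumed RBS-to-start distances (nt)

def fit_categorical(values, categories, pseudocount=1.0):
    """Estimate smoothed category probabilities from observed values."""
    counts = Counter(values)
    total = len(values) + pseudocount * len(categories)
    return {c: (counts[c] + pseudocount) / total for c in categories}

def train(true_sites, false_sites):
    """Learn per-feature log-odds from annotated true and false start sites.

    Each site is a dict with 'codon' and 'spacer' keys (hypothetical layout).
    """
    model = {}
    for name, cats in (("codon", START_CODONS), ("spacer", SPACERS)):
        p_true = fit_categorical([s[name] for s in true_sites], cats)
        p_false = fit_categorical([s[name] for s in false_sites], cats)
        model[name] = {c: math.log(p_true[c] / p_false[c]) for c in cats}
    return model

def score(model, site):
    """Sum the log-odds of each feature; unseen values contribute 0."""
    return sum(model[name].get(site[name], 0.0) for name in model)

def predict(model, candidates):
    """Return the candidate start site with the highest score."""
    return max(candidates, key=lambda s: score(model, s))

# Toy, made-up training data (not taken from the paper).
true_sites = [{"codon": "ATG", "spacer": 7}] * 8 + [{"codon": "GTG", "spacer": 8}] * 2
false_sites = [{"codon": "TTG", "spacer": 15}] * 6 + [{"codon": "ATG", "spacer": 2}] * 4
model = train(true_sites, false_sites)

candidates = [
    {"codon": "ATG", "spacer": 7, "pos": 100},
    {"codon": "TTG", "spacer": 15, "pos": 130},
]
best = predict(model, candidates)
```

Summing log-odds keeps each feature's contribution separate and inspectable, so further features from the abstract (A-rich downstream composition, protein length ratio) could be added as extra terms under the same independence assumption.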

Figures

Figure 1
Relationship between the size of training data set and the accuracy. The x-axis shows the size of the training data set. The leftmost data point corresponds to the leave-one-out analysis based on the full data set of 857 genes in E. coli and 1242 genes in B. subtilis. For the other data points, we created the training data set of the given size by randomly selecting genes from the full data set.
Figure 2
Distance distribution from the end of RBS sequence to the translation initiation sites.
Figure 3
Distribution of protein length ratio.
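
The evaluation behind Figure 1 (leave-one-out on the full data set, plus randomly selected training subsets of a given size) can be sketched as below. This is an illustrative outline under stated assumptions: `train_fn` and `predict_fn` are hypothetical placeholders for any trainable predictor, examples are `(features, label)` pairs, and the subsampled models are assumed to be evaluated on the genes left out of the training subset, which the caption does not specify.

```python
import random
from collections import Counter

def leave_one_out_accuracy(examples, train_fn, predict_fn):
    """Train on all examples but one, test on the one left out, and repeat."""
    correct = 0
    for i, (x, y) in enumerate(examples):
        model = train_fn(examples[:i] + examples[i + 1:])
        correct += predict_fn(model, x) == y
    return correct / len(examples)

def subsample_accuracy(examples, size, train_fn, predict_fn, seed=0):
    """Train on a random subset of `size` examples and test on the rest.

    Requires size < len(examples) so that the held-out set is not empty.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(examples)), size))
    training = [e for i, e in enumerate(examples) if i in chosen]
    held_out = [e for i, e in enumerate(examples) if i not in chosen]
    model = train_fn(training)
    correct = sum(predict_fn(model, x) == y for x, y in held_out)
    return correct / len(held_out)

# Toy stand-in predictor: always predict the most common training label.
def train_majority(data):
    return Counter(y for _, y in data).most_common(1)[0][0]

def predict_majority(model, x):
    return model

toy = [(i, "ATG") for i in range(9)] + [(9, "GTG")]
loo = leave_one_out_accuracy(toy, train_majority, predict_majority)
sub = subsample_accuracy(toy, 5, train_majority, predict_majority)
```

Repeating `subsample_accuracy` over several sizes and seeds gives the points of a learning curve like the one the caption describes.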
