DNA Res. 2008 Dec;15(6):387-96.
doi: 10.1093/dnares/dsn027. Epub 2008 Oct 21.

MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes

Hideki Noguchi et al. DNA Res. 2008 Dec.

Abstract

Recent advances in DNA sequencers are accelerating genome sequencing, especially in microbes, and complete and draft genomes from various species have been sequenced in rapid succession. Here, we present a comprehensive gene prediction tool, the MetaGeneAnnotator (MGA), which precisely predicts all kinds of prokaryotic genes from a single anonymous genomic sequence or a set of such sequences of varying lengths. The MGA integrates statistical models of prophage genes, in addition to those of bacterial and archaeal genes, and also uses a self-training model built from the input sequences for predictions. As a result, the MGA sensitively detects not only typical genes but also atypical genes, such as horizontally transferred and prophage genes, in a prokaryotic genome. In this paper, we also propose a novel approach for analyzing the ribosomal binding site (RBS), which enables us to detect species-specific RBS patterns. The MGA includes an RBS model based on this approach and precisely predicts the translation starts of genes. By using the adapted RBS models, the MGA also improves prediction accuracy for short sequences (96% sensitivity and 93% specificity for 700 bp fragments). These features of the MGA expedite a wide range of microbial genome studies, such as genome annotation and metagenome analysis.
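The RBS analysis mentioned in the abstract concerns motifs located upstream of the start codon. The Figure 3 caption states that the relative position plotted on the horizontal axis equals –(spacer length+1). The following is a minimal sketch of that conversion only, not the MGA's RBS model. The function names, the toy sequence, the `AGGAGG` motif, and the convention that the spacer is the number of bases between the last base of the motif and the first base of the start codon are all assumptions made for illustration.

```python
def spacer_length(motif_end: int, start_codon: int) -> int:
    """Number of bases between an RBS motif and the start codon.

    motif_end   -- 0-based index one past the last base of the motif (assumed convention)
    start_codon -- 0-based index of the first base of the start codon
    """
    if motif_end > start_codon:
        raise ValueError("motif must lie upstream of the start codon")
    return start_codon - motif_end


def rbs_relative_position(spacer: int) -> int:
    """Relative position from the start codon, as given in the Figure 3 caption."""
    if spacer < 0:
        raise ValueError("spacer length cannot be negative")
    return -(spacer + 1)


# Toy example (hypothetical sequence, not taken from the paper).
seq = "TTAGGAGGTACCTTATGAAA"
motif = "AGGAGG"

motif_start = seq.find(motif)
motif_end = motif_start + len(motif)
start_codon = seq.find("ATG", motif_end)

spacer = spacer_length(motif_end, start_codon)
position = rbs_relative_position(spacer)
print(spacer, position)  # prints "6 -7"
```

Under the assumed convention, the last base of the motif sits 7 bases upstream of the first base of the start codon, which is the value the caption's formula gives for a 6-base spacer.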

Figures

Figure 1. A schematic diagram of the MGA algorithm. (A) Prediction protocol of the MGA. (B) ORF-by-ORF procedure.
Figure 2. Statistics of prophage genes. (A) Frequency distributions of gene lengths in prokaryote and prophage. (B) Proportions of the consecutive gene arrangements in prokaryote and prophage.
Figure 3. The average RBS map. The horizontal axis represents relative positions from the start codons [equal to –(spacer length+1)], and the vertical axis represents motif numbers.
Figure 4. The clustering result of the RBS maps derived from 229 of 591 prokaryotic genomes (one species per genus).
Figure 5. The RBS maps for four species. (A) Helicobacter pylori (B) Clostridium acetobutylicum (C) Methanobrevibacter smithii (D) Prochlorococcus marinus.
Figure 6. Prediction performances of gene finders on the MetaGene dataset. (A) Accuracy comparisons of the MGA, GeneMarkS and Glimmer3. In the Glimmer3 prediction, a script ‘g3-iterated.csh’ is used. (B) Accuracy comparisons of the MGA and MG. In the MGA prediction, two different running options, which treat multiple input sequences individually (MGA) or as a unit (MGA-s), are used. (C) Relationship between accuracies and number of 40 kb-sequences in the MGA-s prediction. Sn, exact and Sp indicate sensitivity, sensitivity to start codons and specificity, respectively.
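
The three accuracy measures named in the Figure 6 caption (Sn, exact and Sp) can be illustrated with a small sketch. This page does not give the exact definitions, so the ones below are assumptions that follow a convention common in gene-finder evaluations: a prediction matches an annotated gene when strand and stop position agree, Sn is the fraction of annotated genes matched, Sp is the fraction of predicted genes that match an annotated gene, and exact is the fraction of annotated genes whose start is also predicted correctly. The `Gene` type, the `evaluate` function, and the coordinates are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Gene(NamedTuple):
    start: int   # position of the first base of the start codon
    stop: int    # position of the last base of the stop codon
    strand: str  # "+" or "-"


def evaluate(annotated: List[Gene], predicted: List[Gene]) -> Dict[str, float]:
    """Return Sn, exact and Sp under the assumed definitions described above."""
    if not annotated or not predicted:
        raise ValueError("both gene lists must be non-empty")

    # Index genes by their 3' end (stop position and strand).
    ann: Dict[Tuple[int, str], Gene] = {(g.stop, g.strand): g for g in annotated}
    pred: Dict[Tuple[int, str], Gene] = {(g.stop, g.strand): g for g in predicted}

    matched = ann.keys() & pred.keys()
    exact_hits = sum(1 for key in matched if ann[key].start == pred[key].start)

    return {
        "Sn": len(matched) / len(ann),
        "exact": exact_hits / len(ann),
        "Sp": len(matched) / len(pred),
    }


# Toy data (hypothetical coordinates, not from the MetaGene dataset).
annotated = [Gene(10, 310, "+"), Gene(400, 700, "+"),
             Gene(800, 1100, "-"), Gene(1200, 1500, "+")]
predicted = [Gene(10, 310, "+"),      # stop and start both correct
             Gene(430, 700, "+"),     # stop correct, start wrong
             Gene(800, 1100, "-"),    # stop and start both correct
             Gene(2000, 2300, "+")]   # no annotated counterpart

scores = evaluate(annotated, predicted)
print(scores)  # prints {'Sn': 0.75, 'exact': 0.5, 'Sp': 0.75}
```

Other evaluations define the start-codon measure relative to matched genes rather than all annotated genes. Swapping the denominator of `exact` is a one-line change.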
