Genome Res. 2009 Nov;19(11):2133-43.
doi: 10.1101/gr.090597.108. Epub 2009 Jun 29.

mGene: accurate SVM-based gene finding with an application to nematode genomes

Gabriele Schweikert et al. Genome Res. 2009 Nov.

Abstract

We present a highly accurate gene-prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models (gHMMs) with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans. Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance on nucleotide, exon, and transcript level for ab initio and multiple-genome gene-prediction tasks. The fully developed version shows superior performance in 10 out of 12 evaluation criteria compared with the other participating gene finders, including Fgenesh++ and Augustus. An in-depth analysis of mGene's genome-wide predictions revealed that approximately 2200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica, and C. remanei. Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.

Figures

Figure 1.
Improvement of mGene.init ab initio predictions on several evaluation levels: (A) nucleotide, (B) exon, (C) transcript, and (D) gene (each restricted to coding regions), as well as on selected signals: (E) acceptor splice sites, (F) donor splice sites, (G) translation initiation sites (TIS), (H) translation termination sites, (I) transcription start sites (TSS) (±20 nt), and (J) cleavage sites (±20 nt). mGene.init's predictions are compared with the predictions of the best submissions in category 1: Craig, Eugene, Fgenesh, and Augustus. Shown are differences of percent values for sensitivity (Sn; blue), specificity (Sp; green), and their average (red). Note that Craig and Fgenesh are not able to predict UTRs. We therefore used the predicted translation start and stop as an estimate of gene start and stop (relevant results are marked with an asterisk).
Figure 2.
Comparison of the gene sets predicted by mGene.seq for different nematodes. (A) Number of protein-coding genes predicted for each organism and the fraction of genes with one-to-one orthologs, with other orthologs, with weak protein sequence similarity, and with no significant protein sequence similarity. (B) Agreement of internal exons inferred from aligned EST sequences with exons predicted by mGene.seq, Fgenesh++, Augustus, and Jigsaw. We counted a predicted exon as correct if both boundaries were correct, and as a false prediction if it overlapped a region covered by an EST alignment but did not exactly match an EST-confirmed exon. Shown is the average of sensitivity and specificity. (C) Number of orthologous groups (9885) shared among all five nematodes, as well as the number of additional orthologous groups shared across subtrees of more closely related species, which are defined by the corresponding ancestral node.
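The exon-level counting rule in panel (B) can be illustrated with a small sketch. It is not the authors' evaluation code: the function name, the half-open coordinate convention, and the input layout are assumptions made for illustration. It also assumes the usual gene-finding convention in which "specificity" is the fraction of counted predictions that are correct, TP / (TP + FP).

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end); half-open coordinates are an assumption


def overlaps(a: Exon, b: Exon) -> bool:
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def exon_level_accuracy(predicted: List[Exon],
                        confirmed: List[Exon],
                        covered: List[Exon]) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, average) at the exon level.

    predicted: exons output by a gene finder
    confirmed: EST-confirmed internal exons
    covered:   regions covered by EST alignments
    """
    confirmed_set = set(confirmed)
    predicted_set = set(predicted)
    # Correct: both boundaries match an EST-confirmed exon exactly.
    tp = sum(1 for e in predicted_set if e in confirmed_set)
    # False: overlaps EST-covered sequence without an exact match.
    # Predictions outside EST coverage are not counted either way.
    fp = sum(1 for e in predicted_set
             if e not in confirmed_set and any(overlaps(e, r) for r in covered))
    sn = tp / len(confirmed_set) if confirmed_set else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, sp, (sn + sp) / 2
```

Predictions that fall entirely outside EST-covered regions are ignored here, since the caption only defines correct and false predictions relative to EST evidence.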
Figure 3.
In layer 1, mGene scans the genomic sequence using SVM-based detectors trained to recognize transcription start sites (TSS), translation initiation sites (TIS), acceptor (Acc) and donor (Don) splice sites, the translation termination site (Stop), and other signals (data not shown). The detectors assign a score to each candidate site. In combination with additional information, including outputs of SVMs recognizing exon/intron content, and scores for exon/intron lengths (data not shown), these signal scores contribute to the cumulative score of a putative gene structure. The bottom graph (layer 2) illustrates the accumulation of scores for two gene structures shown at the top, where the score at the end of the sequence is the final score of the gene structure. The contributions from the individual detector outputs, from segment lengths, as well as from properties of the segments to the score are adjusted during training using piecewise linear functions (PLiFs; see inset to the right). They are optimized such that the margin between the true gene structure (shown in green) and all other (false) isoforms (one of them is shown in red) is maximized. Prediction of genes on new sequences works by selecting a valid gene structure, as defined by the gene model (cf. inset to the left), with the maximum cumulative score using dynamic programming (see, e.g., Kulp et al. 1996).
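The layer-2 scoring and decoding described in this caption can be sketched as a toy dynamic program. This is a simplified illustration under stated assumptions, not mGene's implementation: the PLiF support points and values below are made up (in mGene they are learned by margin maximization, which is not shown), the gene model is reduced to four signal types, and content sensors, UTRs, and reading-frame constraints are ignored. All names (`plif`, `best_structure`, `SIGNAL_PLIF`, `LENGTH_PLIF`, `NEXT`) are hypothetical.

```python
from bisect import bisect_right


def plif(xs, ys, x):
    """Piecewise linear function through (xs[k], ys[k]); constant outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    k = bisect_right(xs, x)
    t = (x - xs[k - 1]) / (xs[k] - xs[k - 1])
    return ys[k - 1] + t * (ys[k] - ys[k - 1])


# Hypothetical PLiFs (assumed values; learned during training in the real system).
SIGNAL_PLIF = ([-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0])
LENGTH_PLIF = {
    "exon": ([0, 50, 300, 1000], [-3.0, 0.0, 1.0, 0.0]),
    "intron": ([0, 40, 500], [-3.0, 0.0, 0.0]),
}
# Toy gene model: which signal may follow which.
NEXT = {"TIS": ("Don", "Stop"), "Acc": ("Don", "Stop"), "Don": ("Acc",)}


def best_structure(sites):
    """sites: list of (position, signal type, detector score).

    Returns (cumulative score, list of sites on the best TIS-to-Stop path).
    """
    sites = sorted(sites)
    neg_inf = float("-inf")
    best = [neg_inf] * len(sites)   # best cumulative score ending at site i
    back = [None] * len(sites)      # predecessor of site i on that path
    for i, (pos, kind, det) in enumerate(sites):
        own = plif(*SIGNAL_PLIF, det)
        if kind == "TIS":           # a structure may start at any TIS
            best[i] = own
        for j in range(i):
            ppos, pkind, _ = sites[j]
            if best[j] == neg_inf or ppos == pos or kind not in NEXT.get(pkind, ()):
                continue
            seg = "intron" if pkind == "Don" else "exon"
            cand = best[j] + plif(*LENGTH_PLIF[seg], pos - ppos) + own
            if cand > best[i]:
                best[i], back[i] = cand, j
    ends = [i for i, s in enumerate(sites) if s[1] == "Stop" and best[i] > neg_inf]
    if not ends:
        return None, []
    i = max(ends, key=best.__getitem__)
    total, path = best[i], []
    while i is not None:            # trace the best path back to its TIS
        path.append(sites[i])
        i = back[i]
    return total, path[::-1]
```

For example, `best_structure([(0, "TIS", 0.9), (100, "Don", -0.9), (120, "Don", 0.8), (200, "Acc", 0.7), (350, "Stop", 0.5)])` selects the two-exon structure that uses the donor at position 120 over both the low-scoring donor at 100 and the single-exon alternative.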
