PLoS Comput Biol. 2007 Mar 16;3(3):e54.

doi: 10.1371/journal.pcbi.0030054. Epub 2007 Feb 2.

Global discriminative learning for higher-accuracy computational gene prediction

Axel Bernal¹, Koby Crammer, Artemis Hatzigeorgiou, Fernando Pereira

PMID: 17367206
PMCID: PMC1828702
DOI: 10.1371/journal.pcbi.0030054

Axel Bernal et al. PLoS Comput Biol. 2007.

Affiliation

¹ Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America. abernal@seas.upenn.edu

Abstract

Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.
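As a rough illustration of the kind of model the abstract describes, the sketch below decodes a segmentation with a semi-Markov Viterbi pass and trains all weights jointly with an online large-margin update. Everything in it is an assumption made for illustration: the two-label state set, the `MAX_LEN` cap, the toy features, the toy data in which G/C bases are exonic and A/T bases are intronic, and the passive-aggressive-style update. It does not reproduce CRAIG's actual state model, feature set, or training algorithm.

```python
from collections import defaultdict

LABELS = ("exon", "intron")  # toy label set (assumption), not CRAIG's state model
MAX_LEN = 4                  # toy cap on segment length (assumption)

def features(seq, start, end, label, prev):
    """Sparse features for labelling seq[start:end] as `label` after `prev`."""
    f = defaultdict(float)
    for base in seq[start:end]:
        f[("content", label, base)] += 1.0   # content-like feature, one per base
    f[("transition", prev, label)] += 1.0    # signal-like feature at the boundary
    return f

def score(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def decode(w, seq):
    """Semi-Markov Viterbi: highest-scoring list of (start, end, label) segments."""
    n = len(seq)
    # best[pos][label] = (score, (segment start, previous label))
    best = [dict() for _ in range(n + 1)]
    best[0][None] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_LEN), end):
            for prev, (prev_score, _) in best[start].items():
                for label in LABELS:
                    if label == prev:
                        continue  # adjacent segments must change label
                    s = prev_score + score(w, features(seq, start, end, label, prev))
                    if label not in best[end] or s > best[end][label][0]:
                        best[end][label] = (s, (start, prev))
    label = max(best[n], key=lambda l: best[n][l][0])
    segs, pos = [], n
    while pos > 0:  # follow backpointers from the end of the sequence
        _, (start, prev) = best[pos][label]
        segs.append((start, pos, label))
        pos, label = start, prev
    return segs[::-1]

def global_features(seq, segs):
    """Sum of segment features over a whole segmentation."""
    total, prev = defaultdict(float), None
    for start, end, label in segs:
        for k, v in features(seq, start, end, label, prev).items():
            total[k] += v
        prev = label
    return total

def per_base(segs):
    return [label for start, end, label in segs for _ in range(end - start)]

def update(w, seq, gold):
    """One passive-aggressive-style step; returns the per-base Hamming loss."""
    pred = decode(w, seq)
    loss = sum(g != p for g, p in zip(per_base(gold), per_base(pred)))
    if loss == 0:
        return 0
    fg, fp = global_features(seq, gold), global_features(seq, pred)
    delta = {k: fg.get(k, 0.0) - fp.get(k, 0.0) for k in set(fg) | set(fp)}
    norm2 = sum(v * v for v in delta.values())
    if norm2 == 0.0:
        return loss  # no feature distinguishes gold from prediction
    # Smallest step that makes gold outscore the prediction by at least `loss`.
    tau = max(0.0, loss - score(w, delta)) / norm2
    for k, v in delta.items():
        w[k] = w.get(k, 0.0) + tau * v
    return loss

def train(data, max_epochs=1000):
    w = {}
    for _ in range(max_epochs):
        if sum(update(w, seq, gold) for seq, gold in data) == 0:
            break  # every training segmentation is reproduced
    return w

# Toy training set (assumption): G/C bases are exonic, A/T bases are intronic.
DATA = [
    ("GCGATTAGC", [(0, 3, "exon"), (3, 7, "intron"), (7, 9, "exon")]),
    ("ATGGCCTA", [(0, 2, "intron"), (2, 6, "exon"), (6, 8, "intron")]),
    ("CGTATGC", [(0, 2, "exon"), (2, 5, "intron"), (5, 7, "exon")]),
]
w = train(DATA)
for seq, gold in DATA:
    print(seq, decode(w, seq))
```

The point of the sketch is the contrast drawn in the abstract: content and transition weights are adjusted together, driven by errors in the whole predicted segmentation, rather than being estimated as separate submodels and combined afterwards.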

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

**Figure 1. Learning Methods: Discriminative versus Generative**
Schematic comparison of discriminative (A) and generative (B) learning methods. In the discriminative case, all model parameters were estimated simultaneously to predict a segmentation as similar as possible to the annotation. In contrast, for generative HMM models, signal features and state features were assumed to be independent and trained separately.

**Figure 3. F-Score versus Intron Length for the ENCODE Test Set**
Results in subfigures (A) and (B) correspond to the subset of alternatively spliced genes and its complementary subset, respectively.

**Figure 4. Signal Accuracy Improvements**
CRAIG's relative improvements in prediction specificity (orange bar) and sensitivity (blue bar) by signal type. In each case, the second-best program was used for the comparison: Genezilla for starts, Augustus for stops, and GenScan++ for splice sites.

**Figure 5. Finite-State Model for Eukaryotic Genes**
Variable-length genomic regions are represented by states, and biological signals are represented by transitions between states. Short and long introns are denoted by I^S and I^L, respectively.
