Global discriminative learning for higher-accuracy computational gene prediction
- PMID: 17367206
- PMCID: PMC1828702
- DOI: 10.1371/journal.pcbi.0030054
Abstract
Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVMs) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.
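To make the two ingredients named in the abstract concrete, the following is a minimal sketch of (1) decoding with a semi-Markov linear model, where whole segments rather than single positions are scored, and (2) one online large-margin update of the weights. It is not CRAIG's implementation: the two-label set, the toy GC/AT count features, the segment-length cap, and the passive-aggressive style step size are all assumptions chosen for illustration, and the abstract does not specify the exact update rule.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segment = Tuple[int, int, str]   # (start, end_exclusive, label)
LABELS = ("exon", "intron")      # toy label set (assumption, not CRAIG's state space)
MAX_LEN = 6                      # toy cap on segment length (assumption)
START = "<s>"                    # pseudo-label before the first segment


def seg_features(seq: str, start: int, end: int, label: str, prev: str) -> Dict[str, float]:
    """Hypothetical segment-level features: base composition plus a transition."""
    gc = sum(ch in "GC" for ch in seq[start:end])
    return {
        f"gc|{label}": float(gc),
        f"at|{label}": float(end - start - gc),
        f"trans|{prev}>{label}": 1.0,
    }


def path_features(seq: str, segs: List[Segment]) -> Dict[str, float]:
    """Sum of segment features over a whole segmentation (the global feature vector)."""
    total: Dict[str, float] = defaultdict(float)
    prev = START
    for start, end, label in segs:
        for k, v in seg_features(seq, start, end, label, prev).items():
            total[k] += v
        prev = label
    return total


def dot(w: Dict[str, float], f: Dict[str, float]) -> float:
    return sum(w.get(k, 0.0) * v for k, v in f.items())


def decode(seq: str, w: Dict[str, float]) -> List[Segment]:
    """Semi-Markov Viterbi: best[i][y] is the best score of seq[:i] ending in a y-segment."""
    n = len(seq)
    best: List[Dict[str, tuple]] = [{} for _ in range(n + 1)]
    best[0][START] = (0.0, None)
    for end in range(1, n + 1):
        for label in LABELS:
            for start in range(max(0, end - MAX_LEN), end):
                for prev, (prev_score, _) in best[start].items():
                    if prev == label:
                        continue  # adjacent segments must carry different labels
                    s = prev_score + dot(w, seg_features(seq, start, end, label, prev))
                    if label not in best[end] or s > best[end][label][0]:
                        best[end][label] = (s, (start, prev))
    label = max(best[n], key=lambda y: best[n][y][0])
    segs: List[Segment] = []
    end = n
    while end > 0:  # follow backpointers to recover the segmentation
        start, prev = best[end][label][1]
        segs.append((start, end, label))
        end, label = start, prev
    return segs[::-1]


def hamming(a: List[Segment], b: List[Segment]) -> int:
    """Per-position label disagreement between two segmentations of the same sequence."""
    la = [lab for s, e, lab in a for _ in range(e - s)]
    lb = [lab for s, e, lab in b for _ in range(e - s)]
    return sum(x != y for x, y in zip(la, lb))


def update(w: Dict[str, float], seq: str, gold: List[Segment], C: float = 1.0) -> int:
    """One passive-aggressive style step: push the gold segmentation above the
    current prediction by a margin equal to the loss (prediction-based, not
    loss-augmented, decoding is used here for simplicity)."""
    pred = decode(seq, w)
    loss = hamming(gold, pred)
    if loss == 0:
        return 0
    fg, fp = path_features(seq, gold), path_features(seq, pred)
    diff = {k: fg.get(k, 0.0) - fp.get(k, 0.0) for k in set(fg) | set(fp)}
    norm = sum(v * v for v in diff.values())
    if norm > 0.0:
        tau = min(C, max(0.0, loss - dot(w, diff)) / norm)
        for k, v in diff.items():
            w[k] = w.get(k, 0.0) + tau * v
    return loss
```

Because the update compares global feature vectors of complete segmentations, all weights are adjusted jointly against the final annotation rather than being fit piecewise, which is the contrast with separately trained signal and content models that the abstract draws. In practice one would loop `update` over many annotated sequences for several passes and typically average the weight vectors.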