In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists
- PMID: 17204465
- DOI: 10.1093/bioinformatics/btl639
Abstract
Motivation: Prediction of the coding potential of stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and, in conjunction with functional sites such as splice sites and translation initiation sites, to position their boundaries. The ability to discriminate between coding and non-coding sequences relies on the structure of coding sequences, which are organized in codons, and on the biased usage of those codons. For statistical reasons, the longer the sequence, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons can be short and hard to distinguish based on coding potential.
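To make the length argument concrete, the following is a minimal sketch of a codon-usage log-odds score. It is not the method of the paper: the function names `codon_frequencies` and `coding_log_odds`, the add-one pseudocount, and the assumption that every input is already in frame are all illustrative choices. Each codon contributes one log-ratio term, so a short exon simply offers fewer terms from which to detect the bias.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons over the DNA alphabet.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate smoothed codon frequencies from sequences read in frame 0.

    Assumption: the training sequences are already in the correct frame.
    Codons containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):
                counts[codon] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_log_odds(seq, coding_freq, noncoding_freq):
    """Sum per-codon log-ratios; positive values favour the coding model."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:
            score += math.log(coding_freq[codon] / noncoding_freq[codon])
    return score
```

With frequencies trained on toy data, a fragment twice as long and with the same composition yields twice the score, which illustrates why longer sequences are easier to classify under this kind of model.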
Results: Here, we present novel approaches that specifically aim at better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of the features that are relevant for discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments; (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths; and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes.
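For readers unfamiliar with the Markov-model baseline mentioned above, a generic sketch of the idea follows. It assumes a homogeneous, fixed-order Markov chain with add-one smoothing, which is a simplification; the abstract does not specify the models used in the comparison, and the names `train_markov` and `markov_log_odds` are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, order=2, pseudocount=1.0):
    """Return a function prob(context, base) for a fixed-order Markov chain.

    Assumption: one homogeneous chain per class, with no codon-position
    (three-periodic) structure.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if set(context + base) <= set(ALPHABET):
                counts[context][base] += 1

    def prob(context, base):
        # Unseen contexts fall back to a uniform distribution via smoothing.
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def markov_log_odds(seq, coding_prob, noncoding_prob, order=2):
    """Log-likelihood ratio of a sequence under coding vs non-coding chains."""
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        if set(context + base) <= set(ALPHABET):
            score += math.log(
                coding_prob(context, base) / noncoding_prob(context, base)
            )
    return score
```

A sequence of length n contributes only n minus `order` transition terms, so for short fragments the score rests on few observations. That is the regime in which the abstract reports that combining complementary features outperforms Markov models.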
Supplementary data: http://bioinformatics.psb.ugent.be/.