Mining SNPs from EST sequences using filters and ensemble classifiers
- PMID: 20449815
- DOI: 10.4238/vol9-2gmr765
Abstract
Abundant single nucleotide polymorphisms (SNPs) provide the most complete information for genome-wide association studies. However, manual discovery of putative SNPs is a bottleneck and the original sequencing reads are often inaccessible, so a more efficient and accurate computational method for automated SNP detection is essential. We propose a novel computational method to rapidly find true SNPs in publicly available EST (expressed sequence tag) databases; this method is implemented as SNPDigger. EST sequences are first clustered and aligned. SNP candidates are then obtained according to a measure of redundant frequency. Several new informative biological features, such as the structural neighbor profiles and the physical position of the SNP, were extracted from the EST sequences, and the effectiveness of these features was demonstrated. An ensemble classifier, which employs a carefully selected feature set, was used to handle the imbalanced training data. The sensitivity and specificity of our method both exceeded 80% for human genetic data in cross-validation. Our method enables detection of SNPs from the user's own EST dataset and can be used on species for which there is no genome data. Our tests showed that this method can effectively guide SNP discovery in ESTs and will be useful for reducing the cost of biological analyses.
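The candidate-selection step described above (clustered and aligned ESTs, then a redundancy measure) can be illustrated with a minimal sketch. The abstract does not define the redundancy measure, so the rule used here is an assumption: a column of the alignment is kept as a candidate when its second most common base is seen in at least `min_minor_count` reads. The function name `candidate_snps`, its parameters, and the toy cluster are hypothetical and are not taken from SNPDigger.

```python
from collections import Counter


def candidate_snps(aligned_reads, min_minor_count=2, gap="-"):
    """Return (column, major_base, minor_base, minor_count) for candidate SNP columns.

    Assumption (not from the paper): a column is a candidate when the second
    most common base occurs in at least `min_minor_count` reads. Gaps and
    ambiguous 'N' calls are ignored when counting.
    """
    if not aligned_reads:
        return []
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    candidates = []
    for col in range(width):
        # Count the bases observed at this alignment column.
        counts = Counter(
            read[col].upper()
            for read in aligned_reads
            if read[col] != gap and read[col].upper() != "N"
        )
        if len(counts) < 2:
            continue  # monomorphic column, nothing to report
        (major, _), (minor, minor_count) = counts.most_common(2)
        # Redundancy filter: a variant seen in a single read is more likely
        # a sequencing error than a true polymorphism.
        if minor_count >= min_minor_count:
            candidates.append((col, major, minor, minor_count))
    return candidates


# Toy cluster of five aligned ESTs (hypothetical data).
cluster = [
    "ACGTACGT",
    "ACGTACGT",
    "ACGAACGT",
    "ACGAACGT",
    "ACGTACCT",
]
print(candidate_snps(cluster))  # [(3, 'T', 'A', 2)]
```

In this toy cluster the T/A difference at column 3 is supported by two reads and is kept, while the single-read G/C difference at column 6 is discarded. In the method described above, candidates that pass such a filter would then be scored by the ensemble classifier using the extracted features.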