Comparative Study

BMC Bioinformatics. 2006 Dec 18;7 Suppl 5(Suppl 5):S15.

doi: 10.1186/1471-2105-7-S5-S15.

Splice site identification using probabilistic parameters and SVM classification

A K M A Baten¹, B C H Chang, S K Halgamuge, Jason Li

PMID: 17254299
PMCID: PMC1764471
DOI: 10.1186/1471-2105-7-S5-S15

A K M A Baten et al. BMC Bioinformatics. 2006.

Affiliation

¹ Dynamic Systems and Control Research Group, DoMME, The University of Melbourne, Victoria 3010, Australia. a.baten@pgrad.unimelb.edu.au

Erratum in

BMC Bioinformatics. 2007 Jul 5;8:241

Abstract

Background: Recent advances and automation in DNA sequencing technology have created a vast amount of DNA sequence data. This growth in sequence data demands better and more efficient analysis methods. Identifying genes in this newly accumulated data is an important problem in bioinformatics, and it requires prediction of the complete gene structure. Accurate identification of splice sites in DNA sequences plays a central role in gene structure prediction in eukaryotes. Effective detection of splice sites requires knowledge of the characteristics, dependencies, and relationships of nucleotides in the region surrounding the splice site. Higher-order Markov models are generally regarded as a useful technique for modeling higher-order dependencies. However, their implementation requires estimating a large number of parameters, which is computationally expensive.
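To make the parameter-growth point concrete, the following is a minimal sketch that counts conditional-probability table entries for a position-specific Markov model of a given order. The function name `markov_table_entries`, the window length of 100 nucleotides, and the position-specific formulation are illustrative assumptions, not details taken from the paper.

```python
def markov_table_entries(window_len: int, order: int, alphabet: int = 4) -> int:
    """Count probability-table entries in a position-specific Markov model.

    Assumption: every position has its own table, and the first few
    positions condition on as many predecessors as are available.
    """
    total = 0
    for pos in range(window_len):
        context = min(pos, order)            # predecessors available at this position
        total += alphabet ** (context + 1)   # one entry per (context, symbol) pair
    return total


# Hypothetical 100-nucleotide window around a candidate splice site.
for k in range(4):
    print(k, markov_table_entries(100, k))
```

Under these assumptions the table size grows roughly fourfold with each increase in order, which is why a first-order model is far cheaper to estimate than a third-order one.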

Results: The proposed method for splice site detection consists of two stages: a first-order Markov model (MM1) is used in the first stage and a support vector machine (SVM) with a polynomial kernel is used in the second stage. The MM1 serves as a pre-processing step for the SVM and takes DNA sequences as its input. It models the compositional features and dependencies of nucleotides around splice site regions in terms of probabilistic parameters. These probabilistic parameters are then fed into the SVM, which combines them nonlinearly to predict splice sites. When the proposed MM1-SVM model is compared with other existing standard splice site detection methods, it shows superior performance in all cases.
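The first stage can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: the helper names `fit_mm1` and `mm1_features`, the add-one pseudocount, the choice to fit on true sites only, and the toy sequences are all assumptions. Each sequence is mapped to one probability per position, namely the position-specific probability of a nucleotide given its predecessor.

```python
NUCLEOTIDES = "ACGT"


def fit_mm1(seqs, pseudocount=1.0):
    """Estimate position-specific first-order probabilities P_i(x_i | x_{i-1}).

    Assumes equal-length, aligned sequences over A/C/G/T (hypothetical setup).
    """
    length = len(seqs[0])
    # Position 0 has no predecessor, so use plain nucleotide frequencies.
    start = {n: pseudocount for n in NUCLEOTIDES}
    for s in seqs:
        start[s[0]] += 1
    total = sum(start.values())
    start = {n: c / total for n, c in start.items()}

    transitions = []
    for i in range(1, length):
        counts = {p: {n: pseudocount for n in NUCLEOTIDES} for p in NUCLEOTIDES}
        for s in seqs:
            counts[s[i - 1]][s[i]] += 1
        table = {}
        for p in NUCLEOTIDES:
            row_total = sum(counts[p].values())
            table[p] = {n: counts[p][n] / row_total for n in NUCLEOTIDES}
        transitions.append(table)
    return start, transitions


def mm1_features(seq, model):
    """Turn a sequence into a vector of per-position probabilities."""
    start, transitions = model
    feats = [start[seq[0]]]
    for i in range(1, len(seq)):
        feats.append(transitions[i - 1][seq[i - 1]][seq[i]])
    return feats


# Toy, made-up training windows (not from NN269 or DGSplicer).
model = fit_mm1(["AAGT", "CAGT", "AAGT"])
features = mm1_features("AAGT", model)
print(features)
```

In a second stage, vectors like `features` for true and false sites could be passed to any polynomial-kernel SVM, for example scikit-learn's `SVC(kernel="poly")`; that pairing is an assumption made here for illustration.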

Conclusion: We proposed an effective pre-processing scheme for the SVM and applied it to the identification of splice sites. This is a simple yet effective splice site detection method, which shows better classification accuracy and computational speed than some other more complex methods.

Figures

**Figure 1**
Illustration of acceptor and donor splice sites. Introns usually end with the dinucleotide AG, and the border between intron and exon in a DNA sequence is termed the acceptor splice site. Introns usually start with the dinucleotide GT, and the border between exon and intron in a DNA sequence is termed the donor splice site.

**Figure 2**
ROC curve comparing the performance of MM1-SVM, WMM0/MM0-SVM, and WMM1-SVM on the NN269 acceptor dataset. MM1-SVM and WMM1-SVM perform almost equally well. WMM0/MM0-SVM performs worst among the three methods.

**Figure 3**
ROC curve comparing the performance of MM1-SVM, WMM0/MM0-SVM, and WMM1-SVM on the NN269 donor dataset. MM1-SVM and WMM1-SVM perform almost equally well. WMM0/MM0-SVM performs worst among the three methods.

**Figure 4**
ROC curve comparing the performance of MM1-SVM, the Loi-Rajapakse method, NNSplice, and GeneSplicer on the NN269 acceptor dataset. MM1-SVM produces the best performance, while the Loi-Rajapakse method produces the second best. NNSplice and GeneSplicer produce the worst performance in this case.

**Figure 5**
ROC curve comparing the performance of MM1-SVM, the Loi-Rajapakse method, NNSplice, and GeneSplicer on the NN269 donor dataset. MM1-SVM produces the best prediction accuracy. The Loi-Rajapakse method produces the second best performance, while NNSplice produces the worst.

**Figure 6**
ROC curve comparing the performance of MM1-SVM and MDD on the DGSplicer acceptor dataset. MDD performs almost as well as MM1-SVM.

**Figure 7**
ROC curve comparing the performance of MM1-SVM and MDD on the DGSplicer donor dataset. MM1-SVM performs better than MDD.

**Figure 8**
Overview of the model. The input DNA sequence data is pre-processed by a first-order Markov model, which generates probabilistic parameters. An SVM with a polynomial kernel takes these parameters as its input for splice site classification.

**Figure 9**
Two Sample Logo [46] of NN269 acceptor splice sites. It shows the nucleotides that are enriched and depleted in the regions surrounding the acceptor splice sites. The conserved dinucleotide AG is located at positions 69 and 70 in the sequence.

**Figure 10**
Two Sample Logo [46] of NN269 donor splice sites. It shows the nucleotides that are enriched and depleted in the regions surrounding the donor splice sites. The conserved dinucleotide GT is located at positions 8 and 9 in the sequence.

References

1. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–945. doi: 10.1038/nature03001.
2. Bauren G, Wieslander L. Splicing of Balbiani ring 1 gene pre-mRNA occurs simultaneously with transcription. Cell. 1994;76:183–192. doi: 10.1016/0092-8674(94)90182-1.
3. Chen T-M, Lu C-C, Li W-H. Prediction of splice sites with dependency graphs and their expanded bayesian networks. Bioinformatics. 2005;21:471–482. doi: 10.1093/bioinformatics/bti025.
4. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
5. Stanke M, Schoffmann O, Morgenstern B, Waack S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 2006;7:62. doi: 10.1186/1471-2105-7-62.
