Comparative Study
BMC Bioinformatics. 2006 Dec 18;7 Suppl 5(Suppl 5):S15.
doi: 10.1186/1471-2105-7-S5-S15.

Splice site identification using probabilistic parameters and SVM classification

A K M A Baten et al. BMC Bioinformatics. 2006.

Erratum in

  • BMC Bioinformatics. 2007 Jul 5;8:241

Abstract

Background: Recent advances and automation in DNA sequencing technology have created a vast amount of DNA sequence data. This growth in sequence data demands better and more efficient analysis methods. Identifying genes in this newly accumulated data is an important problem in bioinformatics, and it requires prediction of the complete gene structure. Accurate identification of splice sites in DNA sequences plays a central role in gene structure prediction in eukaryotes. Effective detection of splice sites requires knowledge of the characteristics, dependencies, and relationships of nucleotides in the region surrounding the splice site. Higher-order Markov models are generally regarded as a useful technique for modeling higher-order dependencies. However, their implementation requires estimating a large number of parameters, which is computationally expensive.
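
To make the parameter growth concrete, the following minimal sketch counts the conditional probabilities a position-specific Markov model of a given order would need. The function name `markov_param_count`, the counting convention (early positions condition on however many predecessors exist), and the window length of 90 are illustrative assumptions, not values taken from the abstract.

```python
def markov_param_count(order: int, window_len: int, alphabet: int = 4) -> int:
    """Count conditional probabilities in a position-specific Markov model.

    Assumption: a position with fewer than `order` predecessors conditions
    only on the predecessors that exist.
    """
    total = 0
    for pos in range(window_len):
        context = min(pos, order)          # number of preceding symbols used
        total += alphabet ** (context + 1)  # one probability per (context, symbol)
    return total


if __name__ == "__main__":
    # Hypothetical 90-nucleotide window, chosen only for illustration.
    for k in range(4):
        print(k, markov_param_count(k, 90))
```

Under these assumptions the count grows roughly fourfold with each added order, which is the cost the first-order model avoids.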

Results: The proposed method for splice site detection consists of two stages: a first-order Markov model (MM1) in the first stage and a support vector machine (SVM) with a polynomial kernel in the second stage. The MM1 serves as a pre-processing step for the SVM and takes DNA sequences as its input. It models the compositional features and dependencies of nucleotides around splice site regions in terms of probabilistic parameters. These parameters are then fed into the SVM, which combines them nonlinearly to predict splice sites. When the proposed MM1-SVM model is compared with other existing standard splice site detection methods, it shows superior performance in all cases.
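
The two-stage idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes equal-length sequences over A, C, G, T, estimates position-specific first-order transition probabilities from true sites only with a pseudocount, and encodes each sequence as the vector of probabilities it selects. The names `train_mm1`, `mm1_features`, and `poly_kernel`, the pseudocount, and the kernel parameters are all hypothetical; the abstract does not specify them. An actual SVM would be trained on these feature vectors with a polynomial kernel of this form.

```python
ALPHABET = "ACGT"


def train_mm1(seqs, pseudocount=1.0):
    """Estimate a position-specific first-order Markov model.

    Returns (first, trans): first[a] approximates P(s_0 = a), and
    trans[i][(a, b)] approximates P(s_{i+1} = b | s_i = a).
    """
    length = len(seqs[0])
    first_counts = {a: pseudocount for a in ALPHABET}
    trans_counts = [
        {(a, b): pseudocount for a in ALPHABET for b in ALPHABET}
        for _ in range(length - 1)
    ]
    for s in seqs:
        first_counts[s[0]] += 1
        for i in range(length - 1):
            trans_counts[i][(s[i], s[i + 1])] += 1

    total_first = sum(first_counts.values())
    first = {a: c / total_first for a, c in first_counts.items()}
    trans = []
    for counts in trans_counts:
        # Normalise each row so the probabilities given `a` sum to 1.
        row_total = {a: sum(counts[(a, b)] for b in ALPHABET) for a in ALPHABET}
        trans.append({k: v / row_total[k[0]] for k, v in counts.items()})
    return first, trans


def mm1_features(seq, model):
    """Encode a sequence as the probabilities the model assigns along it."""
    first, trans = model
    feats = [first[seq[0]]]
    feats += [trans[i][(seq[i], seq[i + 1])] for i in range(len(seq) - 1)]
    return feats


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree on two feature vectors."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


if __name__ == "__main__":
    model = train_mm1(["AAGT", "CAGT", "AAGT"])  # toy "true site" windows
    print(mm1_features("AAGT", model))
    print(poly_kernel(mm1_features("AAGT", model), mm1_features("TTTT", model)))
```

The feature vector has one entry per position, so its length grows linearly with the window, and the kernel is what lets the classifier combine those entries nonlinearly.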

Conclusion: We proposed an effective pre-processing scheme for the SVM and applied it to the identification of splice sites. This simple yet effective splice site detection method achieves better classification accuracy and computational speed than some more complex methods.

Figures

Figure 1
Illustration of acceptor and donor splice sites. Introns usually end with the dinucleotide AG; the border between intron and exon in a DNA sequence is termed the acceptor splice site. Introns usually start with the dinucleotide GT; the border between exon and intron in a DNA sequence is termed the donor splice site.
Figure 2
ROC curve comparing the performance of MM1-SVM, WMM0/MM0-SVM, and WMM1-SVM on the NN269 acceptor dataset. MM1-SVM and WMM1-SVM perform almost equally well. WMM0/MM0-SVM performs worst among the three methods.
Figure 3
ROC curve comparing the performance of MM1-SVM, WMM0/MM0-SVM, and WMM1-SVM on the NN269 donor dataset. MM1-SVM and WMM1-SVM perform almost equally well. WMM0/MM0-SVM performs worst among the three methods.
Figure 4
ROC curve comparing the performance of MM1-SVM, the Loi-Rajapakse method, NNSplice, and GeneSplicer on the NN269 acceptor dataset. MM1-SVM performs best and the Loi-Rajapakse method second best. NNSplice and GeneSplicer perform worst in this case.
Figure 5
ROC curve showing the comparison of performance between MM1-SVM, Loi-Rajapakse method, NNSplice, and GeneSplicer using NN269 donor dataset. MM1-SVM produces the best prediction accuracy. Loi-Rajapakse method produces the second best performance while NNSplice produces the worst performance.
Figure 6
ROC curve comparing the performance of MM1-SVM and MDD on the DGSplicer acceptor dataset. MDD performs almost as well as MM1-SVM.
Figure 7
ROC curve showing the comparison of performance between MM1-SVM and MDD using DGSplicer donor dataset. MM1-SVM performs better than MDD.
Figure 8
Overview of the model. The input DNA sequence data is pre-processed by a first-order Markov model, which generates probabilistic parameters. An SVM with a polynomial kernel takes these parameters as its input for splice site classification.
Figure 9
Two Sample Logo [46] of NN269 acceptor splice sites. It shows the nucleotides that are enriched and depleted in the regions surrounding the acceptor splice sites. The conserved dinucleotide AG is located at positions 69 and 70 in the sequence.
Figure 10
Two Sample Logo [46] of NN269 donor splice sites. It shows the nucleotides that are enriched and depleted in the regions surrounding the donor splice sites. The conserved dinucleotide GT is located at positions 8 and 9 in the sequence.
