Toward an accurate statistics of gapped alignments
- PMID: 15691544
- DOI: 10.1016/j.bulm.2004.07.001
Abstract
Sequence alignment has been an invaluable tool for finding homologous sequences. The significance of the homology found is often quantified statistically by p-values. Theory for computing p-values exists for gapless alignments [Karlin, S., Altschul, S.F., 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264-2268; Karlin, S., Dembo, A., 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 113-140], but a full generalization to alignments with gaps is not yet complete. We present a unified statistical analysis of two common sequence comparison algorithms: maximum-score (Smith-Waterman) alignments and their generalized probabilistic counterparts, including maximum-likelihood alignments and hidden Markov models. The most important statistical characteristic of these algorithms is the distribution function of the maximum score S_max (for maximum-score alignment) or of the maximum free energy F_max (for probabilistic alignment), computed for mutually uncorrelated random sequences. This distribution is known empirically to be of the Gumbel form, with an exponential tail P(S_max > x) ≈ exp(-λx) for maximum-score alignment and P(F_max > x) ≈ exp(-λx) for some classes of probabilistic alignment. We derive an exact expression for λ for particular probabilistic alignments. This result is then used to obtain accurate λ values for generic probabilistic and maximum-score alignments. Although the result is demonstrated for a simple match-mismatch scoring system, it is expected to be a good starting point for more general scoring functions.
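The empirical Gumbel behaviour described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's analytical derivation of λ: it scores pairs of random sequences with a gapped Smith-Waterman recursion and fits a Gumbel distribution by the method of moments. The scoring values (match +1, mismatch -1, linear gap penalty -2), the sequence length, the sample size, and all function names are assumptions chosen for illustration only.

```python
import math
import random

def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (assumed scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell score never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best

def random_sequence(n, rng, alphabet="ACGT"):
    """Uncorrelated random sequence with uniform letter frequencies."""
    return "".join(rng.choice(alphabet) for _ in range(n))

def estimate_gumbel(scores):
    """Method-of-moments Gumbel fit: var = pi^2 / (6 lam^2), mean = mu + gamma / lam."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    lam = math.pi / math.sqrt(6.0 * var)
    mu = mean - 0.5772156649015329 / lam  # Euler-Mascheroni constant
    return lam, mu

def gumbel_p_value(x, lam, mu):
    """P(S_max > x) = 1 - exp(-exp(-lam * (x - mu))); tail decays like exp(-lam * x)."""
    return -math.expm1(-math.exp(-lam * (x - mu)))

rng = random.Random(0)  # fixed seed so the run is reproducible
scores = [
    smith_waterman_score(random_sequence(60, rng), random_sequence(60, rng))
    for _ in range(200)
]
lam, mu = estimate_gumbel(scores)
print(f"lambda ~ {lam:.3f}, mu ~ {mu:.3f}")
print(f"p-value for score 15: {gumbel_p_value(15, lam, mu):.3g}")
```

A moment-based fit on a few hundred short sequences is noisy and sensitive to finite-size effects, which is part of the motivation for an analytical value of λ such as the one the abstract describes.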
Similar articles
- Calibrating E-values for hidden Markov models using reverse-sequence null models. Bioinformatics. 2005 Nov 15;21(22):4107-15. doi: 10.1093/bioinformatics/bti629. Epub 2005 Aug 25. PMID: 16123115
- Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J Comput Biol. 2001;8(3):249-82. doi: 10.1089/10665270152530845. PMID: 11535176
- From analysis of protein structural alignments toward a novel approach to align protein sequences. Proteins. 2004 Feb 15;54(3):569-82. doi: 10.1002/prot.10503. PMID: 14748004
- Probability, statistics, and computational science. Methods Mol Biol. 2012;855:77-110. doi: 10.1007/978-1-61779-582-4_3. PMID: 22407706 Review.
- Statistical significance in biological sequence analysis. Brief Bioinform. 2006 Mar;7(1):2-24. doi: 10.1093/bib/bbk001. PMID: 16761361 Review.
Cited by
- Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail. Algorithms Mol Biol. 2007 Jul 11;2:9. doi: 10.1186/1748-7188-2-9. PMID: 17625018 Free PMC article.
- Geometric aspects of biological sequence comparison. J Comput Biol. 2009 Apr;16(4):579-610. doi: 10.1089/cmb.2008.0100. PMID: 19361329 Free PMC article.
- Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty. BMC Bioinformatics. 2009 Mar 19;10 Suppl 3(Suppl 3):S1. doi: 10.1186/1471-2105-10-S3-S1. PMID: 19344477 Free PMC article.
- A simple derivation of the distribution of pairwise local protein sequence alignment scores. Evol Bioinform Online. 2008 Feb 14;4:41-5. PMID: 19204806 Free PMC article.