Toward an accurate statistics of gapped alignments
- PMID: 15691544
- DOI: 10.1016/j.bulm.2004.07.001
Abstract
Sequence alignment has been an invaluable tool for finding homologous sequences. The significance of the homology found is often quantified statistically by p-values. Theory for computing p-values exists for gapless alignments [Karlin, S., Altschul, S.F., 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264-2268; Karlin, S., Dembo, A., 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 113-140], but a full generalization to alignments with gaps is not yet complete. We present a unified statistical analysis of two common sequence comparison algorithms: maximum-score (Smith-Waterman) alignments and their generalized probabilistic counterparts, including maximum-likelihood alignments and hidden Markov models. The most important statistical characteristic of these algorithms is the distribution function of the maximum score $S_{\max}$ or, respectively, the maximum free energy $F_{\max}$, for mutually uncorrelated random sequences. This distribution is known empirically to be of the Gumbel form with an exponential tail $P(S_{\max} > x) \approx \exp(-\lambda x)$ for maximum-score alignment and $P(F_{\max} > x) \approx \exp(-\lambda x)$ for some classes of probabilistic alignment. We derive an exact expression for $\lambda$ for particular probabilistic alignments. This result is then used to obtain accurate $\lambda$ values for generic probabilistic and maximum-score alignments. Although the result is demonstrated with a simple match-mismatch scoring system, it is expected to be a good starting point for more general scoring functions.
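To make the empirical Gumbel statement concrete, the following is a minimal simulation sketch, not the derivation presented in the paper. It scores pairs of random sequences with a Smith-Waterman recursion under a match-mismatch scheme and then estimates the tail parameter from the Gumbel variance relation Var = pi^2 / (6 lambda^2). The function names, the scoring values (match = 1, mismatch = -1, linear gap = -2), the uniform ACGT alphabet, and the moment-based estimator are all illustrative assumptions; short sequences and small samples give only a crude estimate.

```python
import math
import random
import statistics


def smith_waterman_max(a, b, match=1, mismatch=-1, gap=-2):
    """Maximum local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ca in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best


def estimate_lambda(n_pairs=200, length=60, seed=0):
    """Crude moment estimate of lambda from maximum scores of random pairs.

    Uses the Gumbel relation Var = pi^2 / (6 * lambda^2); this is a
    hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        a = rng.choices("ACGT", k=length)  # uncorrelated random sequences
        b = rng.choices("ACGT", k=length)
        scores.append(smith_waterman_max(a, b))
    lam = math.pi / (math.sqrt(6) * statistics.stdev(scores))
    return lam, scores


if __name__ == "__main__":
    lam, scores = estimate_lambda()
    print(f"mean S_max = {statistics.mean(scores):.2f}, lambda estimate = {lam:.3f}")
```

A moment estimate is used here only because it needs nothing beyond the standard library; fitting the tail directly or using maximum likelihood would be the more careful choice for real significance estimates.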