ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES

Yonil Park¹, Sergey Sheetlin, John L Spouge

Affiliations

PMID: 20148197
PMCID: PMC2818155
DOI: 10.1214/08-AOS663

ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES

Yonil Park et al. Ann Stat. 2009.

. 2009 Dec 1;37(6A):3697.

doi: 10.1214/08-AOS663.

Authors

Yonil Park¹, Sergey Sheetlin, John L Spouge

Affiliation

¹ National Center for Biotechnology Information National Library of Medicine National Institutes of Health 8600 Rockville Pike Bethesda, Maryland 20894 USA.

PMID: 20148197
PMCID: PMC2818155
DOI: 10.1214/08-AOS663

Abstract

The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.

PubMed Disclaimer

Figures

**Fig. 1**
Gapped global alignment scores and the corresponding directed paths for two subsequences A[1, 10] = TACTAGCGCA *and* B[1, 9] = ACGGTAGAT, *drawn from the nucleotide alphabet* {A, C, G, T}. *Figure* 1 *uses a nucleotide scoring matrix, where s*(*a, b*) = 5 *if a* = *b and* –4 *otherwise, and the affine gap penalty w_g* = 3 + 2g. *The vertex* (*i, j*) *is in the northeast corner of the cell* (*i, j*), *with the origin* (0, 0) *at the southwest corner of Figure* 1. *The cell* (*i, j*) *displays the global score S_{i, j}, calculated from* (2.2). *The optimal global path ending at the point* (10, 8), *for example, consists of* 12 *edges, in order*: 1 *east of length* 1, 2 *northeast*, 1 *north of length* 2, 3 *northeast*, 1 *east of length* 3, *and* 1 *northeast. The optimal global score S*_{10, 8} = –5 + 5 + 5 – 7 + 5 + 5 + 5 – 9 + 5 = 9 is the sum of the corresponding edges and represents the path of greatest weight starting at (0, 0) *and ending at* (10, 8). The corresponding optimal global alignment of the subsequences A[1, 10] *and* B[1, 9] is $\begin{matrix} TAC - - TAGCGCA \\ - ACGGTAG - - - A . \end{matrix}$ *The edge maxima are M*₁ = –4, M₂ = 0, M₃ = 5, M₄ = 1, M₅ = 3, M₆ = 8, M₇ = 13, M₈ = 9, M₉ = 6. The shading and the double lines indicate squares where a vertex (*surrounded by double lines*) *generated an SALE β*(k). *The SALE scores are M*_β(1) = M₃ = 5, M_β(2) = M₆ = 8, M_β(3) = M₇ = 13; *and the global maximum M for* A *and* B *is no less than* 13, *the largest global score shown*.

**Fig. 2**
Two examples of alignment path π generated by a Markov chain. As in Figure 1, the shading and the double lines indicate squares where a vertex (*surrounded by double lines*) generated an SALE. The SALEs determine the stopping time N = β(3). *In Figure* 2, the first SALE is determined by the score at the vertex (3, 3); *the second SALE, the vertex* (7, 6); *the third SALE, the vertex* (9, 10). *Therefore, N* = β(3) = 10. *The vertical ray* I_{N, N}*and the horizontal ray* D_{N, N}*are indicated by double circles. The lower path π* (*solid line*) *ends at* (N + 2, N) *with a final transition to S*; *the upper path π* (*long-dashed line*), at (*N, N* + 4) with a final transition to D. The closed vertices indicate intersection with the square corresponding to $ω^{'} = (ω_{A}^{'}, ω_{B}^{'}) = (A [1, N], B [1, N])$ .

**Fig. 3**
*Plot of relative errors against computation time* (*sec*). Both axes are in logarithmic scale. Computation time was measured on a 2.99 *GHz Pentium*® *DCPU. Relative errors for BLOSUM*45 *with* Δ(g) = 14 + 2g are shown by ■; *BLOSUM*62 *with* Δ(g) 11 *g, by* ◆; *BLOSUM*80 *with* Δ(g) = 10 + *g, by* ▲; *PAM*70 *with* Δ(g) = 10 + *g, by* •; *PAM*30 *with* Δ(g) = 9 + *g, by* ○.

See this image and copyright information in PMC

References

1. Aldous D. Probability Approximations via the Poisson Clumping Heuristic. 1st ed. Springer; New York: 1989. MR0969362.
1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Molecular Biology. 1990;215:403–410. - PubMed
1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. - PMC - PubMed
1. Altschul SF, Bundschuh R, Olsen R, Hwa T. The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res. 2001;29:351–361. - PMC - PubMed
1. Asmussen S. Applied Probability and Queues. Springer; New York: 2003. MR1978607.

Grants and funding

Z01 LM000088/ImNIH/Intramural NIH HHS/United States

LinkOut - more resources

Full Text Sources
- Europe PubMed Central
- PubMed Central

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES

Affiliation

ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES

Authors

Affiliation

Abstract

Figures

References

Grants and funding

LinkOut - more resources

Full Text Sources