Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009 Dec 1;37(6A):3697.
doi: 10.1214/08-AOS663.

ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES

Affiliations

ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES

Yonil Park et al. Ann Stat. .

Abstract

The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
Gapped global alignment scores and the corresponding directed paths for two subsequences A[1, 10] = TACTAGCGCA and B[1, 9] = ACGGTAGAT, drawn from the nucleotide alphabet {A, C, G, T}. Figure 1 uses a nucleotide scoring matrix, where s(a, b) = 5 if a = b and –4 otherwise, and the affine gap penalty wg = 3 + 2g. The vertex (i, j) is in the northeast corner of the cell (i, j), with the origin (0, 0) at the southwest corner of Figure 1. The cell (i, j) displays the global score Si, j, calculated from (2.2). The optimal global path ending at the point (10, 8), for example, consists of 12 edges, in order: 1 east of length 1, 2 northeast, 1 north of length 2, 3 northeast, 1 east of length 3, and 1 northeast. The optimal global score S10, 8 = –5 + 5 + 5 – 7 + 5 + 5 + 5 – 9 + 5 = 9 is the sum of the corresponding edges and represents the path of greatest weight starting at (0, 0) and ending at (10, 8). The corresponding optimal global alignment of the subsequences A[1, 10] and B[1, 9] is TACTAGCGCAACGGTAGA. The edge maxima are M1 = –4, M2 = 0, M3 = 5, M4 = 1, M5 = 3, M6 = 8, M7 = 13, M8 = 9, M9 = 6. The shading and the double lines indicate squares where a vertex (surrounded by double lines) generated an SALE β(k). The SALE scores are Mβ(1) = M3 = 5, Mβ(2) = M6 = 8, Mβ(3) = M7 = 13; and the global maximum M for A and B is no less than 13, the largest global score shown.
Fig. 2
Fig. 2
Two examples of alignment path π generated by a Markov chain. As in Figure 1, the shading and the double lines indicate squares where a vertex (surrounded by double lines) generated an SALE. The SALEs determine the stopping time N = β(3). In Figure 2, the first SALE is determined by the score at the vertex (3, 3); the second SALE, the vertex (7, 6); the third SALE, the vertex (9, 10). Therefore, N = β(3) = 10. The vertical ray IN, N and the horizontal ray DN, N are indicated by double circles. The lower path π (solid line) ends at (N + 2, N) with a final transition to S; the upper path π (long-dashed line), at (N, N + 4) with a final transition to D. The closed vertices indicate intersection with the square corresponding to ω=(ωA,ωB)=(A[1,N],B[1,N]).
Fig. 3
Fig. 3
Plot of relative errors against computation time (sec). Both axes are in logarithmic scale. Computation time was measured on a 2.99 GHz Pentium® DCPU. Relative errors for BLOSUM45 with Δ(g) = 14 + 2g are shown by ■; BLOSUM62 with Δ(g) 11 g, by ◆; BLOSUM80 with Δ(g) = 10 + g, by ▲; PAM70 with Δ(g) = 10 + g, by •; PAM30 with Δ(g) = 9 + g, by ○.

References

    1. Aldous D. Probability Approximations via the Poisson Clumping Heuristic. 1st ed. Springer; New York: 1989. MR0969362.
    1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Molecular Biology. 1990;215:403–410. - PubMed
    1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. - PMC - PubMed
    1. Altschul SF, Bundschuh R, Olsen R, Hwa T. The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res. 2001;29:351–361. - PMC - PubMed
    1. Asmussen S. Applied Probability and Queues. Springer; New York: 2003. MR1978607.

LinkOut - more resources