BMC Bioinformatics. 2006 Dec 5:7:527.
doi: 10.1186/1471-2105-7-527.

Logarithmic gap costs decrease alignment accuracy

Reed A Cartwright. BMC Bioinformatics.

Abstract

Background: Studies on the distribution of indel sizes have consistently found that they obey a power law. This finding has led several scientists to propose that logarithmic gap costs, G (k) = a + c ln k, are more biologically realistic than affine gap costs, G (k) = a + bk, for sequence alignment. Since quick and efficient affine costs are currently the most popular way to globally align sequences, the goal of this paper is to determine whether logarithmic gap costs improve alignment accuracy significantly enough to merit their use over the faster affine gap costs.

Results: A group of simulated sequence pairs was globally aligned using affine, logarithmic, and log-affine gap costs. Alignment accuracy was calculated by comparing the resulting alignments to the actual alignments of the sequence pairs. Gap costs were then compared based on average alignment accuracy. Log-affine gap costs had the best accuracy, followed closely by affine gap costs, while logarithmic gap costs performed poorly. Subsequently, a model was developed to explain the results.
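To make the role of the gap cost function concrete, the following is a minimal sketch of global alignment under an arbitrary gap cost G (k), using the classic cubic-time recurrence for general gap costs. It is not taken from the paper: the function name `global_alignment_cost`, the match cost of 0, and the mismatch cost of 1 are assumptions chosen for illustration, and the abstract does not say which algorithm or substitution costs were actually used.

```python
import math


def global_alignment_cost(a, b, gap_cost, match=0.0, mismatch=1.0):
    """Minimum cost of globally aligning a and b.

    gap_cost(k) is the cost of one gap of length k >= 1.
    match and mismatch are illustrative assumptions, not the paper's values.
    """
    n, m = len(a), len(b)
    # dist[i][j] = minimum cost of aligning a[:i] with b[:j]
    dist = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = math.inf
            # Last column pairs a[i-1] with b[j-1].
            if i > 0 and j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                best = dist[i - 1][j - 1] + sub
            # Alignment ends with a gap of length k in b (residues of a unmatched).
            for k in range(1, i + 1):
                best = min(best, dist[i - k][j] + gap_cost(k))
            # Alignment ends with a gap of length k in a (residues of b unmatched).
            for k in range(1, j + 1):
                best = min(best, dist[i][j - k] + gap_cost(k))
            dist[i][j] = best
    return dist[n][m]


def affine(k, a=4.0, b=0.25):
    """Affine gap cost G(k) = a + b*k; the defaults are illustrative."""
    return a + b * k


def logarithmic(k, a=0.125, c=8.0):
    """Logarithmic gap cost G(k) = a + c*ln(k); the defaults are illustrative."""
    return a + c * math.log(k)


if __name__ == "__main__":
    for name, cost in (("affine", affine), ("logarithmic", logarithmic)):
        print(name, global_alignment_cost("ACGTTGCA", "ACGCA", cost))
```

Because the gap cost is passed in as a function, the same routine can be run with affine, logarithmic, or log-affine costs, which is the kind of comparison the Results paragraph describes. The inner loops over k make this sketch much slower than the specialised affine algorithm, which is the speed trade-off the Background mentions.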

Conclusion: In contrast to initial expectations, logarithmic gap costs produce poor alignments and are actually not implied by the power-law behavior of gap sizes, given typical match and mismatch costs. Furthermore, affine gap costs not only produce accurate alignments but are also good approximations to biologically realistic gap costs. This work provides added confidence for the biological relevance of existing alignment algorithms.


Figures

Figure 1
Example alignment pair. Numbers identify the residues in the sequences. K1 columns (A5B-, A6B-, A7B5, and A8B6) are found in only the left alignment. K2 columns (A7B-, A8B-, A5B5, and A6B6) are found in only the right alignment. K3 columns (A1B1, A2B2, A3B3, and A4B4) are found in both alignments. Alignment identity is I = (2K3)/(2K3 + K1 + K2) = (2 × 4)/(2 × 4 + 4 + 4) = 1/2.
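The identity calculation in the Figure 1 caption can be sketched in a few lines. This is an illustrative reading of the caption rather than the paper's code: the helper names `alignment_columns` and `alignment_identity` are hypothetical, and the placeholder residue `X` stands in for the actual sequences, which the caption does not give.

```python
def alignment_columns(row_a, row_b, gap="-"):
    """Return the set of columns of a pairwise alignment.

    Each column is a pair (i, j) of 1-based residue indices,
    with None marking a gap in that sequence.
    """
    columns = set()
    i = j = 0
    for x, y in zip(row_a, row_b):
        col_a = col_b = None
        if x != gap:
            i += 1
            col_a = i
        if y != gap:
            j += 1
            col_b = j
        columns.add((col_a, col_b))
    return columns


def alignment_identity(cols_left, cols_right):
    """I = 2*K3 / (2*K3 + K1 + K2), as defined in the caption."""
    k1 = len(cols_left - cols_right)   # columns only in the left alignment
    k2 = len(cols_right - cols_left)   # columns only in the right alignment
    k3 = len(cols_left & cols_right)   # columns shared by both alignments
    return 2 * k3 / (2 * k3 + k1 + k2)


# The two alignments from the caption: sequence A has 8 residues, B has 6.
left = alignment_columns("XXXXXXXX", "XXXX--XX")
right = alignment_columns("XXXXXXXX", "XXXXXX--")
print(alignment_identity(left, right))  # 0.5
```

Comparing columns as index pairs, rather than comparing the aligned strings character by character, is what lets the shared columns A1B1 to A4B4 count as agreement even though the gaps are placed differently.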
Figure 2
Gap sizes obey a power law. Log-log plot of the distribution of gap sizes measured from the 5000 true alignments. The line is the maximum likelihood fit of a power-law distribution: ln f (k) = 0.915 - 1.53 ln k.
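The fitted line in Figure 2 is straightforward to evaluate. The sketch below only restates the caption's formula; the names `log_f` and `f` are hypothetical, and the units of f (k) are whatever the figure's vertical axis uses, which the caption does not specify.

```python
import math

INTERCEPT = 0.915   # from the caption: ln f(k) = 0.915 - 1.53 ln k
EXPONENT = 1.53


def log_f(k):
    """Natural log of the fitted value for a gap of length k."""
    return INTERCEPT - EXPONENT * math.log(k)


def f(k):
    """Fitted value f(k) = exp(0.915) * k**(-1.53)."""
    return math.exp(log_f(k))


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(k, f(k))
    # On a power law, doubling the gap length always scales f by the same factor.
    print("doubling factor:", f(2) / f(1))
```

The constant doubling factor, 2 raised to the power -1.53 (roughly 0.35), is what makes the fit a straight line on the log-log axes.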
Figure 3
The curves of the best gap costs. A) The entire range of the curves and B) a magnification of the beginning of the curves. The best gap costs were decided for each scheme based on highest average alignment identity. Log-affine G (k) = 2 + k/4 + (ln k)/2 (solid), affine GA (k) = 4 + k/4 (dashed), and logarithmic GL (k) = 1/8 + 8 ln k (dotted).
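The three best gap costs from the Figure 3 caption can be tabulated directly. The sketch below uses the caption's formulas; the function names are hypothetical and the chosen gap lengths are arbitrary.

```python
import math


def log_affine(k):
    """Best log-affine cost from the caption: G(k) = 2 + k/4 + (ln k)/2."""
    return 2 + k / 4 + math.log(k) / 2


def affine(k):
    """Best affine cost from the caption: GA(k) = 4 + k/4."""
    return 4 + k / 4


def logarithmic(k):
    """Best logarithmic cost from the caption: GL(k) = 1/8 + 8 ln k."""
    return 1 / 8 + 8 * math.log(k)


if __name__ == "__main__":
    print(f"{'k':>4} {'log-affine':>11} {'affine':>8} {'logarithmic':>12}")
    for k in (1, 2, 5, 10, 50, 100):
        print(f"{k:>4} {log_affine(k):>11.3f} {affine(k):>8.3f} {logarithmic(k):>12.3f}")
```

The log-affine and affine costs share the slope k/4 and differ only by (ln k)/2 - 2, so the two curves cross where ln k = 4, at a gap length of about 55. Below that length the log-affine cost is the cheaper of the two.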
Figure 4
Accuracy distribution of best gap costs. Best log-affine (solid), best affine (dashed), and best logarithmic (dotted). Accuracy is measured via alignment identity. See Figure 3 for details on the exact gap costs.
Figure 5
Accuracies of best costs plotted by divergence. I, IA, and IL are the alignment identities produced by the best log-affine, affine, and logarithmic gap penalties, respectively. See Figure 3 for more information. a-c) Alignment identities plotted by the branch length of the alignments. Divergence time is plotted on a uniform scale, u = 1 - exp (-t/t̄). d-f) Box-whisker plots of identities grouped into 20 bins of 250 values. Solid bars are medians. Notches are significant range of medians. Bars are the mid-range. Whiskers are the range. Circles are outliers.
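The uniform scale in the Figure 5 caption is a one-line transformation. The sketch below assumes that t̄ is the mean divergence time and, for the demonstration only, that divergence times are exponentially distributed. Neither assumption is stated in the caption, and the names `uniform_scale` and `mean_t` are hypothetical.

```python
import math
import random


def uniform_scale(t, mean_t):
    """Map a divergence time t >= 0 to u = 1 - exp(-t / mean_t), with u in [0, 1]."""
    return 1 - math.exp(-t / mean_t)


if __name__ == "__main__":
    rng = random.Random(0)   # fixed seed so the demonstration is repeatable
    mean_t = 0.5             # illustrative value, not taken from the paper
    # Assumption: exponentially distributed divergence times with mean mean_t.
    times = [rng.expovariate(1 / mean_t) for _ in range(5000)]
    scaled = [uniform_scale(t, mean_t) for t in times]
    # Count how many transformed values fall into each of 5 equal-width bins.
    counts = [0] * 5
    for u in scaled:
        counts[min(int(u * 5), 4)] += 1
    print(counts)
```

If the times really are exponential with mean t̄, this transformation is their cumulative distribution function, so the transformed values spread evenly over the unit interval. That would be consistent with the caption's grouping of identities into equal-sized bins.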
Figure 6
Accuracies of best costs compared per sequence. Ratio of identities produced by a) best affine gap cost and b) best logarithmic gap cost to the identities produced by the log-affine gap cost plotted for each sequence pair by divergence time. See Figure 5 for more information.
Figure 7
Maximum accuracies plotted by divergence. S, SA, and SL are the maximum alignment identity produced for each sequence pair by log-affine, affine, and logarithmic gap costs respectively. The subfigures are the same as in Figure 5.
Figure 8
Maximum accuracies compared per sequence. Ratio of maximum identities produced by a) affine gap costs and b) logarithmic gap costs to the maximum identities produced by log-affine gap costs plotted for each sequence pair by divergence time. See Figures 5-7 for more information.

