BMC Bioinformatics. 2006 Nov 3;7:484.
doi: 10.1186/1471-2105-7-484.

A statistical score for assessing the quality of multiple sequence alignments

Virpi Ahola et al. BMC Bioinformatics.

Abstract

Background: Multiple sequence alignment is the foundation of many important applications in bioinformatics that aim at detecting functionally important regions, predicting protein structures, building phylogenetic trees, etc. Although the automatic construction of a multiple sequence alignment for a set of remotely related sequences is a very challenging and error-prone task, many downstream analyses still rely heavily on the accuracy of the alignments.

Results: To address the need for an objective evaluation framework, we introduce a statistical score that assesses the quality of a given multiple sequence alignment. The quality assessment is based on counting the number of significantly conserved positions in the alignment using an importance sampling method in conjunction with a statistical profile analysis framework. We first evaluate a novel objective function used in the alignment quality score for measuring positional conservation. The results for the Src homology 2 (SH2) domain, Ras-like proteins, peptidase M13, subtilase and beta-lactamase families demonstrate that the score can distinguish sequence patterns with different degrees of conservation. Secondly, we evaluate the quality of the alignments produced by several widely used multiple sequence alignment programs using the novel alignment quality score and the commonly used sum of pairs method. According to these results, the Mafft strategy L-INS-i outperforms the other methods, although the differences among Probcons, TCoffee and Muscle are mostly insignificant. The novel alignment quality score gives results similar to those of the sum of pairs method.

Conclusion: The results indicate that the proposed statistical score is useful in assessing the quality of multiple sequence alignments.
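
The sum of pairs method mentioned in the Results can be illustrated with a minimal sketch. It assumes the reference-based definition commonly used with BAliBASE: the fraction of residue pairs aligned in a reference alignment that are also aligned in the test alignment. The abstract does not spell out the exact variant used, and the function names `aligned_pairs` and `sum_of_pairs_score` are hypothetical, chosen only for this illustration.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each pair is ((seq_i, res_i), (seq_j, res_j)), where res_* counts
    non-gap characters within a sequence, so the same pair can be
    recognised in two alignments with different gap placement.
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows must have the same length")
    counters = [0] * len(alignment)  # next residue index per sequence
    pairs = set()
    for col in range(n_cols):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        # every two residues in the same column form an aligned pair
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference-aligned residue pairs recovered by `test`.

    Assumed definition (reference-based SP score), not necessarily the
    exact variant used in the paper.
    """
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Toy example: the test alignment misplaces the final G of the first sequence.
reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
score = sum_of_pairs_score(test, reference)
```

Because residues are indexed by their position in the ungapped sequence, the score depends only on which residues are paired, not on where the gap columns fall.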

Figures

Figure 1
MultiDisp visualization of part of the Ras-like proteins (upper) and the corresponding scaled -log(p)-values (lower). The curves show the p-values calculated using (red) Blosum62, (green) Gonnet250, (black) PAM250, (magenta) identity scoring matrices and (blue) classification of the amino acids for the Ras-like proteins.
Figure 2
MultiDisp visualization of the a) βB-strand, b) βD-strand and c) αB-helix of the SH2 domain (upper) and the corresponding conservation scores (lower). The curves show (red) the scaled -log(p)-values, (blue) Mean Distance and (green) Information content scores for the alignment. The consensus sequence for the alignment positions in c) is F P S L P E L V E H Y.
Figure 3
MultiDisp visualization of the a) I, b) II, c) III and d) IV motifs of the peptidase M13, e) I, f) II and g) III motifs of the subtilase, and h) I and i) II motifs of the β-lactamase families and the table of the conservation scores. MD = mean distance, IC = information content scores and maxZ = scaled -log(p)-values for the alignment.
Figure 4
Scatterplot of the AQ and SP scores for the Mafft (L-INS-i) alignments (r = 0.53). The four outlying alignments in the bottom right corner are from the reference sets 11 and 40.
Figure 5
Barplots for the median (red) AQ, (green) SP and (blue) CP scores in the BAliBASE reference sets. Error bars show the 25th and 75th percentile values.
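
The information content (IC) score named in the captions for Figures 2 and 3 can be sketched per alignment column. The version below assumes one common definition: the maximum possible entropy for a 20-letter amino acid alphabet minus the Shannon entropy of the observed residues, with gaps ignored. The captions do not define the score, so this may differ from the paper's formulation; `column_information_content` is a hypothetical name.

```python
import math
from collections import Counter


def column_information_content(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Assumed definition: log2(alphabet_size) minus the Shannon entropy
    of the residue frequencies. Gap characters ('-') are ignored.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0  # an all-gap column carries no information
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return math.log2(alphabet_size) - entropy


# A fully conserved column reaches the maximum; a 50/50 column loses one bit.
ic_conserved = column_information_content("AAAA")
ic_mixed = column_information_content("AACC")
```

Under this definition, higher values indicate stronger conservation, which matches the direction in which the conservation curves in the figures are read.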
