Nucleic Acids Res. 2005 Dec 16;33(22):7120-8.
doi: 10.1093/nar/gki1020. Print 2005.

Automatic assessment of alignment quality

Timo Lassmann et al. Nucleic Acids Res.

Abstract

Multiple sequence alignments play a central role in the annotation of novel genomes. Given the biological and computational complexity of this task, the automatic generation of high-quality alignments remains challenging. Since multiple alignments are usually employed at the very start of data analysis pipelines, it is crucial to ensure high alignment quality. We describe a simple, yet elegant, solution to assess the biological accuracy of alignments automatically. Our approach is based on the comparison of several alignments of the same sequences. We introduce two functions to compare alignments: the average overlap score and the multiple overlap score. The former identifies difficult alignment cases by expressing the similarity among several alignments, while the latter estimates the biological correctness of individual alignments. We implemented both functions in the MUMSA program and demonstrate the overall robustness and accuracy of both functions on three large benchmark sets.
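The idea of comparing several alignments of the same sequences can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: an alignment is represented as a list of equal-length gapped strings, two alignments are compared by the set of residue pairs they place in the same column, the pairwise overlap is taken as a Dice-style ratio of shared pairs, and the average overlap is the mean over all pairs of alignments. The function names `aligned_pairs`, `overlap` and `average_overlap` are hypothetical and are not the MUMSA interface; the multiple overlap score is not sketched here.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that an alignment places in one column.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    Each residue is identified as (sequence_index, residue_index).
    """
    counters = [0] * len(alignment)  # next residue index per sequence
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        # Every two residues sharing this column form one aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def overlap(aln_a, aln_b):
    """Assumed overlap: shared aligned pairs relative to the mean pair count."""
    pairs_a, pairs_b = aligned_pairs(aln_a), aligned_pairs(aln_b)
    if not pairs_a and not pairs_b:
        return 1.0  # two alignments with no aligned pairs agree trivially
    return 2 * len(pairs_a & pairs_b) / (len(pairs_a) + len(pairs_b))


def average_overlap(alignments):
    """Mean pairwise overlap across several alignments of the same sequences."""
    scores = [overlap(a, b) for a, b in combinations(alignments, 2)]
    return sum(scores) / len(scores)


# Two toy alignments of the same two sequences (ACGT and ACTGT).
aln1 = ["AC-GT", "ACTGT"]
aln2 = ["ACG-T", "ACTGT"]
print(overlap(aln1, aln2))
print(average_overlap([aln1, aln2, aln1]))
```

Under these assumptions, a low average overlap means the alignments disagree on many residue pairs, which is the kind of signal the abstract associates with difficult alignment cases.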

Figures

Figure 1
Histograms of the distribution of difficult/easy alignment cases in the Balibase (A), Prefab (B), SABmark superfamily (C) and SABmark twilight (D) benchmark test sets. The accuracy of each alignment was calculated by comparison to reference alignments using the sum-of-pairs (SP), Q and fD scores, respectively (see Materials and Methods). The SABmark twilight set consists predominantly of difficult cases, while the Balibase and Prefab sets contain mainly easy cases. The superfamily subset of SABmark is made up of an equal number of difficult and easy alignment cases.
Figure 2
Scatter-plots of estimated case difficulty using the average overlap score versus real difficulty: Balibase (A), Prefab (B), SABmark superfamily (C) and SABmark twilight (D). The Pearson correlation coefficient (r) is high for all test sets.
Figure 3
ROC curves demonstrating the agreement between the real and predicted rank of several alignments of the same sequences: Balibase (A), Prefab (B), SABmark superfamily (C) and SABmark twilight (D). For al2co, we only show the best curve among all nine combinations of methods (Table 3, italic). For the Prefab set, no meaningful results could be obtained using al2co. The rankings based on our MOSs are more accurate than the rankings according to norMD, al2co and sequence identity scores, accepting fewer false positives at comparable levels of true positives. The predictions of all scores are less accurate on the SABmark sets than on Balibase and Prefab.
