Algorithms Mol Biol. 2015 Oct 9;10:26.
doi: 10.1186/s13015-015-0057-1. eCollection 2015.

Instability in progressive multiple sequence alignment algorithms

Kieran Boyce et al. Algorithms Mol Biol. 2015.

Abstract

Background: Progressive alignment is the standard approach used to align large numbers of sequences. As with all heuristics, this involves a tradeoff between alignment accuracy and computation time.

Results: We examine this tradeoff and find that, because of a loss of information in the early steps of the approach, the alignments generated by the most common multiple sequence alignment programs are inherently unstable: simply reversing the order of the sequences in the input file causes a different alignment to be generated. Although this effect is more obvious with larger numbers of sequences, it can also be seen with data sets on the order of one hundred sequences. We also outline a means of determining the number of sequences in a data set beyond which instability becomes more likely.

Conclusions: This instability has major ramifications both for the designers of large-scale multiple sequence alignment algorithms and for the users of these alignments.

Keywords: Clustal; Kalign; Large scale alignment; Mafft; Multiple sequence alignment; Muscle; Pfam; Sequence order.
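The order-reversal test described in the Results can be set up with a few lines of code. The sketch below is an illustration only and is not taken from the paper: it assumes the input is in FASTA format, and the function names `read_fasta`, `write_fasta` and `reverse_fasta` are hypothetical. It produces the reversed input file; running an alignment program on both versions and comparing the results is left to the reader.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples, keeping input order."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(records, width=60):
    """Serialise (header, sequence) tuples as FASTA text, wrapping sequences at `width`."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


def reverse_fasta(text):
    """Return the same records in reverse order; the sequences themselves are unchanged."""
    return write_fasta(list(reversed(read_fasta(text))))


if __name__ == "__main__":
    example = ">a\nACGT\n>b\nTTGA\n>c\nGG\n"
    print(reverse_fasta(example), end="")
```

Only the order of the records changes, so any difference between the two resulting alignments can be attributed to input order rather than to the sequence content.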

Figures

Fig. 1. Difference in TC core scores for random samples and in reverse order. The difference in the TC core scores for 1000 randomly selected sequences and in reverse order. 68 HomFam protein families. n=10 samples per family.

Fig. 2. Unique distances by number of sequences for each alignment program. The number of unique distances with increasing number of sequences. Each line is the mean of 100 samples for each HomFam protein family.

Fig. 3. Theoretical maximum and actual number of unique distances, and maximum theoretical number of sequences that can be aligned without duplicate distance measures. The theoretical maximum number of unique distances for each HomFam family, the actual number of unique distances found in the datasets used to generate Fig. 1, and the maximum theoretical number of sequences that can be aligned without generating duplicate distance measures, based on the calculation that N sequences will produce N(N-1)/2 distance measures.

Fig. 4. Count of the differences in forward and reverse TC core scores. The number of samples within each HomFam family where the forward and reverse TC core scores are different. n=100 samples for each family and dataset size.
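The calculation mentioned in the caption of Fig. 3 can be made concrete with a short sketch. N sequences produce N(N-1)/2 pairwise distances, so if a distance measure can take at most D distinct values, duplicates are unavoidable once N(N-1)/2 exceeds D. The code below finds the largest N for which N(N-1)/2 is still at most D. The function names and the input `unique_distances` are hypothetical, and how D is obtained for a particular alignment program is not specified here.

```python
import math


def pair_count(n):
    """Number of pairwise distances produced by n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def max_sequences_without_duplicates(unique_distances):
    """Largest n such that n(n-1)/2 <= unique_distances.

    `unique_distances` is an assumed input: the number of distinct values
    the distance measure can take. Solving n^2 - n - 2D <= 0 gives
    n <= (1 + sqrt(1 + 8D)) / 2, evaluated here with integer arithmetic.
    """
    if unique_distances < 0:
        raise ValueError("unique_distances must be non-negative")
    return (1 + math.isqrt(1 + 8 * unique_distances)) // 2


if __name__ == "__main__":
    for d in (10, 1000):
        n = max_sequences_without_duplicates(d)
        print(d, n, pair_count(n), pair_count(n + 1))
```

Staying below this bound does not guarantee that all distances are distinct; it only marks the point beyond which duplicates are certain by a counting argument.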
