Instability in progressive multiple sequence alignment algorithms

Kieran Boyce¹, Fabian Sievers¹, Desmond G Higgins¹

PMID: 26457114
PMCID: PMC4599319
DOI: 10.1186/s13015-015-0057-1

Kieran Boyce et al. Algorithms Mol Biol. 2015 Oct 9;10:26. doi: 10.1186/s13015-015-0057-1. eCollection 2015.

Affiliation

¹ Conway Institute of Biomolecular and Biomedical Research and UCD School of Medicine and Medical Science, University College Dublin, Dublin 4, Ireland.

Abstract

Background: Progressive alignment is the standard approach used to align large numbers of sequences. As with all heuristics, this involves a tradeoff between alignment accuracy and computation time.

Results: We examine this tradeoff and find that, because of a loss of information in the early steps of the approach, the alignments generated by the most common multiple sequence alignment programs are inherently unstable: simply reversing the order of the sequences in the input file causes a different alignment to be generated. Although this effect is more obvious with larger numbers of sequences, it can also be seen with data sets on the order of one hundred sequences. We also outline a means of determining the number of sequences in a data set beyond which instability becomes more likely.

Conclusions: This instability has major ramifications both for the designers of large-scale multiple sequence alignment algorithms and for the users of these alignments.

Keywords: Clustal; Kalign; Large scale alignment; Mafft; Multiple sequence alignment; Muscle; Pfam; Sequence order.
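The input-order test described in the Results can be set up with a few lines of code. The following is a minimal sketch, assuming the sequences are stored in FASTA format; the function names `parse_fasta`, `format_fasta` and `reverse_fasta_order` are illustrative and are not taken from the paper. It only reverses the record order; running an aligner on both versions and comparing the outputs is left to the reader.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs, keeping input order."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(records: List[Tuple[str, str]], width: int = 60) -> str:
    """Write (header, sequence) pairs back out as FASTA text."""
    lines: List[str] = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


def reverse_fasta_order(text: str) -> str:
    """Return the same sequences with the record order reversed."""
    return format_fasta(list(reversed(parse_fasta(text))))
```

Aligning both the original and the reversed file with the same program and the same settings, then comparing the two alignments, reproduces the kind of comparison the abstract describes.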

Figures

**Fig. 1**
Difference in TC core scores for random samples and in reverse order. The difference in the TC core scores for 1000 randomly selected sequences aligned in their original order and in reverse order, for 68 HomFam protein families. $n = 10$ samples per family

**Fig. 2**
Unique distances by number of sequences for each alignment program. The number of unique distances with increasing number of sequences. Each *line* is the mean of 100 samples for each HomFam protein family

**Fig. 3**
Theoretical maximum and actual number of unique distances, and maximum theoretical number of sequences that can be aligned without duplicate distance measures. The theoretical maximum number of unique distances for each HomFam family, the actual number of unique distances found in the datasets used to generate Fig. 1, and the maximum theoretical number of sequences that can be aligned without generating duplicate distance measures based on the calculation that N sequences will produce $N (N - 1) / 2$ distance measures

**Fig. 4**
Count of the differences in forward and reverse TC core scores. The number of samples within each HomFam family where the forward and reverse TC core scores are different. $n = 100$ samples for each family and dataset size
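The calculation mentioned in the caption of Fig. 3 can be illustrated with a short sketch. $N$ sequences produce $N (N - 1) / 2$ pairwise distances, so if a distance measure can take at most $U$ distinct values, duplicates become unavoidable once $N (N - 1) / 2 > U$. The code below finds the largest $N$ for which duplicates are still avoidable. How $U$ is obtained for a given family is not stated in this record, so it is treated here as a given input; the function names are illustrative.

```python
import math


def pair_count(n: int) -> int:
    """Number of pairwise distances among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def max_sequences_without_duplicates(unique_distances: int) -> int:
    """Largest N such that N(N-1)/2 <= unique_distances.

    Solves N^2 - N - 2U <= 0 for N using integer arithmetic only.
    `unique_distances` (U) is an assumed input: the maximum number of
    distinct values the distance measure can produce.
    """
    if unique_distances < 0:
        raise ValueError("unique_distances must be non-negative")
    return (1 + math.isqrt(1 + 8 * unique_distances)) // 2


if __name__ == "__main__":
    for u in (10, 1000, 100000):
        n = max_sequences_without_duplicates(u)
        print(f"U={u}: up to N={n} sequences ({pair_count(n)} pairs)")
```

This bound is a pigeonhole argument only: duplicates can and usually do appear well below it, since nothing forces the observed distances to be distinct.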
