Sequence embedding for fast construction of guide trees for multiple sequence alignment

Gordon Blackshields¹, Fabian Sievers, Weifeng Shi, Andreas Wilm, Desmond G Higgins

Affiliations

PMID: 20470396
PMCID: PMC2893182
DOI: 10.1186/1748-7188-5-21

Sequence embedding for fast construction of guide trees for multiple sequence alignment

Gordon Blackshields et al. Algorithms Mol Biol. 2010.

. 2010 May 14:5:21.

doi: 10.1186/1748-7188-5-21.

Authors

Gordon Blackshields¹, Fabian Sievers, Weifeng Shi, Andreas Wilm, Desmond G Higgins

Affiliation

¹ UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland. fabian.sievers@ucd.ie.

PMID: 20470396
PMCID: PMC2893182
DOI: 10.1186/1748-7188-5-21

Abstract

Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.

Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.

Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.

PubMed Disclaimer

Figures

**Figure 1**
**Tree Distances**. Tree distances of SparseMap and mBed guide trees from full matrix guide trees for the BAliBase benchmark set (386 families), using the Robinson-Foulds metric. Data points above the bisectrix (red) indicate instances where the SparseMap tree is inferior to the mBed tree, and vice versa. Multiple data points may lie on top of each other.

**Figure 2**
**Complexity of Embedding**. Total time required to compute a full pair-wise distance matrix (red) is plotted against time taken to embed sequences (blue) for each entry in the HOMSTRAD/Pfam dataset (containing up to 10,000 sequences per entry).

**Figure 3**
**Times for embedding up to 300,000 tRNA sequences**. Number of calls to the d(*x, y*) distance function made during computation of a full pair-wise distance matrix (red), plotted against number of sequences for random subsets of Rfam entry RF00005 which contains 381,602 tRNA sequences. We only show the number of calls up to 40,000 sequences. In blue we show the times for embedding subsets up to 300,000 sequences in size. The full data set takes 40 minutes to embed.

**Figure 4**
**Variation in alignment score induced by choice of guide tree**. Alignment quality scores for a collection of five test cases (a-e) taken from the HOMSTRAD/Pfam dataset, and aligned with ClustalW using guide trees generated from a variety of sources. Quality scores using guide trees from ClustalW -quicktree and from mBed are shown as arrows. Scores from 1000 randomised guide trees are shown in dark blue. Scores from 1000 SparseMap guide trees are shown in light blue.

**Figure 5**
**PCA visualisation of embedded H3 Influenza virus sequences**. An embedding of 3994 GenBank haemaglutinin sequences from H3N2 influenza viruses, generated using mBed, and visualised using the first three axes of a PCA of the embedded vectors. Each sequence has been coloured by year of isolation to show the progression of sequence change between the years 1967 (blue) and 2008 (red).

See this image and copyright information in PMC

References

1. Hogeweg P, Hesper B. The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J Mol Evol. 1984;20(2):175–86. doi: 10.1007/BF02257378. - DOI - PubMed
1. Taylor WR. Multiple sequence alignment by a pairwise algorithm. Comput Appl Biosci. 1987;3(2):81–7. - PubMed
1. Feng DF, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987;25(4):351–60. doi: 10.1007/BF02603120. - DOI - PubMed
1. Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000;302:205–217. doi: 10.1006/jmbi.2000.4042. - DOI - PubMed
1. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22(22):4673–80. doi: 10.1093/nar/22.22.4673. - DOI - PMC - PubMed

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Sequence embedding for fast construction of guide trees for multiple sequence alignment

Affiliation

Sequence embedding for fast construction of guide trees for multiple sequence alignment

Authors

Affiliation

Abstract

Figures

References

LinkOut - more resources

Full Text Sources