Local homology recognition and distance measures in linear time using compressed amino acid alphabets

Robert C Edgar¹

Affiliations

PMID: 14729922
PMCID: PMC373290
DOI: 10.1093/nar/gkh180

Local homology recognition and distance measures in linear time using compressed amino acid alphabets

Robert C Edgar. Nucleic Acids Res. 2004.

. 2004 Jan 16;32(1):380-5.

doi: 10.1093/nar/gkh180. Print 2004.

Author

Robert C Edgar¹

Affiliation

¹ bob@drive5.com

PMID: 14729922
PMCID: PMC373290
DOI: 10.1093/nar/gkh180

Abstract

Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. The method for achieving comparable coverage to FFT is revealed here, and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.

PubMed Disclaimer

Figures

**Figure 1**
Here we show a core block from the CLUSTAL_W alignment of selected sequences containing SH2 domains. The upper alignment uses A, the standard amino acid alphabet. The lower version is the same alignment presented in the SE-V(10) alphabet (Table 1) in which members of a class are represented by its first letter in alphabetical order (e.g. I, L, M and V are shown as I). Columns that are perfectly conserved in the given alphabet are indicated with upper case letters and an asterisk (*) below. The number of conserved columns increases from 17 in A to 25 in SE-V(10). The number of fully conserved k-mers increases from 6 to 13 for *k =* 3 and from 4 to 9 for k = 4.

**Figure 2**
Scatterplots show the correlation between D, the fractional identity in the full alphabet computed from CLUSTAL_W alignments, and the fractional common k-mer count (F, equation 3) or the k-mer distance (Y, equation 4). We show three selected cases: the full alphabet A, k = 3 [the parameters used in (16)]; Dayhoff(6), k = 6 (used by MAFFT); and an intermediate case, SE-B(10), k = 4. Note that the relationship between D and Y is approximately linear.

**Figure 3**
The geometry of a diagonal (bold line) in a dynamic programming matrix. The length of the diagonal is λ, the margin is µ. If the diagonal can be identified in a pre-processing step such as k-mer extension or FFT, then the shaded region can be excluded, reducing the time needed to compute a global alignment by dynamic programming.

See this image and copyright information in PMC

References

1. Altschul S.F., Gish,W., Miller,W., Myers,E.E. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410. - PubMed
1. Sneath P.H.A. and Sokal,R.R. (1973) Numerical Taxonomy. Freeman, San Francisco, CA.
1. Saitou N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425. - PubMed
1. Thompson J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL_W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680. - PMC - PubMed
1. Durbin R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge, UK.

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Local homology recognition and distance measures in linear time using compressed amino acid alphabets

Affiliation

Local homology recognition and distance measures in linear time using compressed amino acid alphabets

Author

Affiliation

Abstract

Figures

References

MeSH terms

Substances

LinkOut - more resources

Full Text Sources

Other Literature Sources

Miscellaneous