Using CLUSTAL for multiple sequence alignments
- PMID: 8743695
- DOI: 10.1016/s0076-6879(96)66024-8
Abstract
We have tested CLUSTAL W in a wide variety of situations, and it is capable of handling some very difficult protein alignment problems. If the data set contains enough closely related sequences that the first alignments are accurate, then CLUSTAL W will usually find an alignment that is very close to ideal. Problems can still occur if the data set includes sequences of greatly different lengths or if some sequences include long regions that are impossible to align with the rest of the data set. Balancing the need for long insertions and deletions in some alignments against the need to avoid them in others is still a problem. The default values for our parameters were tested empirically using sets of globular proteins for which some information about the correct alignment was available. The parameter values may not be appropriate for nonglobular proteins.
We have argued that using one weight matrix and two gap penalties is too simplistic to be of general use in the most difficult cases. We have replaced these parameters with a large number of new parameters designed primarily to help encourage gaps in loop regions. Although these new parameters are largely heuristic in nature, they perform surprisingly well and are simple to implement. The underlying speed of the progressive alignment approach is not adversely affected. The disadvantage is that the parameter space is now huge; the number of possible combinations of parameters is more than can easily be examined by hand. We justify this by asking the user to treat CLUSTAL W as a data exploration tool rather than as a definitive analysis method. It is not sensible to derive multiple alignments automatically and to trust particular algorithms as being capable of always getting the correct answer. One must examine the alignments closely, especially in conjunction with the underlying phylogenetic tree (or an estimate of it), and try varying some of the parameters.
Outliers (sequences that have no close relatives) should be aligned carefully, as should fragments of sequences. The program will automatically delay the alignment of any sequences that are less than 40% identical to any others until all other sequences are aligned, although the user can change this threshold from a menu. It may be useful to build up an alignment of closely related sequences first and then to add the more distant relatives one at a time or in batches, using the profile alignments and weighting scheme described earlier and perhaps using a variety of parameter settings.
We give one example using SH2 domains. SH2 domains are widespread in eukaryotic signalling proteins, where they function in the recognition of phosphotyrosine-containing peptides. In the chapter by Bork and Gibson ([11], this volume), BLAST and pattern/profile searches were used to extract the set of known SH2 domains and to search for new members. (Profiles used in database searches are conceptually very similar to the profiles used in CLUSTAL W: see chapters [11] and [13] for profile search methods.) The profile searches detected SH2 domains in the JAK family of protein tyrosine kinases, which were thought not to contain SH2 domains. Although the JAK family SH2 domains are rather divergent, they have the necessary core structural residues as well as the critical positively charged residue that binds phosphotyrosine, leaving no doubt that they are bona fide SH2 domains. The five new JAK family SH2 domains were added sequentially to the existing alignment of 65 SH2 domains using the CLUSTAL W profile alignment option. Figure 6 shows part of the resulting alignment. Despite their divergent sequences, the new SH2 domains have been aligned nearly perfectly with the old set. No insertions were placed in the original SH2 domains. In this example, the profile alignment procedure has produced better results than a one-step full alignment of all 70 SH2 domains, and in considerably less time.
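The rule for delaying divergent sequences can be illustrated with a minimal Python sketch. This is not CLUSTAL W's own code: it assumes the sequences are already gapped to a common length (CLUSTAL W derives its identities from pairwise alignments), and the function names `percent_identity` and `split_by_divergence`, the `cutoff` parameter, and the toy sequences are all hypothetical. The sketch flags every sequence whose best identity to any other sequence falls below 40%, which is the set one might hold back and add later by profile alignment.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both sequences have a residue.

    Assumption: a and b are already aligned (same length, '-' for gaps).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def split_by_divergence(aligned: dict[str, str], cutoff: float = 40.0):
    """Split names into (core, delayed).

    A sequence is 'delayed' when its best identity to every other
    sequence is below cutoff (40% mirrors the default in the text).
    """
    best = {name: 0.0 for name in aligned}
    for p, q in combinations(aligned, 2):
        pid = percent_identity(aligned[p], aligned[q])
        best[p] = max(best[p], pid)
        best[q] = max(best[q], pid)
    core = [n for n in aligned if best[n] >= cutoff]
    delayed = [n for n in aligned if best[n] < cutoff]
    return core, delayed


# Toy data (hypothetical): three close relatives and one outlier.
toy = {
    "a": "ACDEFGHIKL",
    "b": "ACDEFGHIKV",
    "c": "ACDEYGHIKL",
    "x": "WWWWFWWWWW",
}
core, delayed = split_by_divergence(toy)
print("align first:", core)
print("add later:", delayed)
```

With the toy data, `a`, `b`, and `c` form the core and `x` is held back, since its best identity to any other sequence is 10%.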
(ABSTRACT TRUNCATED)