Gaps in structurally similar proteins: towards improvement of multiple sequence alignment
- PMID: 14705025
- DOI: 10.1002/prot.10508
Abstract
An algorithm was developed to locally optimize gaps from the FSSP database. Over 2 million gaps were identified from all versus all FSSP structure comparisons, and datasets of non-identical gaps and flanking regions comprising between 90,000 and 135,000 sequence fragments were extracted for statistical analysis. Relative to background frequencies, gaps were enriched in residue types with small side chains and high turn propensity (D, G, N, P, S), and were depleted in residue types with hydrophobic side chains (C, F, I, L, V, W, Y). In contrast, regions flanking a gap exhibited opposite trends in amino acid frequencies, i.e., enrichment in hydrophobic residues and a high degree of secondary structure. Log-odds scores of residue type as a function of position in or around a gap were derived from the statistics. Three simple experiments demonstrated that these scores contained significant predictive information. First, regions where gaps were observed in single sequences taken from HOMSTRAD structure-based multiple sequence alignments generally scored higher than regions where gaps were not observed. Second, given the correct pairwise-aligned cores, the actual positions of gaps could be reproduced from sequence more accurately using the structurally-derived statistics than by using random pairwise alignments. Finally, revision of the Clustal-W residue-specific gap opening parameters with this new information improved the agreement of Clustal-W alignments with the structure-based alignments. At least three applications for these results are envisioned: improvement of gap penalties in pairwise (or multiple) sequence alignment, prediction of regions of single sequences likely (or unlikely) to contain indels, and more accurate placement of gaps in automated pairwise structure alignment.
Copyright 2003 Wiley-Liss, Inc.
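The abstract describes log-odds scores of residue type as a function of position in or around a gap, but does not give the formula or the data layout. The following is a minimal sketch of one common way such scores could be computed, not the authors' implementation. It assumes that fragments are equal-length strings aligned on the gap position, that scores are base-2 logarithms of observed over background frequency, and that add-one pseudocounts are used. The function names, the toy fragments, and the uniform background are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_by_position(fragments, background, pseudocount=1.0):
    """Return one {residue: log2-odds} dict per fragment position.

    fragments: equal-length strings aligned on the gap (assumed layout).
    background: {residue: background frequency}.
    """
    length = len(fragments[0])
    # Denominator includes the pseudocount mass added to every residue type.
    total = len(fragments) + pseudocount * len(AMINO_ACIDS)
    scores = []
    for pos in range(length):
        counts = Counter(frag[pos] for frag in fragments)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background[aa])
        scores.append(column)
    return scores


def score_window(window, scores):
    """Sum the position-specific log-odds over a sequence window."""
    return sum(scores[i][aa] for i, aa in enumerate(window))


# Toy, made-up data: position 0 is a flanking residue, positions 1-2 lie in the gap.
toy_fragments = ["LGD", "VGN", "IPS", "LGG"]
uniform_background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
toy_scores = log_odds_by_position(toy_fragments, uniform_background)
```

Under these assumptions, a residue that is over-represented at a gap position receives a positive score and a depleted one a negative score, so `score_window` can rank candidate windows of a single sequence by how gap-like they look. Real background frequencies would come from the full dataset rather than a uniform distribution.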