Mind the gaps: evidence of bias in estimates of multiple sequence alignments
- PMID: 17709332
- DOI: 10.1093/molbev/msm176
Abstract
Multiple sequence alignment (MSA) is a crucial first step in the analysis of genomic and proteomic data. Commonly occurring sequence features, such as deletions and insertions, are known to affect the accuracy of MSA programs, but the extent to which alignment accuracy is affected by the positions of insertions and deletions has not been examined independently of other sources of sequence variation. We assessed the performance of 6 popular MSA programs (ClustalW, DIALIGN-T, MAFFT, MUSCLE, PROBCONS, and T-COFFEE) and one experimental program, PRANK, on amino acid sequences that differed only by short regions of deleted residues. The analysis showed that the absence of residues often led to an incorrect placement of gaps in the alignments, even though the sequences were otherwise identical. In data sets containing sequences with partially overlapping deletions, most MSA programs preferentially aligned the gaps vertically at the expense of incorrectly aligning residues in the flanking regions. Of the programs assessed, only DIALIGN-T was able to place overlapping gaps correctly relative to one another, but this was usually context dependent and was observed only in some of the data sets. In data sets containing sequences with non-overlapping deletions, both DIALIGN-T and MAFFT (G-INS-I) were able to align gaps with near-perfect accuracy, but only MAFFT produced the correct alignment consistently. The same was true for data sets that comprised isoforms of alternatively spliced gene products: both DIALIGN-T and MAFFT produced highly accurate alignments, with MAFFT being the more consistent of the 2 programs. Other programs, notably T-COFFEE and ClustalW, were less accurate. For all data sets, alignments produced by different MSA programs differed markedly, indicating that reliance on a single MSA program may give misleading results. 
It is therefore advisable to use more than one MSA program when dealing with sequences that may contain deletions or insertions, particularly for high-throughput and pipeline applications where manual refinement of each alignment is not practicable.
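The test design described above (otherwise identical sequences that differ only by short deleted regions, so the correct gap placement is known) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the parent sequence, the deletion coordinates, the helper names, and the use of a sum-of-pairs score, which is a common way to compare a test alignment with a reference but is not necessarily the measure used in the study. The "stacked" alignment mimics the reported tendency to align partially overlapping gaps vertically.

```python
from itertools import combinations


def delete_region(parent, start, length):
    """Delete parent[start:start+length].

    Returns the ungapped sequence and its row in the true alignment,
    where the deleted residues are replaced by gap characters.
    """
    end = start + length
    return parent[:start] + parent[end:], parent[:start] + "-" * length + parent[end:]


def aligned_pairs(rows):
    """Collect every pair of residues that share an alignment column.

    Each residue is identified as (row index, position in the ungapped sequence).
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    counters = [0] * len(rows)
    pairs = set()
    for column in zip(*rows):
        present = []
        for i, ch in enumerate(column):
            if ch != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test_rows, reference_rows):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref = aligned_pairs(reference_rows)
    if not ref:
        return 1.0
    return len(ref & aligned_pairs(test_rows)) / len(ref)


# Hypothetical parent sequence and two partially overlapping deletions.
parent = "MKTAYIAKQRQISFVKSHFSRQ"
seq_a, row_a = delete_region(parent, 5, 4)   # deletes positions 5-8
seq_b, row_b = delete_region(parent, 7, 4)   # deletes positions 7-10

# Reference: each gap sits exactly where its residues were deleted.
reference = [parent, row_a, row_b]

# Biased alternative: the gap in seq_b is moved into the same columns as
# the gap in seq_a, which misaligns the residues flanking it.
stacked = [parent, row_a, seq_b[:5] + "-" * 4 + seq_b[5:]]

print("reference vs itself:", sum_of_pairs_score(reference, reference))
print("stacked vs reference:", sum_of_pairs_score(stacked, reference))
```

The same score could also be computed between the outputs of two different MSA programs to flag data sets on which they disagree, in line with the recommendation to use more than one program.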