Multiple sequence alignment with user-defined anchor points

Burkhard Morgenstern¹, Sonja J Prohaska, Dirk Pöhler, Peter F Stadler

Affiliation

¹ Universität Göttingen, Institut für Mikrobiologie und Genetik, Abteilung für Bioinformatik, Goldschmidtstrasse, 1, D-37077 Göttingen, Germany. burkhard@gobics.de

PMID: 16722533
PMCID: PMC1481597
DOI: 10.1186/1748-7188-1-6

Algorithms Mol Biol. 2006 Apr 19;1(1):6. doi: 10.1186/1748-7188-1-6.

Abstract

Background: Automated software tools for multiple alignment often fail to produce biologically meaningful results. In such situations, expert knowledge can help to improve the quality of alignments.

Results: Herein, we describe a semi-automatic version of the alignment program DIALIGN that can take pre-defined constraints into account. It is possible for the user to specify parts of the sequences that are assumed to be homologous and should therefore be aligned to each other. Our software program can use these sites as anchor points by creating a multiple alignment respecting these constraints. This way, our alignment method can produce alignments that are biologically more meaningful than alignments produced by fully automated procedures. As a demonstration of how our method works, we apply our approach to genomic sequences around the Hox gene cluster and to a set of DNA-binding proteins. As a by-product, we obtain insights about the performance of the greedy algorithm that our program uses for multiple alignment and about the underlying objective function. This information will be useful for the further development of DIALIGN. The described alignment approach has been integrated into the TRACKER software system.
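The idea of respecting user-defined anchor points can be illustrated with a small sketch. It is not DIALIGN's implementation: it assumes a simplified pairwise setting in which an anchor is a single pair of positions and a candidate fragment is a gap-free local alignment, and it only rejects fragments that would cross an anchor. The names `Anchor`, `Fragment`, `respects_anchor` and `filter_fragments` are hypothetical, and the transitive consistency checks needed for more than two sequences are omitted.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    """Hypothetical anchor: position pos_a in sequence A must align to pos_b in B."""
    pos_a: int
    pos_b: int


class Fragment(NamedTuple):
    """Hypothetical gap-free local alignment of `length` residues."""
    start_a: int
    start_b: int
    length: int


def respects_anchor(frag: Fragment, anchor: Anchor) -> bool:
    """True if no residue pair of the fragment crosses the anchor.

    A pair (i, j) is compatible with anchor (p, q) when i and j lie on the
    same side of p and q respectively, or when (i, j) is the anchor itself.
    """
    for k in range(frag.length):
        da = frag.start_a + k - anchor.pos_a
        db = frag.start_b + k - anchor.pos_b
        if (da > 0) != (db > 0) or (da < 0) != (db < 0):
            return False
    return True


def filter_fragments(frags: List[Fragment], anchors: List[Anchor]) -> List[Fragment]:
    """Keep only fragments compatible with every anchor (pairwise case only)."""
    return [f for f in frags if all(respects_anchor(f, a) for a in anchors)]


# Illustrative data, not taken from the paper.
anchors = [Anchor(pos_a=10, pos_b=12)]
candidates = [
    Fragment(2, 3, 5),    # entirely to the left of the anchor in both sequences
    Fragment(8, 14, 4),   # left of the anchor in A but right of it in B
    Fragment(10, 12, 3),  # starts exactly at the anchor
    Fragment(11, 5, 2),   # right of the anchor in A but left of it in B
]
kept = filter_fragments(candidates, anchors)
```

In this toy setting the anchor acts as a hard constraint: fragments that contradict it are discarded before any alignment is assembled, and the remaining fragments are left to the automatic procedure.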

Figures

**Figure 1**
Possible mis-alignments caused by tandem duplications in the segment-based alignment approach (DIALIGN). We assume that various instances of a motif are contained in the input sequence set and that the degree of similarity among the different instances is approximately equal. For simplicity, we also assume that the sequences do not share any similarity outside the conserved motif. Lines connecting the sequences denote fragments identified by DIALIGN in the respective pairwise alignment procedures. (A) If a tandem duplication occurs in two sequences, the correct alignment will be found since the algorithm identifies a *chain* of local alignments with maximum *total* score. (B) If a motif is duplicated in one sequence but only one instance M₂ is contained in the second sequence, it may happen that M₂ is split up and aligned to different instances of the motif in the first sequence. (C) If the motif is duplicated in the first sequence but only one instance of it is contained in sequences two and three, respectively, *consistency* conflicts can occur. In this case, local similarities identified in the respective pairwise alignments cannot be integrated into one single output alignment. To select a consistent subset of these pairwise similarities, DIALIGN uses a *greedy* heuristic. Depending on the degree of similarity among the instances of the motif, the greedy approach may lead to serious mis-alignments (D).
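Cases (A) and (B) can be reproduced with a minimal sketch of fragment chaining. It assumes a simple quadratic dynamic program over gap-free fragments and is not the algorithm as implemented in DIALIGN; the names `Frag` and `best_chain`, the coordinates and the scores are all invented for illustration. In particular, the scores for case (B) are chosen so that two half-motif fragments together outscore one full-motif fragment.

```python
from typing import List, NamedTuple, Tuple


class Frag(NamedTuple):
    """Hypothetical gap-free pairwise fragment with an illustrative score."""
    start_a: int
    start_b: int
    length: int
    score: float


def best_chain(frags: List[Frag]) -> Tuple[float, List[Frag]]:
    """Return the chain of fragments with maximum total score.

    Fragment f may precede g only if f ends strictly before g starts
    in both sequences. Simple O(n^2) dynamic programming.
    """
    order = sorted(frags, key=lambda f: (f.start_a, f.start_b))
    best = [f.score for f in order]   # best chain score ending at each fragment
    prev = [-1] * len(order)          # predecessor index for traceback
    for i, g in enumerate(order):
        for j in range(i):
            f = order[j]
            ends_before = (f.start_a + f.length - 1 < g.start_a
                           and f.start_b + f.length - 1 < g.start_b)
            if ends_before and best[j] + g.score > best[i]:
                best[i] = best[j] + g.score
                prev[i] = j
    if not order:
        return 0.0, []
    i = max(range(len(order)), key=lambda k: best[k])
    total, chain = best[i], []
    while i != -1:
        chain.append(order[i])
        i = prev[i]
    return total, chain[::-1]


# Case (B): motif at 0-9 and 20-29 in sequence 1, single instance at 0-9 in sequence 2.
case_b = [
    Frag(0, 0, 10, 10.0),   # first copy vs. the whole of M2
    Frag(20, 0, 10, 10.0),  # second copy vs. the whole of M2
    Frag(0, 0, 5, 6.0),     # first half of M2 vs. first copy
    Frag(25, 5, 5, 6.0),    # second half of M2 vs. second copy
]
total_b, chain_b = best_chain(case_b)

# Case (A): tandem duplication in both sequences.
case_a = [
    Frag(0, 0, 10, 10.0), Frag(0, 20, 10, 10.0),
    Frag(20, 0, 10, 10.0), Frag(20, 20, 10, 10.0),
]
total_a, chain_a = best_chain(case_a)
```

Under these made-up scores, the best chain for case (A) pairs the two copies in order, whereas for case (B) it splits the single motif across both copies, which is the kind of mis-alignment the caption describes.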

**Figure 2**
The pufferfish *Takifugu rubripes* has seven *Hox* clusters, of which we use four in our computational example. The *Evx* gene, another homeodomain transcription factor, is usually linked with the *Hox* genes and can be considered part of the *Hox* cluster. The paralogy groups are indicated. Filled boxes indicate intact *Hox* genes; the open box indicates a *HoxA7a* pseudogene [45].

**Figure 3**
Result of a DIALIGN run on the *Hox* sequences from Figure 2 without anchoring. The diagram represents sequences and gene positions to scale. All incorrectly aligned segments (defined as parts of a gene that are aligned with parts of a gene from a different paralogy group) are indicated by lines between the sequences.

**Figure 4**
Anchored and non-anchored alignment of a set of protein sequences with known 3D structure (data set 1r69 from BAliBASE [38]). Three *core blocks* for which the 'correct' alignment is known are shown in red, blue and green. **(A)** Alignment calculated by DIALIGN with default options. Most of the core blocks are mis-aligned. **(B)** Alignment calculated by DIALIGN with the *anchoring* option. The first position of the third block has been used as an anchor point, i.e. the program has *been forced* to align this column correctly. The remainder of the sequences is aligned automatically by DIALIGN under the constraints defined by this anchor point. Although only a single column has been used for anchoring, the three blocks are almost perfectly aligned.

References

1. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research. 1994;22:4673–4680.
2. Morgenstern B. DIALIGN: Multiple DNA and Protein Sequence Alignment at BiBiServ. Nucleic Acids Research. 2004;32:W33–W36. doi: 10.1093/nar/gnh029.
3. Notredame C, Higgins D, Heringa J. T-Coffee: a novel algorithm for multiple sequence alignment. J Mol Biol. 2000;302:205–217. doi: 10.1006/jmbi.2000.4042.
4. Notredame C. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics. 2002;3:131–144. doi: 10.1517/14622416.3.1.131.
5. Lee C, Grasso C, Sharlow MF. Multiple sequence alignment using partial order graphs. Bioinformatics. 2002;18:452–464. doi: 10.1093/bioinformatics/18.3.452.
