Algorithms Mol Biol. 2006 Apr 19;1(1):6.
doi: 10.1186/1748-7188-1-6.

Multiple sequence alignment with user-defined anchor points

Burkhard Morgenstern et al. Algorithms Mol Biol. 2006.

Abstract

Background: Automated software tools for multiple alignment often fail to produce biologically meaningful results. In such situations, expert knowledge can help to improve the quality of alignments.

Results: Herein, we describe a semi-automatic version of the alignment program DIALIGN that can take pre-defined constraints into account. It is possible for the user to specify parts of the sequences that are assumed to be homologous and should therefore be aligned to each other. Our software program can use these sites as anchor points by creating a multiple alignment respecting these constraints. This way, our alignment method can produce alignments that are biologically more meaningful than alignments produced by fully automated procedures. As a demonstration of how our method works, we apply our approach to genomic sequences around the Hox gene cluster and to a set of DNA-binding proteins. As a by-product, we obtain insights about the performance of the greedy algorithm that our program uses for multiple alignment and about the underlying objective function. This information will be useful for the further development of DIALIGN. The described alignment approach has been integrated into the TRACKER software system.
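The effect of an anchor point on a pairwise alignment can be illustrated with a small sketch. It is not DIALIGN's implementation: the `Anchor` and `Fragment` types and both function names are hypothetical, positions are 0-based, and only two sequences are considered. The sketch assumes that a gap-free fragment respects an anchor when each of its residue pairs either coincides with an anchored pair or lies strictly before or strictly after it in both sequences.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    """User-defined anchor (hypothetical type): pos_a in A is aligned with pos_b in B."""
    pos_a: int
    pos_b: int
    length: int = 1


class Fragment(NamedTuple):
    """Gap-free local alignment (hypothetical type) between sequences A and B."""
    start_a: int
    start_b: int
    length: int


def respects_anchor(frag: Fragment, anchor: Anchor) -> bool:
    """Return True if no residue pair of frag crosses or contradicts the anchor."""
    for k in range(frag.length):
        x, y = frag.start_a + k, frag.start_b + k
        for j in range(anchor.length):
            p, q = anchor.pos_a + j, anchor.pos_b + j
            same = x == p and y == q
            before = x < p and y < q
            after = x > p and y > q
            if not (same or before or after):
                return False
    return True


def filter_fragments(fragments, anchors):
    """Keep only the fragments that are compatible with every anchor."""
    return [f for f in fragments if all(respects_anchor(f, a) for a in anchors)]


anchor = Anchor(pos_a=10, pos_b=20)
candidates = [
    Fragment(0, 5, 4),    # entirely before the anchor in both sequences
    Fragment(8, 25, 3),   # before the anchor in A but after it in B: crosses it
    Fragment(9, 19, 3),   # on the anchor's diagonal and passing through it
    Fragment(10, 21, 2),  # aligns the anchored position in A to another position in B
]
kept = filter_fragments(candidates, [anchor])
```

In a multiple alignment the same ordering constraint has to hold transitively across all sequences, which this pairwise sketch does not cover.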

Figures

Figure 1
Possible mis-alignments caused by tandem duplications in the segment-based alignment approach (DIALIGN). We assume that various instances of a motif are contained in the input sequence set and that the degree of similarity among the different instances is approximately equal. For simplicity, we also assume that the sequences do not share any similarity outside the conserved motif. Lines connecting the sequences denote fragments identified by DIALIGN in the respective pairwise alignment procedures. (A) If a tandem duplication occurs in two sequences, the correct alignment will be found since the algorithm identifies a chain of local alignments with maximum total score. (B) If a motif is duplicated in one sequence but only one instance M2 is contained in the second sequence, it may happen that M2 is split up and aligned to different instances of the motif in the first sequence. (C) If the motif is duplicated in the first sequence but only one instance of it is contained in sequences two and three, respectively, consistency conflicts can occur. In this case, local similarities identified in the respective pairwise alignments cannot be integrated into one single output alignment. To select a consistent subset of these pairwise similarities, DIALIGN uses a greedy heuristic. Depending on the degree of similarity among the instances of the motif, the greedy approach may lead to serious mis-alignments (D).
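The consistency conflict in panel (C) and the greedy selection can be reproduced with a minimal sketch. This is an illustration under simplifying assumptions, not the program's actual code: the `Fragment` type, the function names, and the weights are invented for the example, and consistency is re-checked from scratch for every candidate. A set of aligned residue pairs is treated as consistent when the alignment columns it implies can be put in a linear order that agrees with the order of residues in every sequence, that is, when the column graph is acyclic.

```python
from collections import defaultdict
from typing import NamedTuple


class Fragment(NamedTuple):
    """Gap-free local alignment between two sequences (hypothetical type)."""
    weight: float
    seq_a: str
    start_a: int
    seq_b: str
    start_b: int
    length: int


def fragment_pairs(frag):
    """Expand a fragment into its aligned (sequence, position) site pairs."""
    return [((frag.seq_a, frag.start_a + k), (frag.seq_b, frag.start_b + k))
            for k in range(frag.length)]


def is_consistent(pairs):
    """True if the aligned pairs fit into one alignment without conflicts."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge aligned sites into columns (union-find).
    for a, b in pairs:
        parent[find(a)] = find(b)

    # Successive aligned sites of a sequence induce an edge between their columns.
    by_seq = defaultdict(list)
    for seq, pos in parent:
        by_seq[seq].append(pos)
    succ, indeg = defaultdict(list), defaultdict(int)
    for seq, positions in by_seq.items():
        positions.sort()
        for p, q in zip(positions, positions[1:]):
            u, v = find((seq, p)), find((seq, q))
            succ[u].append(v)
            indeg[v] += 1

    # Kahn's algorithm: every column is reached only if the graph has no cycle.
    columns = {find(site) for site in parent}
    ready = [c for c in columns if indeg[c] == 0]
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return visited == len(columns)


def greedy_select(fragments):
    """Accept fragments by decreasing weight if consistent with those accepted so far."""
    accepted, pairs = [], []
    for frag in sorted(fragments, key=lambda f: f.weight, reverse=True):
        candidate = pairs + fragment_pairs(frag)
        if is_consistent(candidate):
            accepted.append(frag)
            pairs = candidate
    return accepted


# Panel (C): the motif occurs twice in s1 (positions 0-3 and 4-7), once in s2 and s3.
fragments = [
    Fragment(3.0, "s2", 0, "s1", 0, 4),  # s2 motif aligned to the first copy in s1
    Fragment(2.5, "s3", 0, "s1", 4, 4),  # s3 motif aligned to the second copy in s1
    Fragment(2.0, "s2", 0, "s3", 0, 4),  # s2 motif aligned to the s3 motif
]
chosen = greedy_select(fragments)
```

With these made-up weights the two higher-scoring fragments are accepted first, so the fragment joining the motifs in `s2` and `s3` is rejected: accepting it would place two different positions of `s1` in the same column.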
Figure 2
The pufferfish Takifugu rubripes has seven Hox clusters, of which we use four in our computational example. The Evx gene, another homeodomain transcription factor, is usually linked with the Hox genes and can be considered part of the Hox cluster. The paralogy groups are indicated. Filled boxes indicate intact Hox genes; the open box indicates a HoxA7a pseudogene [45].
Figure 3
Result of a DIALIGN run on the Hox sequences from Figure 2 without anchoring. The diagram represents sequences and gene positions to scale. All incorrectly aligned segments (defined as parts of a gene that are aligned with parts of a gene from a different paralogy group) are indicated by lines between the sequences.
Figure 4
Anchored and non-anchored alignment of a set of protein sequences with known 3D structure (data set lr69 from BAliBASE [38]). Three core blocks for which the 'correct' alignment is known are shown in red, blue and green. (A) Alignment calculated by DIALIGN with default options. Most of the core blocks are mis-aligned. (B) Alignment calculated by DIALIGN with the anchoring option. The first position of the third block has been used as an anchor point, i.e. the program has been forced to align this column correctly. The rest of the sequences are aligned automatically by DIALIGN, given the constraints defined by this anchor point. Although only a single column has been used for anchoring, the three blocks are almost perfectly aligned.

References

    1. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research. 1994;22:4673–4680.
    2. Morgenstern B. DIALIGN: Multiple DNA and Protein Sequence Alignment at BiBiServ. Nucleic Acids Research. 2004;32:W33–W36. doi: 10.1093/nar/gnh029.
    3. Notredame C, Higgins D, Heringa J. T-Coffee: a novel algorithm for multiple sequence alignment. J Mol Biol. 2000;302:205–217. doi: 10.1006/jmbi.2000.4042.
    4. Notredame C. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics. 2002;3:131–144. doi: 10.1517/14622416.3.1.131.
    5. Lee C, Grasso C, Sharlow MF. Multiple sequence alignment using partial order graphs. Bioinformatics. 2002;18:452–464. doi: 10.1093/bioinformatics/18.3.452.