2008 Dec 27;363(1512):3913-9.
doi: 10.1098/rstb.2008.0170.

A model of evolution and structure for multiple sequence alignment

Ari Löytynoja et al. Philos Trans R Soc Lond B Biol Sci.

Abstract

We have developed a phylogeny-aware progressive alignment method that recognizes insertions and deletions as distinct evolutionary events and thus avoids systematic errors created by traditional alignment methods. We now extend this method to simultaneously model regional heterogeneity and evolution. This novel method can be flexibly adapted to the alignment of nucleotide or amino acid sequences evolving under processes that vary over genomic regions. Being fully probabilistic, it provides an estimate of the regional heterogeneity of the evolutionary process along the alignment and a measure of the local reliability of the solution. Furthermore, the evolutionary modelling of the substitution process permits adjusting the sensitivity and specificity of the alignment and, if high specificity is the aim, leaving sequences unaligned when their divergence is too great for homology to be detected meaningfully.

Figures

Figure 1
The simplest non-homogeneous alignment model consists of non-emitting start and end states (light grey circles; S and E) and two structure classes (grey boxes; 1 and 2), each describing an evolutionary process of its own. Moves between structure classes and moves within a structure class are denoted with grey and black arrows, respectively. For clarity, the moves from character emitting states (white circles; Xi, Yi and Mi) back to a non-emitting linker state (light grey; Wi) are drawn via a dummy state (light grey, empty circles).
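The state layout described in this caption can be sketched as a small transition table. This is only an illustrative reading of the caption, not the authors' implementation: the function name `build_model`, the parameters `p_switch`, `p_end` and `p_gap`, their default values, and the exact set of allowed moves (for instance, that every emitting state may move directly to E) are all assumptions. The dummy state is a drawing aid in the figure and is omitted here.

```python
def build_model(n_classes=2, p_switch=0.05, p_end=0.01, p_gap=0.1):
    """Build states and transition probabilities for a toy multi-class pair HMM.

    All probabilities are hypothetical placeholders, not values from the paper.
    States: S (start), E (end), and per class i: Wi (non-emitting linker),
    Mi (match), Xi and Yi (gap in one sequence or the other).
    """
    states = ["S", "E"]
    for i in range(1, n_classes + 1):
        states += [f"W{i}", f"M{i}", f"X{i}", f"Y{i}"]

    # Probability of staying in the same structure class after emitting.
    p_stay = 1.0 - p_end - (p_switch if n_classes > 1 else 0.0)

    trans = {"S": {}}
    for i in range(1, n_classes + 1):
        # The start state enters any class through its linker state.
        trans["S"][f"W{i}"] = 1.0 / n_classes
        # The linker state chooses which emitting state to use next.
        trans[f"W{i}"] = {
            f"M{i}": 1.0 - 2.0 * p_gap,
            f"X{i}": p_gap,
            f"Y{i}": p_gap,
        }
        # Each emitting state returns to a linker state or ends the alignment.
        for kind in "MXY":
            out = {"E": p_end}
            for j in range(1, n_classes + 1):
                if j == i:
                    out[f"W{j}"] = p_stay  # move within the class
                else:
                    out[f"W{j}"] = p_switch / (n_classes - 1)  # move between classes
            trans[f"{kind}{i}"] = out
    return states, trans


if __name__ == "__main__":
    states, trans = build_model()
    for source, targets in trans.items():
        print(source, targets)
```

Routing every emitting state through the linker state Wi keeps the number of transition parameters small, because the choice of structure class is made separately from the choice of match or gap state.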
Figure 2
(a) A multiple alignment is built from pairwise alignments performed in order of decreasing relatedness, each alignment describing the ancestral node for the two nodes (extant or ancestral sequences) to be aligned. (b) The substitution process in each structure class is described by an instantaneous rate matrix Qi, here indicated by bubble plots showing the rates between different nucleotides as relative sizes of bubbles. In this example, structure classes 1 and 2 model regions of DNA sequence that evolve at 150 and 50 per cent of the average rate, respectively. (c) For each pairwise alignment, indicated by different shades in the tree (a), substitution probability matrices for every structure class are computed from the corresponding matrix Qi. The evolutionary divergence between the sequence/ancestral node pairs to be aligned varies, as shown by the relative length of horizontal bars in the tree, and the alignments contain unequal amounts of information to distinguish the two evolutionary processes. (i) Between human and chimpanzee, both fast and slowly evolving regions (left and right matrix, respectively) are mostly conserved and the diagonal bubbles indicating no change are dominant. In the alignment of (ii) primate ancestor to mouse and (iii) mammalian ancestor to chicken, the fast evolving regions (left matrix) contain greater numbers of substitutions and the off-diagonal bubbles are relatively bigger.
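The effect described in (c) can be illustrated numerically. The sketch below swaps in the Jukes-Cantor (JC69) model, whose substitution probabilities have a simple closed form, in place of the rate matrices Qi actually used in the paper, which are not given here. The rate multipliers 1.5 and 0.5 come from the caption; the function names and the divergence values in `DIVERGENCES` are hypothetical and chosen only to show the trend.

```python
import math

# Relative rates of the two structure classes, as stated in the caption.
CLASS_RATES = {"fast": 1.5, "slow": 0.5}

# Hypothetical divergences (expected substitutions per site at the average
# rate); these are illustrative values, not taken from the paper.
DIVERGENCES = {
    "human vs chimpanzee": 0.01,
    "primate ancestor vs mouse": 0.3,
    "mammalian ancestor vs chicken": 1.0,
}


def jc69_matrix(t, rate=1.0):
    """Return the 4x4 JC69 substitution probability matrix for divergence t.

    JC69 is an assumed stand-in model: all substitutions are equally likely,
    so P(t) = exp(Q * rate * t) reduces to two distinct values.
    """
    decay = math.exp(-4.0 * rate * t / 3.0)
    same = 0.25 + 0.75 * decay   # probability that a site is unchanged
    diff = 0.25 - 0.25 * decay   # probability of each specific substitution
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def class_matrices(t):
    """Compute one substitution probability matrix per structure class."""
    return {name: jc69_matrix(t, rate) for name, rate in CLASS_RATES.items()}


if __name__ == "__main__":
    for pair, t in DIVERGENCES.items():
        mats = class_matrices(t)
        print(pair, {name: round(m[0][0], 3) for name, m in mats.items()})
```

At small divergences the diagonal entries of both matrices are close to 1, so the two classes are hard to tell apart; as the divergence grows, the fast class loses diagonal mass more quickly, which is the extra information the caption refers to.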
Figure 3
The panels in (a)–(c) show the posterior probability of different structure classes (top) across the full alignment and (bottom) around the known protein-coding exons. In (a) and (b), the models fast/slow and codon, respectively, are used to align the human and mouse sequences; in (c), the model codon is used to align fifteen mammalian sequences. Light grey, dark grey and black represent the structure states modelling fast and slowly evolving sites and protein-coding regions, respectively. In (c), the addition of more distantly related sequences (dark grey and light grey frames in the tree correspond to panels in (i) and (ii), respectively) increases the evolutionary information, and the high posterior probability for the codon states (in black) more accurately matches the locations of known exons. The known locations of the coding exons are marked with black bars (top). The empty gaps in the plots indicate insertions in some other part of the tree.
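Posterior probabilities of this kind are typically obtained with the forward-backward algorithm. The following is a deliberately reduced sketch, assuming a fixed pairwise alignment whose columns are summarised only as "match" or "mismatch" and a plain two-class hidden Markov model over columns; the paper's model works on the alignment itself and is richer than this. The function name `class_posteriors`, the emission probabilities and the `stay` parameter are hypothetical.

```python
def class_posteriors(columns, emit, stay=0.9):
    """Posterior probability of each structure class at each alignment column.

    columns: non-empty list of observations, e.g. "match" or "mismatch".
    emit:    emit[cls][observation] = emission probability (hypothetical values).
    stay:    probability of remaining in the same class between columns;
             requires at least two classes.
    No rescaling is done, so this is only suitable for short inputs.
    """
    classes = list(emit)
    n = len(columns)
    switch = (1.0 - stay) / (len(classes) - 1)

    def trans(a, b):
        return stay if a == b else switch

    # Forward pass: probability of the columns so far, ending in class c.
    fwd = [{c: emit[c][columns[0]] / len(classes) for c in classes}]
    for obs in columns[1:]:
        prev = fwd[-1]
        fwd.append({
            c: emit[c][obs] * sum(prev[p] * trans(p, c) for p in classes)
            for c in classes
        })

    # Backward pass: probability of the remaining columns, given class c.
    bwd = [None] * n
    bwd[-1] = {c: 1.0 for c in classes}
    for k in range(n - 2, -1, -1):
        bwd[k] = {
            c: sum(trans(c, q) * emit[q][columns[k + 1]] * bwd[k + 1][q]
                   for q in classes)
            for c in classes
        }

    total = sum(fwd[-1].values())  # likelihood of all columns
    return [{c: fwd[k][c] * bwd[k][c] / total for c in classes}
            for k in range(n)]


if __name__ == "__main__":
    emit = {
        "fast": {"match": 0.6, "mismatch": 0.4},
        "slow": {"match": 0.95, "mismatch": 0.05},
    }
    cols = ["match"] * 6 + ["mismatch"] * 4 + ["match"] * 6
    for k, post in enumerate(class_posteriors(cols, emit)):
        print(k, cols[k], round(post["slow"], 3))
```

In this toy setting a run of conserved columns pulls the posterior towards the slow class and a run of mismatches pulls it towards the fast class, which mirrors how the plotted posteriors track regions of differing evolutionary rate.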
