A model of evolution and structure for multiple sequence alignment

Ari Löytynoja¹, Nick Goldman

PMID: 18852103
PMCID: PMC2592536
DOI: 10.1098/rstb.2008.0170

Philos Trans R Soc Lond B Biol Sci. 2008 Dec 27;363(1512):3913-9.

Affiliation

¹ EMBL-European Bioinformatics Institute, Hinxton, UK. ari@ebi.ac.uk

Abstract

We have developed a phylogeny-aware progressive alignment method that recognizes insertions and deletions as distinct evolutionary events and thus avoids systematic errors created by traditional alignment methods. We now extend this method to simultaneously model regional heterogeneity and evolution. This novel method can be flexibly adapted to the alignment of nucleotide or amino acid sequences evolving under processes that vary over genomic regions and, being fully probabilistic, provides an estimate of the regional heterogeneity of the evolutionary process along the alignment and a measure of the local reliability of the solution. Furthermore, the evolutionary modelling of the substitution process permits adjusting the sensitivity and specificity of the alignment and, if high specificity is the aim, leaving sequences unaligned when their divergence is beyond a meaningful detection of homology.

Figures

**Figure 1**
The simplest non-homogeneous alignment model consists of non-emitting start and end states (light grey circles; S and E) and two structure classes (grey boxes; 1 and 2), each describing an evolutionary process of its own. Moves between structure classes and moves within a structure class are denoted with grey and black arrows, respectively. For clarity, the moves from character emitting states (white circles; X_i, Y_i and M_i) back to a non-emitting linker state (light grey; W_i) are drawn via a dummy state (light grey, empty circles).

**Figure 2**
(a) A multiple alignment is built from pairwise alignments performed in order of decreasing relatedness, each alignment describing the ancestral node for the two nodes (extant or ancestral sequences) to be aligned. (b) The substitution process in each structure class is described by an instantaneous rate matrix Q_i, shown here as plots in which the relative sizes of the bubbles indicate the rates between different nucleotides. In this example, structure classes 1 and 2 model regions of DNA sequence that evolve at 150 and 50 per cent of the average rate, respectively. (c) For each pairwise alignment, indicated by different shades in the tree (a), substitution probability matrices for every structure class are computed from the corresponding matrix Q_i. The evolutionary divergence between the sequence/ancestral node pairs to be aligned varies, as shown by the relative length of horizontal bars in the tree, and the alignments contain unequal amounts of information to distinguish the two evolutionary processes. (i) Between human and chimpanzee, both fast and slowly evolving regions (left and right matrix, respectively) are mostly conserved and the diagonal bubbles indicating no change are dominant. In the alignment of (ii) primate ancestor to mouse and (iii) mammalian ancestor to chicken, the fast evolving regions (left matrix) contain greater numbers of substitutions and the off-diagonal bubbles are relatively bigger.
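The step described in panel (c), turning a rate matrix Q_i into substitution probabilities for a given divergence, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Jukes-Cantor-style rate matrix scaled by the class multiplier (1.5 for the fast class, 0.5 for the slow class, as in the caption), and the three branch lengths are hypothetical values chosen only to mimic increasing divergence. The function names are likewise illustrative. The matrix exponential P(t) = exp(Q t) is approximated with a Taylor series plus scaling and squaring.

```python
import math

def rate_matrix(multiplier):
    """Jukes-Cantor-style Q (an assumption) with overall rate `multiplier`."""
    off = multiplier / 3.0
    return [[-multiplier if i == j else off for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def substitution_probs(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a Taylor series with scaling and squaring."""
    n = len(q)
    step = t / (2 ** squarings)
    small = [[x * step for x in row] for row in q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds (Q * step)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling
    return result

def jc69_identity_prob(multiplier, t):
    """Closed-form diagonal entry of P(t) for the matrix above (sanity check)."""
    return 0.25 + 0.75 * math.exp(-4.0 * multiplier * t / 3.0)

CLASS_RATES = {"fast": 1.5, "slow": 0.5}              # multipliers from the caption
BRANCH_LENGTHS = {"i": 0.01, "ii": 0.2, "iii": 0.5}   # hypothetical divergences

def diagonal_table():
    """Probability of no change for each (alignment, structure class) pair."""
    table = {}
    for label, t in BRANCH_LENGTHS.items():
        for name, mult in CLASS_RATES.items():
            table[(label, name)] = substitution_probs(rate_matrix(mult), t)[0][0]
    return table

if __name__ == "__main__":
    for key, value in diagonal_table().items():
        print(key, round(value, 4))
```

Under these assumptions the diagonal entries stay close to 1 for the shortest divergence in both classes and fall faster in the fast class as divergence grows, which is the pattern the bubble plots are described as showing.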

**Figure 3**
The panels in (a)–(c) show the posterior probability of different structure classes (top) across the full alignment and (bottom) around the known protein-coding exons. In (a) and (b), the fast/slow and codon models, respectively, are used to align the human and mouse sequences; in (c), the codon model is used to align fifteen mammalian sequences. Light grey, dark grey and black represent the structure states modelling fast and slowly evolving sites and protein-coding regions, respectively. In (c), the addition of more distantly related sequences (dark grey and light grey frames in the tree correspond to panels in (i) and (ii), respectively) increases the evolutionary information, and the high posterior probability for the codon states (in black) more accurately matches the locations of known exons. The known locations of the coding exons are marked with black bars (top). The empty gaps in the plots indicate insertions in some other part of the tree.
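How a posterior probability per structure class can arise along an alignment may be easier to see in a toy example. The sketch below is a strong simplification and not the method of the paper: it takes an already fixed, gap-free pairwise alignment, uses only two classes (fast and slow) with Jukes-Cantor emissions, and runs a standard forward-backward pass over the columns. The probability of staying in a class (`stay`), the divergence `t` and all function names are assumptions made for illustration.

```python
import math

def identity_prob(rate, t):
    """Jukes-Cantor probability that a site is unchanged after divergence t."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * t / 3.0)

def class_posteriors(columns, rates, t, stay=0.9):
    """Posterior probability of each rate class for every alignment column.

    columns: list of (a, b) nucleotide pairs from a fixed pairwise alignment.
    rates:   rate multiplier of each class (at least two classes).
    """
    k = len(rates)
    if k < 2:
        raise ValueError("need at least two structure classes")
    switch = (1.0 - stay) / (k - 1)
    trans = [[stay if i == j else switch for j in range(k)] for i in range(k)]

    def emit(c, a, b):
        # Joint probability of the pair: uniform base frequency times P_ab(t).
        same = identity_prob(rates[c], t)
        return 0.25 * (same if a == b else (1.0 - same) / 3.0)

    n = len(columns)
    fwd, scales = [], []
    prev = [1.0 / k] * k  # uniform prior over classes at the first column
    for idx, (a, b) in enumerate(columns):
        if idx == 0:
            cur = [prev[c] * emit(c, a, b) for c in range(k)]
        else:
            cur = [sum(prev[p] * trans[p][c] for p in range(k)) * emit(c, a, b)
                   for c in range(k)]
        total = sum(cur)
        cur = [x / total for x in cur]  # rescale to avoid underflow
        fwd.append(cur)
        scales.append(total)
        prev = cur

    bwd = [[1.0] * k for _ in range(n)]
    for idx in range(n - 2, -1, -1):
        a, b = columns[idx + 1]
        for c in range(k):
            bwd[idx][c] = sum(trans[c][nx] * emit(nx, a, b) * bwd[idx + 1][nx]
                              for nx in range(k)) / scales[idx + 1]

    posteriors = []
    for idx in range(n):
        weights = [fwd[idx][c] * bwd[idx][c] for c in range(k)]
        z = sum(weights)
        posteriors.append([w / z for w in weights])
    return posteriors

def demo():
    """Twenty conserved columns followed by twenty mismatching columns."""
    seq_a = "ACGT" * 10
    seq_b = "ACGT" * 5 + "CATG" * 5
    return class_posteriors(list(zip(seq_a, seq_b)), rates=[1.5, 0.5], t=0.3)

if __name__ == "__main__":
    for i, (fast, slow) in enumerate(demo()):
        print(i, round(fast, 3), round(slow, 3))
```

In this toy setting the slow class gains posterior weight over the conserved stretch and the fast class over the divergent stretch. The paper's model differs in that the alignment itself is inferred together with the structure classes.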

Cited by

Linking genomics and ecology to investigate the complex evolution of an invasive Drosophila pest.
Ometto L, Cestaro A, Ramasamy S, Grassi A, Revadi S, Siozios S, Moretto M, Fontana P, Varotto C, Pisani D, Dekker T, Wrobel N, Viola R, Pertot I, Cavalieri D, Blaxter M, Anfora G, Rota-Stabelli O. Genome Biol Evol. 2013;5(4):745-57. doi: 10.1093/gbe/evt034. PMID: 23501831.
webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser.
Löytynoja A, Goldman N. BMC Bioinformatics. 2010 Nov 26;11:579. doi: 10.1186/1471-2105-11-579. PMID: 21110866.
PhyloSim - Monte Carlo simulation of sequence evolution in the R statistical computing environment.
Sipos B, Massingham T, Jordan GE, Goldman N. BMC Bioinformatics. 2011 Apr 19;12:104. doi: 10.1186/1471-2105-12-104. PMID: 21504561.
Analysis of the recombination landscape of hexaploid bread wheat reveals genes controlling recombination and gene conversion frequency.
Gardiner LJ, Wingen LU, Bailey P, Joynson R, Brabbs T, Wright J, Higgins JD, Hall N, Griffiths S, Clavijo BJ, Hall A. Genome Biol. 2019 Apr 15;20(1):69. doi: 10.1186/s13059-019-1675-6. PMID: 30982471.
Putting hornets on the genomic map.
Favreau E, Cini A, Taylor D, Câmara Ferreira F, Bentley MA, Cappa F, Cervo R, Privman E, Schneider J, Thiéry D, Mashoodh R, Wyatt CDR, Brown RL, Bodrug-Schepers A, Stralis-Pavese N, Dohm JC, Mead D, Himmelbauer H, Guigo R, Sumner S. Sci Rep. 2023 Apr 21;13(1):6232. doi: 10.1038/s41598-023-31932-x. PMID: 37085574.
