Quantifying the impact of dependent evolution among sites in phylogenetic inference

Chris A Nasrallah¹, David H Mathews, John P Huelsenbeck

Affiliation

¹ Department of Integrative Biology, University of California, Berkeley, 3060 Valley Life Sciences Building #3140, Berkeley, CA 94720-3140, USA. nasrallah@berkeley.edu

PMID: 21081481
PMCID: PMC2997629
DOI: 10.1093/sysbio/syq074

Chris A Nasrallah et al. Syst Biol. 2011 Jan;60(1):60-73. doi: 10.1093/sysbio/syq074. Epub 2010 Nov 15.

Abstract

Nearly all commonly used methods of phylogenetic inference assume that characters in an alignment evolve independently of one another. This assumption is attractive for simplicity and computational tractability but is not biologically reasonable for RNAs and proteins that have secondary and tertiary structures. Here, we simulate RNA and protein-coding DNA sequence data under a general model of dependence in order to assess the robustness of traditional methods of phylogenetic inference to violation of the assumption of independence among sites. We find that the accuracy of independence-assuming methods is reduced by the dependence among sites; for proteins this reduction is relatively mild, but for RNA this reduction may be substantial. We introduce the concept of effective sequence length and its utility for considering information content in phylogenetics.

Figures

**FIGURE 1.**
The four-taxon tree. The ratio of the branch lengths a/b and the total tree length V are the parameters of interest. As a/b becomes large, the inference problem becomes increasingly difficult.

**FIGURE 2.**
Three methods for simulating data under independence. a) Using matrix exponentiation is intractable for dependent data. b) Simulating a character history can be done with context dependency for an entire sequence, but drawing from the stationary distribution at the root node is still problematic. c) Evolve into stationarity by simulating a very long character history before reaching the root, then continuing up the tree as in (b).
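
The approach in panel (c) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Jukes-Cantor-like process with independent sites, and the names `evolve`, `simulate_root`, `rate`, and `burn_in` are hypothetical. A sequence is started at random, evolved for a long burn-in so that the root approximates a draw from the stationary distribution, and then evolved further along a branch.

```python
import random

NUCLEOTIDES = "ACGT"


def evolve(seq, duration, rng, rate=1.0):
    """Simulate a substitution history for `duration` time units.

    Assumption: every site leaves its current state at the same total
    rate `rate` and picks one of the three other nucleotides uniformly
    (a Jukes-Cantor-like, independent-sites process).
    Returns the final sequence and the number of substitutions.
    """
    seq = list(seq)
    n = len(seq)
    t = 0.0
    n_subs = 0
    while True:
        # Waiting time until the next substitution anywhere in the sequence.
        t += rng.expovariate(rate * n)
        if t > duration:
            break
        # All sites share the same rate here, so the site is chosen uniformly.
        site = rng.randrange(n)
        seq[site] = rng.choice([x for x in NUCLEOTIDES if x != seq[site]])
        n_subs += 1
    return "".join(seq), n_subs


def simulate_root(length, burn_in, rng):
    """Start from a random sequence and evolve it for `burn_in` time units."""
    start = "".join(rng.choice(NUCLEOTIDES) for _ in range(length))
    root, _ = evolve(start, burn_in, rng)
    return root


rng = random.Random(1)
root = simulate_root(50, 10.0, rng)
child, n_subs = evolve(root, 0.5, rng)
```

Under a dependence model, the rate of each candidate substitution would depend on the rest of the sequence, so the uniform choice of site and target state would be replaced by a draw weighted by those context-dependent rates.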

**FIGURE 3.**
Energies sampled every 100 substitutions from a continuously evolving sequence. a) Independence among sites. The energies sampled are similar to the energy of the initial, randomly sampled sequence. b) Dependence due to structural constraint. Low energies indicate that the sequences sampled are those that fit the structure. The sequence evolves from a randomly sampled, high-energy starting state toward the low-energy states that correspond to the structure.

**FIGURE 4.**
Number of substitutions at sites of varying constraint. Under independence RNA stem and loop sites experience similar rates of substitutions (a), but under dependence stem sites observe fewer and loop sites observe more substitutions (b). For proteins, under independence, all amino acid sites experience similar rates of substitution (c), whereas under dependence the rate of substitution is inversely proportional to the number of other sites with which the given site is in contact (d).

**FIGURE 5.**
Accuracy of independence-assuming phylogenetic methods in inferring the correct topology from RNA sequences constrained by structure, simulated on a tree of total length V = 1.75. As the level of dependence in the data (z) increases, the methods are increasingly unable to infer the correct topology. This is especially true as the branch length ratio (a/b) becomes large and the problem becomes difficult. Structures: *Bombyx mori* R2 element reverse transcriptase 3' UTR (R2) [300 nucleotides, 400 replicates] and 5S rRNA (5S) [119 nucleotides, 1000 replicates]. Methods: maximum likelihood GTR+Γ (ML), neighbor-joining using ML distances (NJ), parsimony (MP).

**FIGURE 6.**
The total tree length (V) affects the accuracy of ML on simulated RNA sequences constrained by structure (R2). Dependence in the data (bottom curves) reduces the accuracy relative to independence (top curves), and this effect is more pronounced when the underlying tree is larger. a) V = 0.25. b) V = 1.0. c) V = 1.75. Qualitatively similar results were obtained for other independence-assuming methods and levels of dependence.

**FIGURE 7.**
Accuracy of phylogenetic inference by ML on sequences generated under varying levels of dependence due to protein structure constraints: solubility (s) and pairwise interactions (p). Accuracy is reduced when dependence is strong and tree length is large. All panels represent the same tree topology (a/b = 5). Structures: mammalian myoglobin (MYO) [459 nucleotides, 1000 replicates] and 6-hydroxymethyl-7-8-dihydroxypterin pyrophosphokinase (PKA) [474 nucleotides, 500 replicates]. a) MYO, V = 1.3. b) MYO, V = 2.08. c) PKA, V = 1.3. d) PKA, V = 2.08.

**FIGURE 8.**
The effective sequence length (L_e) as a means of quantifying the phylogenetic information content of a sequence that contains dependence. All panels represent a fixed tree length (V = 1.75) and level of dependence (z = 0.1). a) a/b = 0.5. b) a/b = 2. c) a/b = 3. d) a/b = 4. e) a/b = 6. f) a/b = 8. The plotted curves indicate the accuracy of ML on these trees using independent data of varying lengths or the expected accuracy if the data were independent. The accuracy of ML on the simulated RNA sequences (R2, actual length = 300 nucleotides) on each topology is shown by the horizontal lines. Where these horizontal lines cross the curve, they drop to the x-axis to estimate the effective sequence length: the length of independent neutral sequence that displays the same amount of error in estimation that the actual dependence-containing sequence displays.
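
The graphical procedure described in this caption (find where the horizontal line crosses the accuracy curve, then read the length off the x-axis) amounts to inverse interpolation. The following is a minimal sketch under stated assumptions: the accuracy curve is piecewise linear between sampled lengths, and the function name `effective_length` and all numbers in the example are hypothetical illustrations, not values from the paper.

```python
def effective_length(lengths, accuracies, observed_accuracy):
    """Estimate the effective sequence length by inverse linear interpolation.

    lengths:    increasing sequence lengths of independent data
    accuracies: ML accuracy at each length (assumed non-decreasing)
    observed_accuracy: accuracy obtained on the dependent sequences

    Returns the interpolated length at which the independent-data curve
    reaches the observed accuracy, or None if it lies outside the curve.
    """
    points = list(zip(lengths, accuracies))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= observed_accuracy <= y1:
            if y1 == y0:
                # Flat segment: report its left end.
                return float(x0)
            frac = (observed_accuracy - y0) / (y1 - y0)
            return x0 + frac * (x1 - x0)
    return None


# Made-up curve, for illustration only.
curve_lengths = [50, 100, 200, 300]
curve_accuracy = [0.5, 0.625, 0.75, 0.875]
l_e = effective_length(curve_lengths, curve_accuracy, 0.6875)
```

In this made-up example the dependent data perform like 150 independent sites, because 0.6875 lies halfway between the accuracies at lengths 100 and 200.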

**FIGURE 9.**
Accuracy of independence-assuming phylogenetic methods for 22-taxon simulations of RNA constrained by structure (R2; 500 replicates). The Robinson–Foulds distance metric compares the estimated tree to the true tree for data sets under varying levels of dependence (z). For all methods, small amounts of dependence introduce error in tree estimation. a) ML, b) neighbor-joining, c) parsimony.
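
For readers unfamiliar with the metric, the Robinson–Foulds distance counts the nontrivial bipartitions (splits) found in one unrooted tree but not the other. The sketch below is illustrative only and is not taken from the paper: it assumes trees are written as nested tuples of leaf-name strings, and the helper names `leaf_set`, `splits`, and `robinson_foulds` are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the nontrivial bipartitions of the tree, viewed as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always has the same representation
    regardless of where the tuple representation is rooted.
    """
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if ref in below else below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


true_tree = (("A", "B"), ("C", "D"))
wrong_tree = (("A", "C"), ("B", "D"))
rerooted = ("A", ("B", ("C", "D")))
```

Here `robinson_foulds(true_tree, wrong_tree)` is 2, the maximum for four taxa, while `robinson_foulds(true_tree, rerooted)` is 0 because the two tuples describe the same unrooted topology.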
