Syst Biol. 2011 Jan;60(1):60-73.
doi: 10.1093/sysbio/syq074. Epub 2010 Nov 15.

Quantifying the impact of dependent evolution among sites in phylogenetic inference

Chris A Nasrallah et al. Syst Biol. 2011 Jan.

Abstract

Nearly all commonly used methods of phylogenetic inference assume that characters in an alignment evolve independently of one another. This assumption is attractive for simplicity and computational tractability but is not biologically reasonable for RNAs and proteins that have secondary and tertiary structures. Here, we simulate RNA and protein-coding DNA sequence data under a general model of dependence in order to assess the robustness of traditional methods of phylogenetic inference to violation of the assumption of independence among sites. We find that the accuracy of independence-assuming methods is reduced by the dependence among sites; for proteins this reduction is relatively mild, but for RNA this reduction may be substantial. We introduce the concept of effective sequence length and its utility for considering information content in phylogenetics.

Figures

FIGURE 1.
The four-taxon tree. The ratio of the branch lengths a/b and the total tree length V are the parameters of interest. As a/b becomes large, the inference problem becomes increasingly difficult.
FIGURE 2.
Three methods for simulating data under independence. a) Using matrix exponentiation is intractable for dependent data. b) Simulating a character history can be done with context dependency for an entire sequence, but drawing from the stationary distribution at the root node is still problematic. c) Evolve into stationarity by simulating a very long character history before reaching the root, then continuing up the tree as in (b).
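The approach in panel (c) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the `evolve` helper, the toy stem in `PAIRS`, the `toy_rate` function, and the value of `z` are hypothetical and are not the authors' energy-based model. The sketch shows only the general idea: substitution rates may depend on the whole sequence, and a long run from a random start stands in for drawing the root from the stationary distribution.

```python
import math
import random

BASES = "ACGU"

def evolve(seq, duration, rate_fn, rng):
    """Gillespie-style simulation of a sequence for `duration` time units.

    rate_fn(seq, site, base) returns the rate of replacing seq[site] with
    base. It sees the whole sequence, which is how dependence enters.
    """
    seq = list(seq)
    t = 0.0
    while True:
        # Enumerate every possible single-site substitution and its rate.
        moves = [(i, b) for i in range(len(seq)) for b in BASES if b != seq[i]]
        rates = [rate_fn(seq, i, b) for i, b in moves]
        total = sum(rates)
        if total <= 0:
            break
        # Waiting time to the next substitution is exponential.
        t += rng.expovariate(total)
        if t > duration:
            break
        i, b = rng.choices(moves, weights=rates, k=1)[0]
        seq[i] = b
    return "".join(seq)

# Hypothetical toy structure: sites 0-2 pair with sites 7-5.
PAIRS = [(0, 7), (1, 6), (2, 5)]
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def toy_rate(seq, site, base, z=2.0):
    """Hypothetical rate: breaking a pair is slowed, forming one is sped up."""
    for i, j in PAIRS:
        if site in (i, j):
            partner = seq[j] if site == i else seq[i]
            before = (seq[site], partner) in WATSON_CRICK
            after = (base, partner) in WATSON_CRICK
            return math.exp(z * (after - before))
    return 1.0  # unpaired sites evolve at a constant rate

rng = random.Random(0)
start = "".join(rng.choice(BASES) for _ in range(8))
root = evolve(start, duration=50.0, rate_fn=toy_rate, rng=rng)  # long burn-in
child = evolve(root, duration=0.1, rate_fn=toy_rate, rng=rng)   # one branch
```

The burn-in duration of 50 time units is arbitrary; in practice one would check that a summary statistic such as the energy has stopped drifting before treating the sequence as the root.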
FIGURE 3.
Energies sampled every 100 substitutions from a continuously evolving sequence. a) Independence among sites. Energies sampled are similar to that initially sampled at random. b) Dependence due to structural constraint. Low energies indicate that sequences sampled are those that fit the structure. The sequence evolves from a randomly sampled starting state of high energy to sample those states of low energy that correspond to the structure.
FIGURE 4.
Number of substitutions at sites of varying constraint. Under independence RNA stem and loop sites experience similar rates of substitutions (a), but under dependence stem sites observe fewer and loop sites observe more substitutions (b). For proteins, under independence, all amino acid sites experience similar rates of substitution (c), whereas under dependence the rate of substitution is inversely proportional to the number of other sites with which the given site is in contact (d).
FIGURE 5.
The accuracy of independence-assuming phylogenetic methods in inferring the correct topology from RNA sequences constrained by structure, simulated on a tree of total length V = 1.75. As the level of dependence in the data (z) increases, the methods are increasingly unable to infer the correct topology. This is especially true as the branch length ratio (a/b) becomes large and the problem becomes difficult. Structures: Bombyx mori R2 element reverse transcriptase 3' UTR (R2) [300 nucleotides, 400 replicates] and 5S rRNA (5S) [119 nucleotides, 1000 replicates]. Methods: maximum likelihood GTR+Γ (ML), neighbor-joining using ML distances (NJ), parsimony (MP).
FIGURE 6.
The total tree length (V) affects the accuracy of ML on simulated RNA sequences constrained by structure (R2). Dependence in the data (bottom curves) reduces the accuracy relative to independence (top curves), and this effect is more pronounced when the underlying tree is larger. a) V = 0.25. b) V = 1.0. c) V = 1.75. Qualitatively similar results were obtained for other independence-assuming methods and levels of dependence.
FIGURE 7.
Accuracy of ML phylogenetic inference on sequences generated under varying levels of dependence due to protein structure constraints: solubility (s) and pairwise interactions (p). Accuracy is reduced when dependence is strong and tree length is large. All panels represent the same tree topology (a/b = 5). Structures: mammalian myoglobin (MYO) [459 nucleotides, 1000 replicates] and 6-hydroxymethyl-7-8-dihydroxypterin pyrophosphokinase (PKA) [474 nucleotides, 500 replicates]. a) MYO, V = 1.3. b) MYO, V = 2.08. c) PKA, V = 1.3. d) PKA, V = 2.08.
FIGURE 8.
The effective sequence length (Le) as a means of quantifying the phylogenetic information content of a sequence that contains dependence. All panels represent a fixed tree length (V = 1.75) and level of dependence (z = 0.1). a) a/b = 0.5. b) a/b = 2. c) a/b = 3. d) a/b = 4. e) a/b = 6. f) a/b = 8. The plotted curves indicate the accuracy of ML on these trees using independent data of varying lengths or the expected accuracy if the data were independent. The accuracy of ML on the simulated RNA sequences (R2, actual length = 300 nucleotides) on each topology is shown by the horizontal lines. Where these horizontal lines cross the curve, they drop to the x-axis to estimate the effective sequence length: the length of independent neutral sequence that displays the same amount of error in estimation that the actual dependence-containing sequence displays.
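The graphical procedure described in this caption (find where the observed accuracy crosses the independent-data curve, then read off the length) amounts to inverse interpolation. The following is a minimal sketch, assuming the calibration curve is available as paired lists and is strictly increasing; the function name `effective_length` and the numbers in the usage example are hypothetical and are not values from the paper.

```python
from bisect import bisect_left

def effective_length(lengths, accuracies, observed_accuracy):
    """Length of independent sequence whose accuracy matches observed_accuracy.

    lengths, accuracies: calibration curve from independent data, with
    accuracies strictly increasing in length (an assumption of this sketch).
    Uses linear interpolation between the two neighboring calibration points.
    """
    if len(lengths) != len(accuracies) or len(lengths) < 2:
        raise ValueError("need at least two matching calibration points")
    if any(b <= a for a, b in zip(accuracies, accuracies[1:])):
        raise ValueError("accuracy curve must be strictly increasing")
    if not accuracies[0] <= observed_accuracy <= accuracies[-1]:
        raise ValueError("observed accuracy lies outside the calibration curve")
    i = bisect_left(accuracies, observed_accuracy)
    if i == 0:
        return float(lengths[0])
    a0, a1 = accuracies[i - 1], accuracies[i]
    l0, l1 = lengths[i - 1], lengths[i]
    return l0 + (l1 - l0) * (observed_accuracy - a0) / (a1 - a0)

# Hypothetical calibration points, for illustration only.
lengths = [50, 100, 200, 300]
accuracies = [0.50, 0.60, 0.80, 0.90]
le = effective_length(lengths, accuracies, 0.70)  # about 150
```

In this made-up example a dependent sequence with an actual length of 300 and an accuracy of 0.70 would carry roughly the information of 150 independent sites. Linear interpolation is a simplification; a smoother fit to the calibration curve may be preferable when the points are sparse.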
FIGURE 9.
Accuracy of independence-assuming phylogenetic methods for 22-taxon simulations of RNA constrained by structure (R2; 500 replicates). The Robinson–Foulds distance metric compares the estimated tree to the true tree for data sets under varying levels of dependence (z). For all methods, small amounts of dependence introduce error in tree estimation. a) ML, b) neighbor-joining, c) parsimony.
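For readers unfamiliar with the tree distance used in this figure, here is a minimal sketch of the unnormalized form: the number of nontrivial bipartitions (splits) present in one tree but not the other. Representing trees as nested tuples of leaf names and the helper names `splits` and `tree_distance` are assumptions of this sketch; the software and normalization used in the paper are not stated in this record.

```python
def splits(tree, taxa):
    """Nontrivial bipartitions of a tree given as nested tuples of leaf names.

    Each split is stored as the side that does not contain a fixed reference
    taxon, so the same split always has the same representation.
    """
    taxa = frozenset(taxa)
    reference = min(taxa)
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        # Skip trivial splits (single leaf versus the rest) and the root.
        if 1 < len(below) < len(taxa) - 1:
            found.add(below if reference not in below else taxa - below)
        return below

    leaves(tree)
    return found

def tree_distance(tree1, tree2, taxa):
    """Count splits found in exactly one of the two trees."""
    return len(splits(tree1, taxa) ^ splits(tree2, taxa))

taxa = ["A", "B", "C", "D", "E"]
true_tree = (("A", "B"), ("C", ("D", "E")))
estimate = (("A", "C"), ("B", ("D", "E")))
d = tree_distance(true_tree, estimate, taxa)  # 2: one split differs in each tree
```

A distance of 0 means the two unrooted topologies are identical; for fully resolved trees on n taxa the maximum is 2(n − 3).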
