Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2008 Sep 22:8:255.
doi: 10.1186/1471-2148-8-255.

Non-homogeneous models of sequence evolution in the Bio++ suite of libraries and programs

Affiliations

Non-homogeneous models of sequence evolution in the Bio++ suite of libraries and programs

Julien Dutheil et al. BMC Evol Biol. .

Abstract

Background: Accurately modeling the sequence substitution process is required for the correct estimation of evolutionary parameters, be they phylogenetic relationships, substitution rates or ancestral states; it is also crucial to simulate realistic data sets. Such simulation procedures are needed to estimate the null-distribution of complex statistics, an approach referred to as parametric bootstrapping, and are also used to test the quality of phylogenetic reconstruction programs. It has often been observed that homologous sequences can vary widely in their nucleotide or amino-acid compositions, revealing that sequence evolution has changed importantly among lineages, and may therefore be most appropriately approached through non-homogeneous models. Several programs implementing such models have been developed, but they are limited in their possibilities: only a few particular models are available for likelihood optimization, and data sets cannot be easily generated using the resulting estimated parameters.

Results: We hereby present a general implementation of non-homogeneous models of substitutions. It is available as dedicated classes in the Bio++ libraries and can hence be used in any C++ program. Two programs that use these classes are also presented. The first one, Bio++ Maximum Likelihood (BppML), estimates parameters of any non-homogeneous model and the second one, Bio++ Sequence Generator (BppSeqGen), simulates the evolution of sequences from these models. These programs allow the user to describe non-homogeneous models through a property file with a simple yet powerful syntax, without any programming required.

Conclusion: We show that the general implementation introduced here can accommodate virtually any type of non-homogeneous models of sequence evolution, including heterotachous ones, while being computer efficient. We furthermore illustrate the use of such general models for parametric bootstrapping, using tests of non-homogeneity applied to an already published ribosomal RNA data set.

PubMed Disclaimer

Figures

Figure 1
Figure 1
General non-homogeneous model of substitution. The substitution model depicted here is Tamura's 1992 model of substitution, which contains two parameters: κ, the transitions/transversions ratio and θ the equilibrium G+C content. In the homogeneous case, θ and κ are constant over the tree (case 'a'). In Galtier and Gouy's 1998 model, κ is constant over the tree and one distinct θ is allowed per branch (case 'b'). Between these two extrema lay models with certain branches, but not all, sharing a common value of θ (case 'c'). In the most general case 'd', there are two sets of parameters, one for κ and another for θ, that are shared by the branches of the tree.
Figure 2
Figure 2
Relations between branches, models and parameters. In the general non-homogeneous case, model parameters are shared by different branches across the tree. These parameters are part of branch-specific substitution models, which specify branch-wise probabilities of replacement between states. Branches are here defined according to their rightmost node. The SubstitutionModelSet class stores dependencies between nodes, models and parameters.
Figure 3
Figure 3
Model specification in BppML and BppSeqGen. A general file format is introduced to allow for the user-friendly description of virtually any non-homogeneous model. The tree shows the nodes identifiers, which can be obtained from the programs or defined by the user in its own program. Each case presented here corresponds to a particular model in figure 1, and was labeled accordingly. Each parameter can be fixed to a specific value or optimized with BppML.
Figure 4
Figure 4
Assessing parameter estimation using simulations. Left: boxes show the median and quartiles of the distribution of parameter estimates for 100 simulations. The 'true' value used in the simulation is shown in red. Right: boxes show the distribution of the bias (estimated value – real value), as a function of the (pooled) real values of the branch length. θ*: GC content, ω GC content at root, α : shape of the Gamma distribution of rates across sites, p proportion of invariant sites (see text for details on the model used).
Figure 5
Figure 5
Distributions of the Bowker's test statistics under various models. First column: number of pairwise tests significant at the 5% level. Second column: median of the pairwise statistics. First row: homogeneous model (H). Second row: one theta per branch non-homogeneous model (NH1). Third row: 3 thetas non-homogeneous model (NH2). Fourth row: 5 thetas non-homogeneous model (NH3). Fifth row: one kappa per branch non-homogeneous model (NH4). All models use the Tamura 1992 substitution model with a 4-classes discrete Gamma + invariant rate distribution. The arrows indicate the observed values from the real data set and the resulting p-values.

References

    1. Williams PD, Pollock DD, Blackburne BP, Goldstein RA. Assessing the accuracy of ancestral protein reconstruction methods. PLoS Comput Biol. 2006;2:e69–e69. doi: 10.1371/journal.pcbi.0020069. - DOI - PMC - PubMed
    1. Goldman N. Statistical tests of models of DNA substitution. J Mol Evol. 1993;36:182–198. doi: 10.1007/BF00166252. - DOI - PubMed
    1. Kuhner MK, Felsenstein J. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol. 1994;11:459–468. - PubMed
    1. Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52:696–704. doi: 10.1080/10635150390235520. - DOI - PubMed
    1. Lopez P, Casane D, Philippe H. Heterotachy, an important process of protein evolution. Mol Biol Evol. 2002;19:1–7. - PubMed