PLoS Comput Biol. 2010 Aug 19;6(8):e1000885.
doi: 10.1371/journal.pcbi.1000885.

CodonTest: modeling amino acid substitution preferences in coding sequences

Wayne Delport et al. PLoS Comput Biol. 2010.

Abstract

Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have independent rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into rate classes, dependent on the information content of the alignment. However, given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. A GA is used to assign amino acid substitution pairs to a series of rate classes, where the number of classes is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies and branch lengths, are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution and thus gene- and organism-specific multi-rate models that incorporate amino acid substitution biases are preferred. We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, and hence this clustering will be useful for the evolutionary fingerprinting of genes.
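The search space described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the standard genetic code, assumes that only amino acid pairs separated by a single nucleotide change receive a rate, and uses hypothetical helper names (`one_step_pairs`, `random_model`, `mutate`). The likelihood-based fitness evaluation that a real GA would need is left out.

```python
import itertools
import random

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO)
}


def one_step_pairs():
    """Amino acid pairs connected by one nucleotide substitution (assumption)."""
    pairs = set()
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                other = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
                if other != "*" and other != aa:
                    pairs.add(tuple(sorted((aa, other))))
    return sorted(pairs)


def random_model(pairs, n_classes, rng):
    """A candidate model: each amino acid pair is assigned one rate class index."""
    return {pair: rng.randrange(n_classes) for pair in pairs}


def mutate(model, n_classes, rng, rate=0.05):
    """GA-style mutation: reassign each pair to a random class with probability `rate`."""
    child = dict(model)
    for pair in child:
        if rng.random() < rate:
            child[pair] = rng.randrange(n_classes)
    return child


if __name__ == "__main__":
    rng = random.Random(1)
    pairs = one_step_pairs()
    parent = random_model(pairs, n_classes=3, rng=rng)
    child = mutate(parent, n_classes=3, rng=rng)
    changed = sum(parent[p] != child[p] for p in pairs)
    print(len(pairs), "pairs;", changed, "reassigned by mutation")
```

Each candidate model is just a mapping from pairs to class indices, which is why the number of possible models grows combinatorially with the number of pairs and classes.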

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Simulation studies used to derive the appropriate penalty term for the modified BIC criterion.
Each panel plots the difference in log likelihood (formula image) normalized by the logarithm of the sample size (number of characters), between best fitting GA models with formula image and formula image rates (formula image), against the number of sites in the alignment. For simulations with a single rate class we plotted formula image, top right. Figures for multiple rate simulations (2–5 rates) show formula image as black dots (left column) and formula image as blue dots (right column). Values to the right of each row report simulated rates for each class. The left column reflects power, whereas the right column reflects the degree of over-fitting. For the case where a single rate was simulated, the degree of over-fitting is the rate of false positives. The desired behavior for formula image is achieved when the model with formula image rate classes is preferred to models with formula image and formula image rate classes. For a modified BIC criterion formula image with formula image, the former happens if formula image (more definitively with increasing sample size), and the latter if formula image (regardless of sample size).
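The reasoning behind plotting the log-likelihood difference divided by the logarithm of the sample size can be sketched as follows. The exact form of the modified BIC is not recoverable from this page, so the sketch assumes a penalty of the form c · p · log(n), where p is the number of parameters, n the sample size, and c a tunable constant; the function names and numbers are illustrative only.

```python
import math


def modified_bic(log_l, n_params, sample_size, c=1.0):
    """Assumed form: -2 log L + c * p * log(n). c = 1 gives the ordinary BIC."""
    return -2.0 * log_l + c * n_params * math.log(sample_size)


def normalized_gain(log_l_small, log_l_big, sample_size):
    """Log-likelihood improvement divided by log(sample size)."""
    return (log_l_big - log_l_small) / math.log(sample_size)


def prefers_extra_class(log_l_small, log_l_big, sample_size, c=1.0):
    """One extra rate class adds one parameter, so under the assumed
    criterion it is preferred exactly when the normalized gain exceeds c / 2."""
    return normalized_gain(log_l_small, log_l_big, sample_size) > c / 2.0


if __name__ == "__main__":
    # Illustrative numbers, not taken from the paper.
    print(prefers_extra_class(-1000.0, -990.0, sample_size=500))
    print(prefers_extra_class(-1000.0, -999.0, sample_size=500))
```

Under this assumed form, a single horizontal threshold on the normalized difference separates the cases where the extra rate class is accepted from those where it is rejected, whatever the alignment length.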
Figure 2. Evolutionary rate estimation as “curve fitting.”
An example from HIV-1 polymerase gene alignment for which the formula image inferred 7 non-synonymous rate classes. The idealized biological rate distribution (unobservable) is depicted by the dashed line. The goodness of fit, the complexity of the models, and the range of maximum likelihood parameter estimates are listed in the table.
Figure 3. Neighbor-joining trees built from matrices of pairwise substitution spectrum distances (Eq. 2) computed between different models fitted to the HIV-1 group M pol alignment, and between models inferred from different alignments.
Figure 4. Evolutionary rate clusters in structured GA models inferred from the HIV-1 group M pol alignment.
Each cluster is labeled with the maximum likelihood estimate of its rate inferred under formula image. The residues (nodes) are annotated by their biochemical properties and Stanfel class, and the rates (edges) are labeled with model-averaged (formula image) rate estimates. The style of an edge is determined by its cluster affinity, where high cluster affinities indicate that a large proportion of models in the credible set were consistent with the structured formula image model.
Figure 5. Evolutionary rate clusters in structured GA models inferred from the vertebrate rhodopsin protein alignment.
Each cluster is labeled with the maximum likelihood estimate of its rate inferred under formula image. The residues (nodes) are annotated by their biochemical properties and Stanfel class, and the rates (edges) are labeled with model-averaged (formula image) rate estimates. The style of an edge is determined by its cluster affinity, where high cluster affinities indicate that a large proportion of models in the credible set were consistent with the structured formula image model.
Figure 6. Correlations of lower substitution rates and property preservation in the HIV-1 group M pol alignment.
Model-averaged formula image rates were stratified by whether or not they involved a change in polarity, charge or Stanfel class, and the medians of the two rate distributions were compared using a one-sided Wilcoxon rank-sum test. We further correlated the magnitude of substitution rates with one of five property-based distances between the corresponding residues (defined in [18]) using a one-sided (negative correlation) Pearson product-moment correlation test.
Figure 7. Correlations of lower substitution rates and property preservation in the vertebrate rhodopsin alignment.
Model-averaged formula image rates were stratified by whether or not they involved a change in polarity, charge or Stanfel class, and the medians of the two rate distributions were compared using a one-sided Wilcoxon rank-sum test. We further correlated the magnitude of substitution rates with one of five property-based distances between the corresponding residues (defined in [18]) using a one-sided (negative correlation) Pearson product-moment correlation test.
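The two tests named in the captions for Figures 6 and 7 can be sketched with the Python standard library. This is a rough illustration under stated assumptions, not the authors' analysis: both p-values use normal approximations (no tie or continuity correction for the rank-sum test, the Fisher z-transform for the correlation), the function names are hypothetical, and the example rates and distances are invented.

```python
import math
from statistics import NormalDist


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def rank_sum_less(x, y):
    """One-sided rank-sum test that x tends to be smaller than y.
    Returns (U statistic for x, approximate p-value)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return u, NormalDist().cdf((u - n1 * n2 / 2.0) / sd)


def pearson_negative(x, y):
    """Pearson r and an approximate one-sided p-value for r < 0.
    Requires more than 3 points and |r| < 1."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r, NormalDist().cdf(math.atanh(r) * math.sqrt(len(x) - 3))


if __name__ == "__main__":
    # Invented rates: substitutions that change a property vs. those that preserve it.
    changing = [0.10, 0.20, 0.15, 0.30]
    preserving = [0.50, 0.70, 0.60, 0.90]
    print(rank_sum_less(changing, preserving))
    # Invented property distances paired with substitution rates.
    print(pearson_negative([1, 2, 3, 4, 5], [0.9, 0.7, 0.8, 0.3, 0.2]))
```

For small samples or heavily tied data, exact tests from a statistics package would be the safer choice than these approximations.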

References

    1. Felsenstein J. Evolutionary trees from DNA-sequences – a maximum-likelihood approach. J Mol Evol. 1981;17:368–376.
    2. Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994;11:715–724.
    3. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994;11:725–736.
    4. Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol. 2009;26:255–271.
    5. Delport W, Scheffler K, Seoighe C. Models of coding sequence evolution. Brief Bioinform. 2009;10:97–109.
