Comparative Study
BMC Evol Biol. 2006 May 31;6:43.
doi: 10.1186/1471-2148-6-43.

A protein evolution model with independent sites that reproduces site-specific amino acid distributions from the Protein Data Bank

Ugo Bastolla et al. BMC Evol Biol. 2006.

Abstract

Background: Since thermodynamic stability is a global property of proteins that has to be conserved during evolution, the selective pressure at a given site of a protein sequence depends on the amino acids present at other sites. However, models of molecular evolution that aim at reconstructing the evolutionary history of macromolecules become computationally intractable if such correlations between sites are explicitly taken into account.

Results: We introduce an evolutionary model with sites evolving independently under a global constraint on the conservation of structural stability. This model consists of a selection process, which depends on two hydrophobicity parameters that can be computed from protein sequences without any fit, and a mutation process for which we consider various models. It reproduces quantitatively the results of Structurally Constrained Neutral (SCN) simulations of protein evolution in which the stability of the native state is explicitly computed and conserved. We then compare the predicted site-specific amino acid distributions with those sampled from the Protein Data Bank (PDB). The parameters of the mutation model, whose number varies between zero and five, are fitted from the data. The mean correlation coefficient between predicted and observed site-specific amino acid distributions is larger than <r> = 0.70 for a mutation model with no free parameters and no genetic code. In contrast, considering only the mutation process with no selection yields a mean correlation coefficient of <r> = 0.56 with three fitted parameters. The mutation model that best fits the data takes into account increased mutation rate at CpG dinucleotides, yielding <r> = 0.90 with five parameters.

Conclusion: The effective selection process that we propose reproduces well amino acid distributions as observed in the protein sequences in the PDB. Its simplicity makes it very promising for likelihood calculations in phylogenetic studies. Interestingly, in this approach the mutation process influences the effective selection process, i.e. selection and mutation must be entangled in order to obtain effectively independent sites. This interdependence between mutation and selection reflects the deep influence that mutation has on the evolutionary process: The bias in the mutation influences the thermodynamic properties of the evolving proteins, in agreement with comparative studies of bacterial proteomes, and it also influences the rate of accepted mutations.
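The mean correlation coefficient <r> quoted in the Results can be illustrated with a minimal sketch. It assumes that <r> is a Pearson correlation computed over the amino acids at each site (or site class) and then averaged over sites; the abstract does not spell out the averaging, so this is an assumption, and the function names `pearson_r` and `mean_site_correlation` are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def mean_site_correlation(predicted, observed):
    """Average the per-site correlation between predicted and observed
    amino acid distributions (assumed definition of <r>)."""
    rs = [pearson_r(p, o) for p, o in zip(predicted, observed)]
    return sum(rs) / len(rs)
```

Each element of `predicted` and `observed` would be one site's frequencies listed in the same amino acid order.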

Figures

Figure 1
Correlation coefficient between the average HP and the PE for SCN simulations with various mutation models yielding different GC biases, for three single-domain proteins, lysozyme (PDB code 3lzt, circles), phosphocarrier protein Hpr (PDB code 1opd, diamonds), and myoglobin (PDB code 1a6g, squares), and for the small two-domain protein ATP synthase ε unit (ATPE, PDB code 1aqt, triangles).
Figure 2
Comparison of the site-specific amino acid distribution πi(a) obtained from simulations of the SCN model for ATPE (PDB code 1aqt, full symbols) and from the mean-field model (lines and open symbols) at site i = 128 with ci/<c> = 0.43 [(a) and (b)] and at site i = 82 with ci/<c> = 1.55 [(c) and (d)]. The upper panels (a) and (c) show the case of high GC mutational bias, whereas the lower ones (b) and (d) show low GC mutational bias.
Figure 3
Acceptance probability for a mutation, Pacc, calculated in SCN simulations for ATPE (PDB code 1aqt, symbols) and in the mean-field model (lines) for three different values of the transition-to-transversion ratio k, as a function of the GC content, f(C) + f(G). The mutation model is such that P(C) = P(G) and P(T) = P(A), assuming the type 2 parity rule [54].
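To make the GC-content axis concrete, the sketch below computes amino acid frequencies expected under mutation alone when P(C) = P(G) and P(T) = P(A). It is a simplified illustration, not the mutation model of the paper: it assumes the three codon positions are independent, uses the standard genetic code, discards stop codons, and ignores the transition-to-transversion ratio k. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def nucleotide_frequencies(gc):
    """Nucleotide frequencies with f(C) = f(G) and f(A) = f(T)."""
    if not 0.0 < gc < 1.0:
        raise ValueError("gc must lie strictly between 0 and 1")
    return {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}


def amino_acid_frequencies(gc):
    """Amino acid frequencies under mutation alone, assuming independent
    codon positions (an assumption of this sketch) and no stop codons."""
    f = nucleotide_frequencies(gc)
    w = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # skip stop codons
            continue
        w[aa] = w.get(aa, 0.0) + f[codon[0]] * f[codon[1]] * f[codon[2]]
    total = sum(w.values())
    return {aa: p / total for aa, p in w.items()}
```

In this sketch, raising `gc` shifts weight toward amino acids with GC-rich codons such as glycine and alanine, and away from those with AT-rich codons such as lysine.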
Figure 4
Full symbols and lines indicate average properties of protein folding thermodynamics in SCN simulations; open symbols indicate the same quantities in the proteomes of different bacterial species [39]. The horizontal axis represents the GC mutation bias for SCN simulations and the GC content at the third codon position of the bacterial genes. (a) Mean hydrophobicity. SCN results are rescaled by a factor 8.6 and correspond to three single-domain proteins, lysozyme (PDB code 3lzt, circles), phosphocarrier protein Hpr (PDB code 1opd, diamonds), and myoglobin (PDB code 1a6g, squares), and to the small two-domain protein ATP synthase ε unit (ATPE, PDB code 1aqt, triangles). (b) Mean unfolding free energy. SCN results are rescaled by a factor 4.3. Only ATPE is represented, the other proteins being qualitatively equivalent. (c) Mean normalized energy gap. SCN results are rescaled by a factor 1.3. Only ATPE is represented, the other proteins being qualitatively equivalent.
Figure 5
Distributions of hydrophobicity-related quantities from a non-redundant subset of the PDB: mean hydrophobicity; root mean square hydrophobicity; ratio between mean and root mean square hydrophobicity, $\tau = \langle h \rangle / \sqrt{\langle h^2 \rangle}$; and ratio between $\tau$ and $\sqrt{W_1} = \sqrt{N}\,\langle c \rangle$. All these quantities are narrowly distributed, with standard deviations of less than about 1/10 of the average value.
Figure 6
Observed and predicted site-specific amino acid distribution $\pi_{c_i/\langle c \rangle}(a)$, divided by the expected frequencies under mutation alone $w_{AA}(a)$, for (a) mutation model 1 ('opt. freq. at β ≡ 0'), (b) mutation model 2 ('constant'), (c) mutation model 4 ('opt. freq.'), and (d) mutation model 5 ('CpG'). For the theoretical models where mutation satisfies detailed balance, $\pi_{c_i}(a)/w_{AA}(a) \propto \exp[-\beta_i h(a)]$, so that the slope of the plot represents $\beta_i$ at this site class. For illustration, the site class with $c_i/\langle c \rangle \in [0.435, 0.545]$ was selected. Full symbols show the observed distributions obtained from sequences in the PDB, whereas the open symbols and the lines display the mean-field model.
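The exponential relation in this caption, in which the site distribution divided by the mutation-alone frequencies is proportional to exp[-βi h(a)], can be sketched as follows. The code builds a normalized distribution from given mutation frequencies w(a), hydrophobicities h(a), and a site parameter β, and then recovers β as minus the least-squares slope of log(π(a)/w(a)) against h(a). The function names and any numerical values passed in are hypothetical; the hydrophobicity scale and the way the paper determines βi are not reproduced here.

```python
import math


def site_distribution(w, h, beta):
    """Normalized distribution pi(a) proportional to w(a) * exp(-beta * h(a))."""
    weights = {a: w[a] * math.exp(-beta * h[a]) for a in w}
    total = sum(weights.values())
    return {a: weights[a] / total for a in weights}


def fit_beta(pi, w, h):
    """Estimate beta as minus the least-squares slope of
    log(pi(a) / w(a)) versus h(a)."""
    xs = [h[a] for a in pi]
    ys = [math.log(pi[a] / w[a]) for a in pi]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx
```

For example, with toy inputs (not values from the paper) such as `w = {"A": 0.3, "L": 0.3, "K": 0.2, "G": 0.2}` and `h = {"A": 0.1, "L": 0.9, "K": -0.8, "G": 0.0}`, calling `fit_beta(site_distribution(w, h, 1.5), w, h)` recovers the input β up to rounding, because the logarithm of the ratio is exactly linear in h(a).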
Figure 7
Site-specific amino acid frequencies sampled from the PDB, $\pi^{\mathrm{obs}}_{c_i/\langle c \rangle}(a)$, versus the probabilities $\pi^{\mathrm{pred}}_{c_i/\langle c \rangle}(a)$ predicted through the mean-field model with optimal mutation parameters. All amino acids and all sites are shown. Observed and predicted frequencies are divided by the frequencies expected under mutation alone, $w_{AA}(a)$. The four frames refer to (a) mutation model 2 ('constant'), (b) mutation model 3 ('#codons'), (c) mutation model 4 ('opt. freq.'), and (d) mutation model 5 ('CpG'), respectively.

References

    1. Nei M, Kumar S. Molecular evolution and phylogenetics. Oxford Univ. Press; 2000.
    2. Graur D, Li WH. Fundamentals of molecular evolution. Sinauer, Sunderland; 2000.
    3. Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol. 1981;17:368–376. doi: 10.1007/BF01734359.
    4. Lockless SW, Ranganathan R. Evolutionarily Conserved Pathways of Energetic Connectivity in Protein Families. Science. 1999;286:295–299. doi: 10.1126/science.286.5438.295.
    5. Socolich M, Lockless SW, Russ WP, Lee H, Gardner KH, Ranganathan R. Evolutionary information for specifying a protein fold. Nature. 2005;437:512–518. doi: 10.1038/nature03991.