Genome Biol. 2001;2(4):RESEARCH0010.

doi: 10.1186/gb-2001-2-4-research0010. Epub 2001 Mar 22.

A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes

R D Knight¹, S J Freeland, L F Landweber

PMID: 11305938
PMCID: PMC31479
DOI: 10.1186/gb-2001-2-4-research0010

R D Knight et al. Genome Biol. 2001.

Affiliation

¹ Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ 08544, USA.

Abstract

Background: Correlations between genome composition (in terms of GC content) and usage of particular codons and amino acids have been widely reported, but poorly explained. We show here that a simple model of processes acting at the nucleotide level explains codon usage across a large sample of species (311 bacteria, 28 archaea and 257 eukaryotes). The model quantitatively predicts responses (slope and intercept of the regression line on genome GC content) of individual codons and amino acids to genome composition.
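The "response" described above is the slope and intercept of a regression of codon (or amino-acid) frequency on genome GC content. As a minimal sketch of that calculation, assuming ordinary least squares and using invented per-species values (the function name `linear_response` and the data are illustrative, not taken from the paper):

```python
def linear_response(gc_contents, frequencies):
    """Ordinary least-squares slope and intercept of frequency on GC content."""
    n = len(gc_contents)
    if n != len(frequencies) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(gc_contents) / n
    mean_y = sum(frequencies) / n
    # Sum of squares of x and cross-products of x and y about their means.
    sxx = sum((x - mean_x) ** 2 for x in gc_contents)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(gc_contents, frequencies))
    if sxx == 0:
        raise ValueError("GC content must vary across species")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: genome GC fraction and the frequency of one codon
# in the coding sequences of five imaginary species.
genome_gc = [0.30, 0.40, 0.50, 0.60, 0.70]
codon_freq = [0.004, 0.009, 0.014, 0.019, 0.024]

slope, intercept = linear_response(genome_gc, codon_freq)
print(f"slope={slope:.4f} intercept={intercept:.4f}")
```

A positive slope here would correspond to a codon whose usage rises with genome GC content; a slope near zero to a codon that does not respond.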

Results: Codons respond to genome composition on the basis of their GC content relative to their synonyms (explaining 71-87% of the variance in response among the different codons, depending on measure). Amino-acid responses are determined by the mean GC content of their codons (explaining 71-79% of the variance). Similar trends hold for genes within a genome. Position-dependent selection for error minimization explains why individual bases respond differently to directional mutation pressure.
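The two composition measures named in the Results can be computed directly from the genetic code. The sketch below assumes the standard genetic code and defines a codon's GC content as the fraction of its three bases that are G or C; the helper names (`codon_gc`, `mean_gc`, `delta_gc`) are illustrative and not from the paper.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon_gc(codon):
    """Fraction of the three bases in a codon that are G or C."""
    return sum(base in "GC" for base in codon) / 3


def mean_gc(amino_acid):
    """Mean GC content over all codons for one amino acid ('*' = stop)."""
    synonyms = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    return sum(codon_gc(c) for c in synonyms) / len(synonyms)


def delta_gc(codon):
    """Codon GC content minus the mean GC content of its synonymous set."""
    return codon_gc(codon) - mean_gc(CODON_TABLE[codon])


for codon in ("CGC", "CGA"):
    print(codon, round(delta_gc(codon), 4))
for aa in ("R", "T"):
    print(aa, round(mean_gc(aa), 4))
```

Under these definitions CGC lies well above the arginine mean while CGA lies slightly below it, and threonine's codons average exactly 50% GC.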

Conclusions: Our model suggests that GC content drives codon usage (rather than the converse). It unifies a large body of empirical evidence concerning relationships between GC content and amino-acid or codon usage in disparate systems. The relationship between GC content and codon and amino-acid usage is ahistorical; it is replicated independently in the three domains of living organisms, reinforcing the idea that genes and genomes at mutation/selection equilibrium reproduce a unique relationship between nucleic acid and protein composition. Thus, the model may be useful in predicting amino-acid or nucleotide sequences in poorly characterized taxa.

Figures

**Figure 1**
Only some codons and amino acids respond to GC content. **(a)** Plot of codon frequency within coding sequences versus total GC content, for the arginine codons CGA (white squares) and CGC (black circles) in bacteria and archaea. Linear regression lines are shown in black for CGC and gray for CGA. **(b)** A similar plot for the amino acids threonine (white squares) and arginine (black circles) in bacteria and archaea. The plots show that whereas CGC and arginine clearly correlate with GC content, CGA and threonine do not. The three relevant parameters for the response, slope, intercept and correlation coefficient, are all highly correlated with each other (see Table 1).

**Figure 2**
Codon and amino-acid responses are determined by their individual GC content. **(a)** Plot of response to GC content (here, the slope of the regression of absolute frequency in coding sequences on genome GC content) versus composition of the 21 codon sets (20 amino acids and termination) for archaea/bacteria (black symbols, thick lines) and eukaryotes (white symbols, thin lines). **(b)** A similar plot for the 64 codons. Note that, of the three measures of response, the slope is the least highly correlated with codon or amino-acid composition (see Table 2). For amino acids the composition is the mean GC content of their codons (a). For codons (b,c) the composition is the difference (ΔGC) between the codon's GC content and the mean GC content for all codons encoding the corresponding amino acid. **(c)** A response-composition plot of the 64 codons showing response within genomes rather than between them, for a bacterium (*Synechocystis*, black symbols, thick line), an archaean (*Archaeoglobus*, gray symbols, gray line), and a eukaryote (*Drosophila*, white symbols, dashed line). The gray line is almost coincident with the thick line; the points are clustered along the abscissa because the structure of the code restricts the possible GC content of the codon sets.

**Figure 3**
The codon response to genome GC content varies with position. A re-plot of GC3 versus GC1 and GC2 from [4], using the additional sequence data now available. Each point represents an organism, classified by domain: archaea, gray; bacteria, black; eukaryotes, white. GC1, diamonds; GC2, squares. Lines are model I least-squares regressions. Where GC3 = 0%, the remaining %GC in position 1 and position 2 is assumed to represent constant sites (that is, those fixed by selection to remain G or C). Similarly, where GC3 = 100%, the remaining %AT in position 1 and position 2 is assumed to represent constant sites where A or T have been fixed.
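GC1, GC2 and GC3 in this caption denote GC content at the first, second and third codon positions. A minimal sketch of how they might be computed for a single in-frame coding sequence follows; the function name `gc_by_position` and the toy sequence are illustrative assumptions, and the paper's per-organism values would aggregate many genes.

```python
def gc_by_position(cds):
    """Return (GC1, GC2, GC3): the G+C fraction at each codon position."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("sequence must be non-empty and a multiple of 3 long")
    n_codons = len(cds) // 3
    counts = [0, 0, 0]
    for index, base in enumerate(cds):
        if base in "GC":
            counts[index % 3] += 1  # index % 3 is the position within the codon
    return tuple(count / n_codons for count in counts)


# Toy in-frame sequence (ATG GCC AAG TAA), not a real gene.
gc1, gc2, gc3 = gc_by_position("ATGGCCAAGTAA")
print(gc1, gc2, gc3)
```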

**Figure 4**
Predicted versus actual responses for sets of codons with identical composition. Each line is the sum of eight codons with the same GC content (by position). Each solid circle is a species. Lines of open circles are the theoretical predictions based on the four-parameter model. **(a)** All-GC (blue) and all-AT (red) codons in prokaryotes. **(b)** Codons with two G or C and one A or T, the minority base being at the first (blue), second (green), or third (red) position. Note that the third-position slope is actually of opposite sign to the first- and second-position slopes. The orange line is what would be expected if there were no position dependence (that is, P(GC)²P(AT) as in [56]). **(c)** As in (b), but for codons with two A or T and one G or C. In this case, the orange line is P(AT)²P(GC). **(d)** As in (c), but for eukaryotes. **(e)** As in (d), but now each point is a randomly chosen gene in *Drosophila*.
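The position-independent expectation mentioned in this caption, P(GC)²P(AT), can be checked by summing the eight codons in a class. The sketch below assumes each base is drawn independently, with G and C equally likely (P(GC)/2 each) and A and T equally likely (P(AT)/2 each); the pattern notation ("S" for G or C, "W" for A or T) and the function name are illustrative. It does not implement the paper's four-parameter model.

```python
from itertools import product


def class_frequency(pattern, p_gc):
    """Summed frequency of the eight codons matching an S/W pattern.

    pattern: three characters, 'S' (G or C) or 'W' (A or T), one per position.
    p_gc:    overall probability that a base is G or C.
    """
    if len(pattern) != 3 or any(symbol not in "SW" for symbol in pattern):
        raise ValueError("pattern must be three characters drawn from 'S'/'W'")
    p_at = 1.0 - p_gc
    base_prob = {"G": p_gc / 2, "C": p_gc / 2, "A": p_at / 2, "T": p_at / 2}
    total = 0.0
    for codon in product("ACGT", repeat=3):
        if all((base in "GC") == (symbol == "S")
               for base, symbol in zip(codon, pattern)):
            total += base_prob[codon[0]] * base_prob[codon[1]] * base_prob[codon[2]]
    return total


p = 0.6
for pattern in ("WSS", "SWS", "SSW"):
    # Without position dependence all three classes have the same expectation.
    print(pattern, class_frequency(pattern, p), p ** 2 * (1 - p))
```

Under these assumptions the three classes with one A or T are indistinguishable, which is the null expectation the orange line represents; the position-specific slopes in the figure are departures from it.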

**Figure 5**
Comparison of predicted versus actual codon responses. Both bacteria/archaea (black) and eukaryotes (white) show a very good fit between the model and the data (in this case, predicted slopes along the x axis and actual slopes along the y axis). In both cases the regression line has a slope of 1 and passes through the origin, indicating that the model is an unbiased predictor of codon usage trends. See Table 4 for other comparisons.

Cited by

Codon usage bias in prokaryotic pyrimidine-ending codons is associated with the degeneracy of the encoded amino acids.
Wald N, Alroy M, Botzman M, Margalit H. Nucleic Acids Res. 2012 Aug;40(15):7074-83. doi: 10.1093/nar/gks348. Epub 2012 May 11. PMID: 22581775.
Investigating the predictability of essential genes across distantly related organisms using an integrative approach.
Deng J, Deng L, Su S, Zhang M, Lin X, Wei L, Minai AA, Hassett DJ, Lu LJ. Nucleic Acids Res. 2011 Feb;39(3):795-807. doi: 10.1093/nar/gkq784. Epub 2010 Sep 24. PMID: 20870748.
Effects of Arbovirus Multi-Host Life Cycles on Dinucleotide and Codon Usage Patterns.
Sexton NR, Ebel GD. Viruses. 2019 Jul 12;11(7):643. doi: 10.3390/v11070643. PMID: 31336898. Review.
The coexistence of the nucleosome positioning code with the genetic code on eukaryotic genomes.
Cohanim AB, Haran TE. Nucleic Acids Res. 2009 Oct;37(19):6466-76. doi: 10.1093/nar/gkp689. Epub 2009 Aug 21. PMID: 19700771.
From local structure to a global framework: recognition of protein folds.
Joseph AP, de Brevern AG. J R Soc Interface. 2014 Apr 16;11(95):20131147. doi: 10.1098/rsif.2013.1147. Print 2014 Jun 6. PMID: 24740960. Review.

References

1. Sueoka N. Compositional correlation between deoxyribonucleic acid and protein. Cold Spring Harb Symp Quant Biol. 1961;26:35–43.
2. CUTG (Codon Usage Tabulated from GenBank). http://www.kazusa.or.jp/codon
3. Sueoka N. On the genetic basis of variation and heterogeneity of DNA base composition. Proc Natl Acad Sci USA. 1962;48:582–592.
4. Sueoka N. Directional mutation pressure and neutral molecular evolution. Proc Natl Acad Sci USA. 1988;85:2653–2657.
5. Kimura M. On the probability of fixation of mutant genes in populations. Genetics. 1962;47:713–719.
