Mol Biol Evol. 2014 Jul;31(7):1880-93.
doi: 10.1093/molbev/msu126. Epub 2014 Apr 7.

Quantifying position-dependent codon usage bias

Adam J Hockenberry et al. Mol Biol Evol. 2014 Jul.

Abstract

Although the mapping of codon to amino acid is conserved across nearly all species, the frequency at which synonymous codons are used varies both between organisms and between genes from the same organism. This variation affects diverse cellular processes including protein expression, regulation, and folding. Here, we mathematically model an additional layer of complexity and show that individual codon usage biases follow a position-dependent exponential decay model with unique parameter fits for each codon. We use this methodology to perform an in-depth analysis of codon usage bias in the model organism Escherichia coli. Our methodology shows that lowly and highly expressed genes are more similar in their codon usage patterns in the 5′-gene regions, but that these preferences diverge at distal sites, resulting in greater positional dependency (pD, which we mathematically define later) for highly expressed genes. We show that position-dependent codon usage bias is partially explained by the structural requirements of mRNAs, which result in increased usage of A/T-rich codons shortly after the gene start. However, we also show that the pD of 4- and 6-fold degenerate codons is partially related to the gene copy number of cognate tRNAs, supporting existing hypotheses that posit benefits to a region of slow translation at the beginning of coding sequences. Lastly, we demonstrate that viewing codon usage bias through a position-dependent framework has practical utility: incorporating positional dependencies into the Codon Adaptation Index model improves the accuracy of gene expression prediction.

Keywords: coding sequence evolution; codon adaptation; codon usage bias; gene expression.

Figures

Fig. 1.
Codon usage bias is not uniform with regard to intragenic position. (A) This cartoon schematic shows one codon that is used evenly throughout the toy gene set (codon a, blue) and one codon that is not (codon b, orange). To statistically verify this, we align all genes at the 5′-region, group each codon into position-dependent bins, compare codon usage in each bin to random expectation, and sum the deviations over all bins. (B) Squared z scores of codon usage for Escherichia coli as a function of position. Codons on the y axis are grouped according to the amino acid they code for and are labeled red if their usage bias is significantly nonuniform (formula image). Results for each bin are depicted according to the quadratically scaled color bar, and the ten bins are arranged from 5′ to 3′.
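The binning procedure in panel (A) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the caption does not specify the null model, so the sketch assumes a binomial expectation in which each codon is drawn with its overall frequency among its synonyms. The two-amino-acid table `SYNONYMS`, the function names, and the bin parameters are hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy table covering two amino acids only (illustration).
SYNONYMS = {"F": ["TTT", "TTC"], "K": ["AAA", "AAG"]}
CODON_TO_AA = {c: aa for aa, codons in SYNONYMS.items() for c in codons}


def split_codons(seq):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def squared_z_by_bin(genes, bin_size, n_bins):
    """Return {codon: [z^2 for each bin]} with genes aligned at the 5' end.

    Assumed null: within each bin, a codon's count is binomial with
    n = occurrences of its amino acid in the bin and
    p = the codon's overall share of that amino acid across all bins.
    """
    codon_counts = [Counter() for _ in range(n_bins)]
    aa_counts = [Counter() for _ in range(n_bins)]
    for gene in genes:
        for pos, codon in enumerate(split_codons(gene)):
            b = pos // bin_size
            if b >= n_bins or codon not in CODON_TO_AA:
                continue
            codon_counts[b][codon] += 1
            aa_counts[b][CODON_TO_AA[codon]] += 1

    total_codon = sum(codon_counts, Counter())
    total_aa = sum(aa_counts, Counter())

    result = {}
    for codon, aa in CODON_TO_AA.items():
        if total_aa[aa] == 0:
            continue  # amino acid never observed
        p = total_codon[codon] / total_aa[aa]
        scores = []
        for b in range(n_bins):
            n = aa_counts[b][aa]
            variance = n * p * (1 - p)
            if variance == 0:
                scores.append(0.0)  # no data or no variation possible
            else:
                z = (codon_counts[b][codon] - n * p) / sqrt(variance)
                scores.append(z * z)
        result[codon] = scores
    return result


if __name__ == "__main__":
    # Toy genes in which TTT is confined to the first bin.
    toy = ["TTTTTTTTCTTC"] * 10
    print(squared_z_by_bin(toy, bin_size=2, n_bins=2))
```

Summing a codon's squared z scores over all bins gives a single deviation score per codon, in the spirit of "sum the deviations over all bins".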
Fig. 2.
The functional form of codon usage bias. (A) For the amino acid phenylalanine, we show the conditional probability of observing a codon as a function of position (black line, smoothed with a sliding window of eight codons). We also show the best-fitting exponential model (red) with corresponding 95% confidence intervals (pink) and the uniform model (cyan, confidence intervals not shown for clarity). The survival curve of Escherichia coli gene lengths is highlighted at the top to illustrate the basis for increasingly wide confidence intervals due to data sparseness at distal sites. (B) Data for three different 2-fold redundant amino acids as in (A) but with the x axis extending only to 100 codons to highlight heterogeneity in the 5′ region.
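As a rough illustration of the black curve in panel (A), the following Python sketch computes the conditional probability of a codon given its amino acid at each position and then smooths it with an eight-codon window. The function names and the toy gene set are hypothetical, and the caption does not say how the window is aligned, so the sketch simply averages each run of eight consecutive positions.

```python
def split_codons(seq):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def positional_probability(genes, codon, synonyms, max_pos):
    """P(codon | amino acid) at each codon position, genes aligned at the 5' end.

    `synonyms` lists every codon for the amino acid of interest.
    Positions where the amino acid never occurs yield None.
    """
    family = set(synonyms)
    hits = [0] * max_pos
    totals = [0] * max_pos
    for gene in genes:
        for pos, c in enumerate(split_codons(gene)[:max_pos]):
            if c in family:
                totals[pos] += 1
                if c == codon:
                    hits[pos] += 1
    return [h / t if t else None for h, t in zip(hits, totals)]


def sliding_mean(values, window=8):
    """Mean over each run of `window` consecutive positions, ignoring None."""
    smoothed = []
    for start in range(len(values) - window + 1):
        chunk = [v for v in values[start:start + window] if v is not None]
        smoothed.append(sum(chunk) / len(chunk) if chunk else None)
    return smoothed


if __name__ == "__main__":
    # Toy data: three genes use TTT throughout, one uses TTC throughout.
    toy = ["TTT" * 10, "TTC" * 10, "TTT" * 10, "TTT" * 10]
    raw = positional_probability(toy, "TTT", ["TTT", "TTC"], max_pos=10)
    print(sliding_mean(raw, window=8))
```

Because fewer genes reach distal positions, the per-position totals shrink with distance from the start codon, which is the data sparseness the survival curve is meant to convey.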
Fig. 3.
The effect of gene expression on position-dependent codon usage bias. (A) Illustration of the pD metric and exponential parameters. (B) pD of codons in the genes of low- and high-abundance proteins split according to codon prevalence (top) and third position base (bottom). We observe a significant difference in absolute pD of the codons between the two gene sets and differences within each gene set according to rare and abundant codons. Within gene sets, we also observed significant differences in pD between codons that end in A/T versus those that end in G/C. (C) For each codon, we took the absolute difference in codon probabilities between the low- and high-abundance protein data sets and did so at two different points, the beginning of sequences and the median. Shown are the cumulative distributions of these differences.
Fig. 4.
The link between codon usage bias and mRNA structure. (A) We folded a 200mer (−50 to +150 nt, relative to the start codon) region for each gene in the high abundance protein set and extracted the individual base pair probabilities. For clarity, we illustrate median pair probabilities relative to the null model created by synonymous shuffling within genes (green). Actual genes (blue) and an alternative gene set created by shuffling synonymous codons between genes in a manner that preserves positional biases (red) have significantly less structure in the 5′ region (Wilcoxon rank-sum test on raw data, formula image for all cases illustrated). (B) We calculated the effect on folding energy of single synonymous codon substitutions in the genes of high abundance proteins. Left: The effect of substitutions in the 5′ region (−36 to +36 nt, relative to the start codon) is variable depending on the nature of the codon. Right: The same analysis for a region distal to the start codon (+36 to 108 nt). For all cases illustrated, error bars represent standard error of the mean and formula image according to Wilcoxon rank-sum test.
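The within-gene null model in panel (A) relies on shuffling synonymous codons, which can be sketched as below. This is a minimal illustration and not the authors' code: the toy `CODON_TO_AA` table and the function name are hypothetical, and details such as how the start codon is handled are not given in the caption. Codons missing from the table are left in place.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy table (illustration only); a real analysis would use
# the full genetic code.
CODON_TO_AA = {"TTT": "F", "TTC": "F", "AAA": "K", "AAG": "K", "ATG": "M"}


def shuffle_synonymous(codons, codon_to_aa, rng):
    """Permute codons among the positions that encode the same amino acid.

    The protein sequence and the gene's overall codon counts are preserved;
    only the positions of synonymous codons change.
    """
    positions = defaultdict(list)
    for i, codon in enumerate(codons):
        # Unknown codons form their own group and therefore never move.
        positions[codon_to_aa.get(codon, codon)].append(i)

    shuffled = list(codons)
    for indices in positions.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return shuffled


if __name__ == "__main__":
    gene = ["ATG", "TTT", "AAA", "TTC", "AAG", "TTT", "AAA"]
    rng = random.Random(0)  # fixed seed so the shuffle is reproducible
    null_gene = shuffle_synonymous(gene, CODON_TO_AA, rng)
    print(null_gene)
    # Same codon counts before and after shuffling.
    print(Counter(gene) == Counter(null_gene))
```

Shuffling within a gene erases any positional preference while keeping that gene's codon composition, which is why it serves as a baseline for the structure comparison.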
Fig. 5.
pD in codon groups and its association with cognate-tRNA gene copy number. For all 4-fold redundant amino acids, we group codons into separate sets under the assumption that single tRNA species are more likely to read codons within these groupings according to wobble-base pairing than between groupings. We illustrate conditional probabilities as in figure 2 and highlight the gene copy number of the cognate tRNAs for each group (tRNAGCN) to show that codons read by the rarer tRNAs are enriched in the 5′ region.
Fig. 6.
Accounting for position-dependent codon usage leads to superior estimates of gene expression levels. (A) Our model posits that selection for reduced mRNA structure around the start codon acts strongly on all sequences relative to disruptive processes such as genetic drift and mutational biases. However, preference for accurate and efficient translation is a second and weaker effect that is largely apparent in highly expressed genes and becomes stronger at distal sites. (B) Rather than calculating the CAI for each gene, we aligned genes at the start codon and calculated the CAI score for each position in either the reference set or genome. The dip in adaptedness after the start codon for both data sets (blue) is corrected by using exponential fits to the codon usage in the reference set (red). (C) For two data sets of transcript abundances (Taniguchi et al. 2010; Shiroguchi et al. 2012) and two data sets of protein abundances (Lu et al. 2007; Taniguchi et al. 2010), we show that the R² correlation coefficient between the CAI and gene expression data is increased when using exponential fits to calculate the CAI as opposed to the traditional uniform assumption. Top, raw values; bottom, % increase. Error bars show standard deviation from 10,000 bootstrap resampled sets (paired t-test, formula image for all cases).
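To make the contrast between the uniform and position-dependent CAI concrete, here is a hedged Python sketch. The CAI is computed in the standard way, as the geometric mean of each codon's usage relative to the most used synonymous codon, but the usage is supplied as a function of position so that either a constant (uniform assumption) or a decaying curve can be plugged in. The exponential form in `exp_usage`, its parameter names, and all numbers are assumptions for illustration; the caption does not give the fitted functional form or its parameters.

```python
from math import exp, log


def uniform_usage(p):
    """Usage that does not depend on position (the traditional assumption)."""
    return lambda pos: p


def exp_usage(p_start, p_distal, tau):
    """Hypothetical exponential form: usage relaxes from p_start at the
    start codon toward p_distal with decay length tau (in codons)."""
    return lambda pos: p_distal + (p_start - p_distal) * exp(-pos / tau)


def cai(codons, families):
    """Geometric mean of relative adaptiveness over a gene.

    `families` is a list of dicts, one per amino acid, mapping each
    synonymous codon to a usage function of position. Usage values are
    assumed to be strictly positive. Codons outside every family, or in a
    single-codon family, are skipped.
    """
    family_of = {c: fam for fam in families for c in fam}
    log_sum, n = 0.0, 0
    for pos, codon in enumerate(codons):
        fam = family_of.get(codon)
        if fam is None or len(fam) < 2:
            continue
        best = max(usage(pos) for usage in fam.values())
        log_sum += log(fam[codon](pos) / best)
        n += 1
    return exp(log_sum / n) if n else float("nan")


if __name__ == "__main__":
    gene = ["TTT", "TTT", "TTC", "TTC"]
    uniform = [{"TTT": uniform_usage(0.25), "TTC": uniform_usage(0.75)}]
    # Made-up parameters: TTT is favoured near the start, TTC distally.
    positional = [{"TTT": exp_usage(0.75, 0.25, 10.0),
                   "TTC": exp_usage(0.25, 0.75, 10.0)}]
    print(cai(gene, uniform), cai(gene, positional))
```

Under the uniform model a 5′ codon that is rare genome-wide lowers the score, whereas the positional model does not penalize it if that codon is typical at that position, which is the correction the red curve in panel (B) represents.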
Fig. 7.
Position-dependent codon usage bias in multiple organisms. (A) The observed log odds ratios for the exponential decay model fits relative to the uniform model for different organisms. (B) The distribution of τ values for E. coli and P. aeruginosa highlights potential differences in the evolutionary forces that have shaped the respective genomes.
