Mol Biol Evol. 2014 Jul;31(7):1880-93.

doi: 10.1093/molbev/msu126. Epub 2014 Apr 7.

Quantifying position-dependent codon usage bias

Adam J Hockenberry¹, M Irmak Sirer², Luís A Nunes Amaral³, Michael C Jewett⁴

Affiliations

¹ Department of Chemical and Biological Engineering, Northwestern University; Interdepartmental Program in Biological Sciences, Northwestern University.
² Department of Chemical and Biological Engineering, Northwestern University.
³ Department of Chemical and Biological Engineering, Northwestern University; Northwestern Institute on Complex Systems, Northwestern University; Howard Hughes Medical Institute, Northwestern University.
⁴ Department of Chemical and Biological Engineering, Northwestern University; Interdepartmental Program in Biological Sciences, Northwestern University; Northwestern Institute on Complex Systems, Northwestern University; Chemistry of Life Processes Institute, Northwestern University; Institute for BioNanotechnology and Medicine, Northwestern University. m-jewett@northwestern.edu

PMID: 24710515
PMCID: PMC4069614
DOI: 10.1093/molbev/msu126

Adam J Hockenberry et al. Mol Biol Evol. 2014 Jul.

Abstract

Although the mapping of codon to amino acid is conserved across nearly all species, the frequency at which synonymous codons are used varies both between organisms and between genes from the same organism. This variation affects diverse cellular processes including protein expression, regulation, and folding. Here, we mathematically model an additional layer of complexity and show that individual codon usage biases follow a position-dependent exponential decay model with unique parameter fits for each codon. We use this methodology to perform an in-depth analysis of codon usage bias in the model organism Escherichia coli. Our methodology shows that lowly and highly expressed genes are more similar in their codon usage patterns in the 5'-gene regions, but that these preferences diverge at distal sites, resulting in greater positional dependency (pD, which we mathematically define later) for highly expressed genes. We show that position-dependent codon usage bias is partially explained by the structural requirements of mRNAs that result in increased usage of A/T-rich codons shortly after the gene start. However, we also show that the pD of 4- and 6-fold degenerate codons is partially related to the gene copy number of cognate tRNAs, supporting existing hypotheses that posit benefits to a region of slow translation in the beginning of coding sequences. Lastly, we demonstrate that viewing codon usage bias through a position-dependent framework has practical utility: incorporating positional dependencies into the Codon Adaptation Index model improves the accuracy of gene expression prediction.

Keywords: coding sequence evolution; codon adaptation; codon usage bias; gene expression.

Figures

**Fig. 1.**
Codon usage bias is not uniform with regard to intragenic position. (A) This cartoon schematic shows one codon that is used evenly throughout the toy gene set (codon a, blue) and one codon that is not (codon b, orange). To statistically verify this, we align all genes at the 5′-region, group each codon into position-dependent bins, compare codon usage in each bin to random expectation, and sum the deviations over all bins. (B) Squared z scores of codon usage for *Escherichia coli* as a function of position. Codons on the y axis are grouped according to the amino acid they code for and are labeled red if their usage bias is significantly nonuniform (). Results for each bin are depicted according to the quadratically scaled color bar, and the ten bins are arranged from 5′ to 3′.
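
The binning procedure in panel A can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a binomial null in which the focal codon's share of its amino acid is the same in every bin, and the function name `binned_squared_z`, its arguments, and the toy sequences are hypothetical.

```python
def binned_squared_z(genes, codon, synonyms, bin_edges):
    """Squared z scores of `codon` usage in each positional bin.

    genes: in-frame coding sequences, all aligned at the start codon
    synonyms: every codon encoding the same amino acid as `codon`
    bin_edges: codon-position boundaries; [0, 2, 4] gives bins [0,2) and [2,4)
    """
    n_bins = len(bin_edges) - 1
    aa_counts = [0] * n_bins      # occurrences of the amino acid per bin
    codon_counts = [0] * n_bins   # occurrences of the focal codon per bin
    for seq in genes:
        for pos in range(len(seq) // 3):
            c = seq[3 * pos: 3 * pos + 3]
            if c not in synonyms:
                continue
            for b in range(n_bins):
                if bin_edges[b] <= pos < bin_edges[b + 1]:
                    aa_counts[b] += 1
                    codon_counts[b] += (c == codon)
                    break
    total_aa = sum(aa_counts)
    if total_aa == 0:
        return [0.0] * n_bins
    # Assumed null: one position-independent probability for the focal codon.
    p = sum(codon_counts) / total_aa
    z2 = []
    for n, k in zip(aa_counts, codon_counts):
        var = n * p * (1 - p)     # binomial variance of the count in this bin
        z2.append((k - n * p) ** 2 / var if var > 0 else 0.0)
    return z2


# Toy example: TTT fills the first two positions, TTC the last two.
toy = ["TTTTTTTTCTTC"]
print(binned_squared_z(toy, "TTT", {"TTT", "TTC"}, [0, 2, 4]))  # → [2.0, 2.0]
```

Summing the returned values gives a single deviation score per codon; how that sum is converted into a significance call is left open here.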

**Fig. 2.**
The functional form of codon usage bias. (A) For the amino acid phenylalanine, we show the conditional probability of observing a codon as a function of position (black line, smoothed with a sliding window of eight codons). We also show the best-fitting exponential model (red) with corresponding 95% confidence intervals (pink) and the uniform model (cyan, confidence intervals not shown for clarity). The survival curve of *Escherichia coli* gene lengths is highlighted at the top to illustrate the basis for increasingly wide confidence intervals due to data sparseness at distal sites. (B) Data for three different 2-fold redundant amino acids as in (A) but with the x axis extending only to 100 codons to highlight heterogeneity in the 5′ region.
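
The smoothed conditional probability described in panel A might be computed along these lines. This is a sketch under stated assumptions: counts are pooled across all positions inside each window (the caption does not say whether counts or per-position probabilities are averaged), and `positional_codon_probability` and its parameters are hypothetical names.

```python
def positional_codon_probability(genes, codon, synonyms, window=8):
    """Sliding-window estimate of P(codon | amino acid) by codon position.

    genes: non-empty list of in-frame coding sequences aligned at the start
    synonyms: every codon encoding the same amino acid as `codon`
    Returns one value per window start; None where the window has no data.
    """
    max_len = max(len(seq) // 3 for seq in genes)
    aa = [0] * max_len     # amino acid occurrences at each position
    hits = [0] * max_len   # focal codon occurrences at each position
    for seq in genes:
        for pos in range(len(seq) // 3):
            c = seq[3 * pos: 3 * pos + 3]
            if c in synonyms:
                aa[pos] += 1
                hits[pos] += (c == codon)
    probs = []
    for start in range(max_len - window + 1):
        n = sum(aa[start:start + window])    # pooled counts: an assumption
        k = sum(hits[start:start + window])
        probs.append(k / n if n else None)
    return probs
```

Because fewer genes reach distal positions, windows far from the start pool fewer observations, which is consistent with the widening confidence intervals the caption attributes to data sparseness.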

**Fig. 3.**
The effect of gene expression on position-dependent codon usage bias. (A) Illustration of the pD metric and exponential parameters. (B) pD of codons in the genes of low- and high-abundance proteins split according to codon prevalence (top) and third position base (bottom). We observe a significant difference in absolute pD of the codons between the two gene sets and differences within each gene set according to rare and abundant codons. Within gene sets, we also observed significant differences in pD between codons that end in A/T versus those that end in G/C. (C) For each codon, we took the absolute difference in codon probabilities between the low- and high-abundance protein data sets and did so at two different points, the beginning of sequences and the median. Shown are the cumulative distributions of these differences.

**Fig. 4.**
The link between codon usage bias and mRNA structure. (A) We folded a 200mer (−50 to +150 nt, relative to the start codon) region for each gene in the high abundance protein set and extracted the individual base pair probabilities. For clarity, we illustrate median pair probabilities relative to the null model created by synonymous shuffling within genes (green). Actual genes (blue) and an alternative gene set created by shuffling synonymous codons between genes in a manner that preserves positional biases (red) have significantly less structure in the 5′ region (Wilcoxon rank-sum test on raw data, for all cases illustrated). (B) We calculated the effect on folding energy of single synonymous codon substitutions in the genes of high abundance proteins. Left: The effect of substitutions in the 5′ region (−36 to +36 nt, relative to the start codon) is variable depending on the nature of the codon. Right: The same analysis for a region distal to the start codon (+36 to 108 nt). For all cases illustrated, error bars represent standard error of the mean and according to Wilcoxon rank-sum test.

**Fig. 5.**
pD in codon groups and its association with cognate-tRNA gene copy number. For all 4-fold redundant amino acids, we group codons into separate sets under the assumption that single tRNA species are more likely to read codons within these groupings according to wobble-base pairing than between groupings. We illustrate conditional probabilities as in figure 2 and highlight the gene copy number of the cognate tRNAs for each group (tRNA_GCN) to show that codons read by the rarer tRNAs are enriched in the 5′ region.

**Fig. 6.**
Accounting for position-dependent codon usage leads to superior estimates of gene expression levels. (A) Our model posits that selection for reduced mRNA structure around the start codon acts strongly on all sequences relative to disruptive processes such as genetic drift and mutational biases. However, preference for accurate and efficient translation is a second and weaker effect that is largely apparent in highly expressed genes and becomes stronger at distal sites. (B) Rather than calculating the CAI for each gene, we aligned genes at the start codon and calculated the CAI score for each position in either the reference set or genome. The dip in adaptedness after the start codon for both data sets (blue) is corrected by using exponential fits to the codon usage in the reference set (red). (C) For two data sets of transcript abundances (Taniguchi et al. 2010; Shiroguchi et al. 2012) and two data sets of protein abundances (Lu et al. 2007; Taniguchi et al. 2010), we show that the R² correlation coefficient between the CAI and gene expression data is increased when using exponential fits to calculate the CAI as opposed to the traditional uniform assumption. Top, raw values; bottom, % increase. Error bars show standard deviation from 10,000 bootstrap resampled sets (paired t-test, for all cases).
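
To make the contrast between the uniform and position-dependent CAI concrete, here is a minimal sketch. It uses the standard CAI definition (geometric mean of each codon's usage relative to the most used synonymous codon) and lets that usage depend on position. The frequencies, the exponential parameters, and all function names are made-up illustrations, not values or code from the paper.

```python
from math import exp, log

def make_weight(prob, families):
    """Relative adaptiveness w = p(codon, pos) / max over synonymous codons.

    prob(codon, pos): usage probability in the reference set (hypothetical)
    families: dict mapping each scored codon to its synonymous codons
    """
    def weight(codon, pos):
        if codon not in families:
            return None                      # codon is not scored
        best = max(prob(s, pos) for s in families[codon])
        return prob(codon, pos) / best
    return weight

def cai(seq, weight):
    """Geometric mean of weight(codon, position) over the scored codons."""
    logs = []
    for pos in range(len(seq) // 3):
        w = weight(seq[3 * pos: 3 * pos + 3], pos)
        if w is not None:
            logs.append(log(w))              # requires w > 0
    return exp(sum(logs) / len(logs))

PHE = ("TTT", "TTC")
FAMILIES = {c: PHE for c in PHE}

def uniform_prob(codon, pos):
    # Made-up position-independent reference frequencies.
    return {"TTT": 0.4, "TTC": 0.6}[codon]

def exponential_prob(codon, pos):
    # Made-up parameters: TTT starts enriched and decays toward 0.4.
    p_ttt = 0.4 + 0.3 * exp(-pos / 10.0)
    return p_ttt if codon == "TTT" else 1.0 - p_ttt

gene = "TTTTTC"
print(cai(gene, make_weight(uniform_prob, FAMILIES)))
print(cai(gene, make_weight(exponential_prob, FAMILIES)))
```

Under the uniform weights, TTT at the first position is penalized as the less common codon; under the position-dependent weights it scores 1.0 there because it is the preferred codon near the start in this toy model. A real implementation would also need a rule for codons with zero reference counts, which this sketch omits.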

**Fig. 7.**
Position-dependent codon usage bias in multiple organisms. (A) The observed log odds ratios for the exponential decay model fits relative to uniform model for different organisms. (B) The distribution of τ values for *E. coli* and *P. aeruginosa* highlights potential differences in the evolutionary forces that have shaped the respective genomes.
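
One way to read the log odds comparison in panel A is as a difference of binomial log-likelihoods between an exponential and a uniform model. The sketch below assumes a three-parameter form p(x) = c + a·exp(−x/τ), which is a guess at the functional form rather than the paper's stated equation, and it takes the parameters as given instead of fitting them. The names `log_likelihood` and `log_odds` are hypothetical.

```python
from math import exp, log

def log_likelihood(counts, model):
    """Binomial log-likelihood, up to a constant that cancels in differences.

    counts: list of (k, n) per position: focal codon count, amino acid count
    model(x): probability of the focal codon at position x, strictly in (0, 1)
    """
    ll = 0.0
    for x, (k, n) in enumerate(counts):
        p = model(x)
        ll += k * log(p) + (n - k) * log(1.0 - p)
    return ll

def log_odds(counts, a, tau, c):
    """Log-likelihood of the assumed exponential model minus the uniform one."""
    total_k = sum(k for k, _ in counts)
    total_n = sum(n for _, n in counts)
    p_uniform = total_k / total_n            # pooled frequency
    ll_exp = log_likelihood(counts, lambda x: c + a * exp(-x / tau))
    ll_uni = log_likelihood(counts, lambda x: p_uniform)
    return ll_exp - ll_uni
```

A positive value favors the exponential model for the supplied parameters. Any penalty for the extra parameters, and the fitting procedure itself, are outside the scope of this sketch.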

Cited by

Codon usage pattern and predicted gene expression in Arabidopsis thaliana.
Sahoo S, Das SS, Rakshit R. Gene X. 2019 Mar 6;2:100012. doi: 10.1016/j.gene.2019.100012. eCollection 2019 Jun. PMID: 32550546. Free PMC article.
Using the Mutation-Selection Framework to Characterize Selection on Protein Sequences.
Teufel AI, Ritchie AM, Wilke CO, Liberles DA. Genes (Basel). 2018 Aug 13;9(8):409. doi: 10.3390/genes9080409. PMID: 30104502. Free PMC article. Review.
Intragenomic variation in non-adaptive nucleotide biases causes underestimation of selection on synonymous codon usage.
Cope AL, Shah P. PLoS Genet. 2022 Jun 17;18(6):e1010256. doi: 10.1371/journal.pgen.1010256. eCollection 2022 Jun. PMID: 35714134. Free PMC article.
A novel framework for evaluating the performance of codon usage bias metrics.
Liu SS, Hockenberry AJ, Jewett MC, Amaral LAN. J R Soc Interface. 2018 Jan;15(138):20170667. doi: 10.1098/rsif.2017.0667. PMID: 29386398. Free PMC article.
Leveraging genome-wide datasets to quantify the functional role of the anti-Shine-Dalgarno sequence in regulating translation efficiency.
Hockenberry AJ, Pah AR, Jewett MC, Amaral LA. Open Biol. 2017 Jan;7(1):160239. doi: 10.1098/rsob.160239. PMID: 28100663. Free PMC article.
