Nucleic Acids Res. 2008 Jun;36(11):3819-27.
doi: 10.1093/nar/gkn288. Epub 2008 May 21.

SCUMBLE: a method for systematic and accurate detection of codon usage bias by maximum likelihood estimation

Morten Kloster et al. Nucleic Acids Res. 2008 Jun.

Abstract

The genetic code is degenerate: most amino acids can be encoded by two to six different codons. Synonymous codons are not used with equal frequency: some codons are favored over others, and their usage can vary significantly from species to species and between different genes in the same organism. Known causes of codon bias include differences in mutation rates as well as selection pressure related to the expression level of a gene, but the standard analysis methods can account for only a fraction of the observed codon usage variation. Here we introduce an explicit model of codon usage bias, inspired by statistical physics. Combining this model with a maximum likelihood approach, we are able to clearly identify different sources of bias in various genomes. We have applied the algorithm to Saccharomyces cerevisiae as well as 325 prokaryote genomes, and in most cases our model explains essentially all observed variance.
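To make the notions of degeneracy and unequal synonymous codon usage concrete, here is a minimal Python sketch. It is not the SCUMBLE model: it only builds the standard genetic code, counts how many codons encode each amino acid, and computes the relative frequency of each synonymous codon in a sequence. Under a plain multinomial model with no gene-specific trends, these observed frequencies are the maximum likelihood estimates. The function names and the example sequence are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Standard genetic code in the conventional TCAG ordering ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def degeneracy():
    """Return the number of codons encoding each amino acid (stops excluded)."""
    counts = Counter(CODON_TABLE.values())
    counts.pop("*", None)
    return dict(counts)


def synonymous_frequencies(seq):
    """Relative frequency of each observed codon among its synonyms.

    Hypothetical helper: reads `seq` in frame from position 0, ignores a
    trailing partial codon, stop codons, and codons with unknown bases.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(
        c for c in codons if c in CODON_TABLE and CODON_TABLE[c] != "*"
    )
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


if __name__ == "__main__":
    print(degeneracy())
    # Four leucine codons: CTG twice, CTA once, TTA once.
    print(synonymous_frequencies("CTGCTGCTATTA"))
```

In this toy sequence the leucine codon CTG accounts for half of all leucine codons. This is the kind of per-gene deviation from uniform usage that codon bias analyses try to explain.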

Figures

Figure 1.
(a) Cumulative histogram of the normalized variance for named genes in S. cerevisiae for models with various numbers of trends; actual genome (solid lines) compared to randomized genome (dotted lines). Models with 0 or 1 trend explain the data poorly, as the curve for the real genome is very different from that of a randomized genome, and there are many genes with very high normalized variance. (b) Average (black) and median (red) normalized variance for models with up to 10 trends.
Figure 2.
Experimental values for cellular mRNA/protein levels plotted against the first offset/CAI value of each gene for S. cerevisiae. Several groups of highly expressed genes are plotted in different colors.
Figure 3.
Median normalized variance for 325 prokaryote genomes, using models with 0–10 trends. The different genomes are slightly offset along the abscissa, in alphabetical order. The dotted brown line shows approximate median normalized variance for randomized genomes generated from the models (Supplementary Figure S7). Results for the average normalized variance are very similar, except that in rare but not exceptional cases, individual genes dominate the average due to extremely low estimated probabilities of using a specific codon which is, in fact, used.
Figure 4.
A four-trend model of Helicobacter pylori. (a)–(c) GC3 or GT3 plotted against the first three offsets. Genes for ribosomal proteins are circled in red. The cumulative distributions of the offsets are shown above each graph, for all genes (black) and for ribosomal genes (red). (d) β2 plotted against the number of the gene along the genome, with genes on different strands in different colors. The green and blue lines are 50-point running averages for strand 1 and 2, respectively.
Figure 5.
Scatter plot of the first two axes from the four-trend model found by SCUMBLE (a), WCA (b) and CA/RSCU (c) for the genes of Anaeromyxobacter dehalogenans. Genes for ribosomal proteins are circled in red. In (b) and (c), most genes are clustered near the origin; only a small fraction of the genes have significantly negative abscissae.
Figure 6.
Solid lines: number of prokaryote genomes (out of 325) for which the total fraction of the GC (a), GT (b), CT (c) or random (d) preference signal captured by the first n trends exceeds the abscissa, where n is given by the color. Total shaded area of each color is proportional to the average fraction of signal captured by the corresponding trend.
Figure 7.
Scatter plot of the first two offsets for the four-trend model of B. subtilis, with genes colored by their cluster identity as given in ref. (17).
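The Figure 4 caption plots GC3 and GT3 against the model offsets without defining them here. As a rough sketch, assuming GC3 is the fraction of codons whose third base is G or C, and assuming GT3 is defined by analogy as the fraction whose third base is G or T, they could be computed as below. The helper name, and the choice to count every complete codon (including Met, Trp and stop codons), are assumptions for illustration and may differ from the paper's convention.

```python
def third_position_fraction(seq, bases):
    """Fraction of complete codons in `seq` whose third base is in `bases`.

    Hypothetical helper: reads `seq` in frame from position 0 and ignores
    a trailing partial codon.
    """
    # Indices 2, 5, 8, ... exist only for complete codons.
    thirds = seq.upper()[2::3]
    if not thirds:
        raise ValueError("sequence contains no complete codon")
    return sum(base in bases for base in thirds) / len(thirds)


def gc3(seq):
    """GC3 under the assumption stated above: third base is G or C."""
    return third_position_fraction(seq, "GC")


def gt3(seq):
    """GT3 under the assumption stated above: third base is G or T."""
    return third_position_fraction(seq, "GT")


if __name__ == "__main__":
    gene = "ATGGCCAAATTT"  # codons ATG GCC AAA TTT, third bases G C A T
    print(gc3(gene), gt3(gene))
```

Computing one such value per gene gives the vertical coordinates of a scatter plot like panels (a)–(c); the horizontal coordinates (the offsets) come from the fitted model and are not reproduced by this sketch.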

References

    1. Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J. Mol. Biol. 1981;146:1–21.
    2. Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 1981;151:389–409.
    3. Bennetzen JL, Hall BD. Codon selection in yeast. J. Biol. Chem. 1982;257:3026–3031.
    4. Ikemura T. Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in its protein genes: differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer RNAs. J. Mol. Biol. 1982;158:573–597.
    5. Bibb MJ, Findlay PR, Johnson MW. The relationship between base composition and codon usage in bacterial genes and its use for simple and reliable identification of protein-coding sequences. Gene. 1984;30:157–166.