The genetic code is nearly optimal for allowing additional information within protein-coding sequences

Shalev Itzkovitz¹, Uri Alon

PMID: 17293451
PMCID: PMC1832087
DOI: 10.1101/gr.5987307

Genome Res. 2007 Apr;17(4):405-12. Epub 2007 Feb 9.

Affiliation

¹ Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel.

Abstract

DNA sequences that code for proteins need to convey, in addition to the protein-coding information, several different signals at the same time. These "parallel codes" include binding sequences for regulatory and structural proteins, signals for splicing, and RNA secondary structure. Here, we show that the universal genetic code can efficiently carry arbitrary parallel codes much better than the vast majority of other possible genetic codes. This property is related to the identity of the stop codons. We find that the ability to support parallel codes is strongly tied to another useful property of the genetic code: minimization of the effects of frame-shift translation errors. Whereas many of the known regulatory codes reside in nontranslated regions of the genome, the present findings suggest that protein-coding regions can readily carry abundant additional information.

Figures

**Figure 1.**
Alternative genetic codes. (A) The real code. (B) An alternative code obtained by an A↔G permutation in the first position. (C) An alternative code obtained by an A↔C permutation in the second position. (D) An alternative code obtained by an A↔G permutation in the third position. Stop codons are marked in red, start (Met) codons in green. Codons that are changed relative to the real code are in gray. There are 4! × 4! × 2 = 1152 alternative codes obtained by independent permutations of the nucleotides in each of the three codon positions. (E,F) Structural equivalence of real and alternative genetic codes. For example, (E) the nine neighboring codons of the Valine codon marked with a red arrow in the real code (shown in A) are the same as (F) the nine neighboring codons of the Valine codon marked with a red arrow in the alternative code shown in B. Solid lines connect codons differing in the first letter, dotted lines connect codons differing in the second letter, and dashed lines connect codons differing in the third letter. Different amino acids are displayed in different colors. This equivalence applies to all codons.
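
The count of 1152 in the Figure 1 caption can be reproduced with a short enumeration. The sketch below is illustrative only: it assumes that the factor of 2 for the third position corresponds to the identity plus the A↔G swap shown in panel D, and it tracks only where the three stop codons of the real code end up. The names `permute_codon` and `alternative_stop_sets` are hypothetical helpers, not taken from the paper.

```python
from itertools import permutations, product

BASES = "UCAG"
REAL_STOPS = ("UAA", "UAG", "UGA")

# All 4! = 24 relabelings of the bases, used for codon positions 1 and 2.
ALL_MAPS = [dict(zip(BASES, p)) for p in permutations(BASES)]

# Assumption: position 3 allows only the identity and the A<->G swap.
IDENTITY = {b: b for b in BASES}
SWAP_AG = {**IDENTITY, "A": "G", "G": "A"}
THIRD_MAPS = [IDENTITY, SWAP_AG]


def permute_codon(codon, maps):
    """Apply one base relabeling per codon position."""
    return "".join(m[base] for m, base in zip(maps, codon))


def alternative_stop_sets():
    """Yield the stop-codon set of every permuted code (real code included)."""
    for maps in product(ALL_MAPS, ALL_MAPS, THIRD_MAPS):
        yield frozenset(permute_codon(c, maps) for c in REAL_STOPS)


stop_sets = list(alternative_stop_sets())
print(len(stop_sets))  # 1152 (= 4! * 4! * 2)
```

Under this assumption the stop sets named in the Figure 3 caption (CAA, CAG, CGA and AUA, AUG, AAA) both appear among the generated codes.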

**Figure 2.**
(A) Calculation of the probability that an n-mer sequence appears within a protein-coding region in the real genetic code. The 5-mer sequence S = UGACA can appear in one of the three reading frames. For each reading frame, the probabilities of all three codon combinations that contain S are summed up. Codon combinations with an in-frame stop (such as UGA) do not contribute to the n-mer probability since they cannot appear in a coding region. Vertical lines separate consecutive codons, stop codons are in red, and P₀, P₋₁, P₊₁ denote the probabilities of encountering S in the 0/−1/+1 frame. (B,C,D) Three examples of “difficult” n-mers in the real code and in alternative codes. (B) The 5-mer UGACA, which includes the stop codon UGA, can appear in a protein-coding sequence with the real genetic code in only two of the three possible reading frames (+1 and −1 frames). (C) In the alternative code shown in Figure 3D, whose stop codon AAA overlaps with itself, the 5-mer AAAAA cannot appear in a protein-coding sequence in any of the three reading frames. (D) In an alternative code with the overlapping stop codons CCG and CGG, the 5-mer CCGGU can only appear in one reading frame. The 5-mers are in bold text, stop codons are in red, N denotes any DNA letter, green v denotes a frame in which the n-mer can appear, and red x denotes a frame in which the n-mer cannot appear. (E) Distribution of the probabilities of all 6-mers in the real code (bold black line) and in the alternative codes (light blue lines). The x-axis is the probability of obtaining 6-mers within protein-coding sequences; the y-axis is the number of 6-mers with this probability. In the real code there are significantly fewer “difficult” 6-mers (with low probabilities) than in the alternative codes. (F) The fraction of n-mers that have a higher probability in the real code than in alternative codes increases with n-mer size. The y-axis shows the fraction of n-mers for which the average probability of appearing in the real genetic code is significantly higher than in the alternative codes.
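
The per-frame calculation described in panel A can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes every sense codon is equally likely (the paper weights codons by their abundance), and it indexes frames by the offset of the n-mer within its first codon (0, 1, 2) rather than by the caption's 0/−1/+1 labels. The function name `nmer_frame_probabilities` is hypothetical.

```python
from itertools import product

BASES = "UCAG"
REAL_STOPS = frozenset({"UAA", "UAG", "UGA"})


def nmer_frame_probabilities(nmer, stops=REAL_STOPS):
    """Probability of the n-mer at each of the three frame offsets.

    Assumption: all sense codons are equally likely and independent.
    """
    p_codon = 1.0 / (64 - len(stops))
    probs = []
    for offset in range(3):
        # Pad with free letters (N) so the window covers whole codons.
        total = (offset + len(nmer) + 2) // 3 * 3
        n_free = total - len(nmer)
        p = 0.0
        for fill in product(BASES, repeat=n_free):
            seq = "".join(fill[:offset]) + nmer + "".join(fill[offset:])
            codons = [seq[i:i + 3] for i in range(0, total, 3)]
            # Windows with an in-frame stop cannot occur in a coding region.
            if not any(c in stops for c in codons):
                p += p_codon ** len(codons)
        probs.append(p)
    return probs


print(nmer_frame_probabilities("UGACA"))
print(nmer_frame_probabilities("AAAAA", stops=frozenset({"AUA", "AUG", "AAA"})))
```

With the real stop codons, UGACA gets zero probability only at the offset where UGA is read in frame, matching panel B. With the stop set AUA, AUG, AAA from Figure 3D, AAAAA gets zero probability at all three offsets, matching panel C.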

**Figure 3.**
Optimality of the genetic code for minimizing the impact of frame-shift translation errors. (A) Distribution of the average number of translated codons until a stop codon is encountered after a frame-shift event for the alternative genetic codes. This number corresponds to the mean length of the nonsense polypeptide translated after a frame-shift event, and is the inverse of the frame-shifted stop probability, averaged over the +1 and −1 frame-shifts. (B) In the real code, frame-shifted stop codons overlap with abundant codons. Codons with two-letter overlap with a stop codon are marked by + for a +1 frame-shift and – for a −1 frame-shift. Abundant codons are shown in heavier font. For example, the stop codon UAA, when frame-shifted, results in codons such as AAN (green box) or NUA (blue boxes), which are relatively abundant. (C) The “best code,” which achieves the highest frame-shifted stop probability in both a +1 frame-shift and a −1 frame-shift. Stop codons are CAA, CAG, and CGA. In the “best code,” a stop codon has an overlap of two positions with codons of Glycine, rather than with codons of Serine and Arginine as in the real code. (D) The “worst code,” with the lowest frame-shifted stop probability. Stop codons are AUA, AUG, and AAA. Note that the stop codons overlap either with themselves (AAA) or with codons for nonabundant amino acids (those with light font), in contrast to B and C.
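
The relation in panel A between the frame-shifted stop probability and the mean nonsense-peptide length can be illustrated with a small calculation. This is a sketch under strong assumptions that are not in the paper: sense codons are taken as equally likely and independent (the paper uses codon abundances), and successive out-of-frame codons are treated as independent trials, so the mean number of codons read up to and including the first stop is 1/p. The shift is given as a nucleotide offset (1 or 2) because the mapping to the caption's +1/−1 labels depends on convention. Function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
REAL_STOPS = frozenset({"UAA", "UAG", "UGA"})


def shifted_stop_probability(shift, stops=REAL_STOPS):
    """Chance that an out-of-frame triplet spanning two sense codons is a stop.

    shift is the nucleotide offset (1 or 2) into the first codon.
    Assumption: sense codons are uniform and independent.
    """
    if shift not in (1, 2):
        raise ValueError("shift must be 1 or 2")
    all_codons = map("".join, product(BASES, repeat=3))
    sense = [c for c in all_codons if c not in stops]
    hits = sum(
        (a + b)[shift:shift + 3] in stops for a, b in product(sense, repeat=2)
    )
    return hits / len(sense) ** 2


def mean_codons_until_stop(p_stop):
    """Mean of a geometric distribution with success probability p_stop."""
    return 1.0 / p_stop


for shift in (1, 2):
    p = shifted_stop_probability(shift)
    print(shift, p, mean_codons_until_stop(p))
```

Passing `stops=frozenset({"AUA", "AUG", "AAA"})` evaluates the “worst code” of panel D under the same uniform model; the absolute numbers will differ from the paper's because codon abundances are ignored here.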

**Figure 4.**
The parallel coding property is strongly tied to the translational frame-shift robustness property. Each point represents one of the alternative codes. The x-axis shows the probability of encountering a stop codon upon a frame-shift event (averaged over the +1 and −1 frame-shifts). The y-axis is the average probability of appearance of the 10% most difficult 6-mers. The arrow indicates the real code. The correlation between the two properties is 0.8. The real code is on the Pareto front, meaning that no alternative code is better than the real code in both properties. Similar results are obtained for n-mers of other sizes. Note that due to symmetries in the alternative codes with respect to the features studied (Supplemental material), multiple alternative codes often have the same values.
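
The Pareto-front statement in the Figure 4 caption can be made concrete with a small check. The sketch below follows the caption's wording (a code is on the front if no other code is strictly better in both properties, with larger values taken as better on both axes). The function name `on_pareto_front` is hypothetical and the numbers in `toy_codes` are made up for illustration; they are not values from the paper.

```python
def on_pareto_front(point, others):
    """True if no other point is strictly larger in both coordinates."""
    x, y = point
    return not any(ox > x and oy > y for ox, oy in others)


# Toy (x, y) pairs: x = frame-shifted stop probability,
# y = mean probability of the most difficult 6-mers. Invented values.
toy_codes = [(0.03, 0.10), (0.05, 0.20), (0.04, 0.25), (0.02, 0.30)]

for code in toy_codes:
    rest = [c for c in toy_codes if c != code]
    print(code, on_pareto_front(code, rest))
```

In this toy set only (0.03, 0.10) is off the front, because (0.05, 0.20) beats it on both axes.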

Comment in

Evolution and multilevel optimization of the genetic code.
Bollenbach T, Vetsigian K, Kishony R. Genome Res. 2007 Apr;17(4):401-4. doi: 10.1101/gr.6144007. Epub 2007 Mar 9. PMID: 17351130. Review.
