Bioinform Biol Insights. 2014 May 20:8:93-108.

doi: 10.4137/BBI.S13161. eCollection 2014.

The Purine Bias of Coding Sequences is Determined by Physicochemical Constraints on Proteins

Miguel Ponce de Leon¹, Antonio Basilio de Miranda², Fernando Alvarez-Valin¹, Nicolas Carels²

Affiliations

¹ Sección Biomatemática, Facultad de Ciencias, Universidad de la República, Iguá, Montevideo, Uruguay.
² Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil.

PMID: 24899802
PMCID: PMC4039185
DOI: 10.4137/BBI.S13161

Miguel Ponce de Leon et al. Bioinform Biol Insights. 2014.

Abstract

For this report, we analyzed protein secondary structures in relation to the statistics of the three nucleotide codon positions. The purpose of this investigation was to find which properties, at the ribosome, tRNA, or protein level, could explain the purine bias (Rrr) as it is observed in coding DNA. We found that the Rrr pattern is the consequence of a regularity (the codon structure) resulting from physicochemical constraints on proteins and thermodynamic constraints on ribosomal machinery. The physicochemical constraints on proteins mainly come from the hydropathy and molecular weight (MW) of secondary structures as well as the energy cost of amino acid synthesis. These constraints appear through a network of statistical correlations, such as (i) the cost of amino acid synthesis, which is in favor of a higher level of guanine in the first codon position, (ii) the constructive contribution of hydropathy alternation in proteins, (iii) the spatial organization of secondary structure in proteins according to solvent accessibility, (iv) the spatial organization of secondary structure according to amino acid hydropathy, (v) the statistical correlation of MW with protein secondary structures and their overall hydropathy, (vi) the statistical correlation of thymine in the second codon position with hydropathy and the energy cost of amino acid synthesis, and (vii) the statistical correlation of adenine in the second codon position with amino acid complexity and the MW of protein secondary structures. Amino acid physicochemical properties and functional constraints on proteins constitute a code that is translated into a purine bias within the coding DNA via tRNAs. In that sense, the Rrr pattern within coding DNA is the effect of information transfer on nucleotide composition from protein to DNA by selection according to the codon positions. Thus, coding DNA structure and ribosomal machinery co-evolved to minimize the energy cost of protein coding given the functional constraints on proteins.
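
The per-position statistics that the abstract refers to can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function name `purine_fraction_by_position` and the example sequence are hypothetical, and the sketch assumes an in-frame coding sequence made only of A, C, G, and T.

```python
PURINES = {"A", "G"}  # purines (R); C and T are pyrimidines (Y)


def purine_fraction_by_position(cds: str) -> list[float]:
    """Return the purine fraction at codon positions 1, 2, and 3.

    Assumes `cds` starts in frame; a trailing partial codon is ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # length covered by complete codons
    fractions = []
    for pos in range(3):
        bases = seq[pos:usable:3]  # every third base, starting at this position
        if not bases:
            fractions.append(0.0)  # no complete codon in the input
            continue
        purine_count = sum(base in PURINES for base in bases)
        fractions.append(purine_count / len(bases))
    return fractions


# Hypothetical toy sequence with codons GCA, GAT, GAA
r1, r2, r3 = purine_fraction_by_position("GCAGATGAA")
print(f"R1={r1:.2f} R2={r2:.2f} R3={r3:.2f}")
```

A purine fraction that differs between the three positions across many coding sequences is the kind of signal the purine bias describes. The same slicing approach could be adapted to count a single base at one position, for example guanine in the first position (G1) or thymine in the second (T2).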

Keywords: RNY; ancestral codon; energy cost; genomics; helix; purine bias; ribosome; secondary structure; sheet; translation; turn coil.

Figures

**Figure 1**
Relative frequency of secondary structures according to the protein size. The sample size for each structure is n = 10,731. **Notes:** A, aperiodic (median = 49%, average = 50%, σ = 14.2%, skewness = 1.0) (A). B, α-helix (median = 32%, average = 33%, σ = 18.4%, skewness = 0.5) (H). C, β-sheets (median = 19%, average = 24%, σ = 14.7%, skewness = 0.7) (E).

**Figure 2**
Relationships between the H and E proportions in protein sequences from PDB. The regression line is y = −1.33x + 60.07 with a correlation coefficient r = −0.75.

**Figure 3**
Average features of amino acids encoded by the G1, A1, C1, and T1 codons weighted by relative frequencies per structure. In A, codons in G1, A1, C1, and T1 account for 37, 27, 21, and 15%, respectively. In H, codons in G1, A1, C1, and T1 account for 36, 26, 22, and 16%, respectively. In E, codons in G1, A1, C1, and T1 account for 33, 29, 18, and 20%, respectively (see Tables S7–S9). **Notes:** A, MW. B, number of chemical bonds. C, energy cost for synthesis. D, hydropathy.

**Figure 4**
Average frequencies of amino acids with A1, G1, C1, and T1 codons (see Tables S6–S9) according to MW. **Notes:** A, dataset of non-redundant proteins from PDB (r = −0.584, P < 0.05). B, A structures in PDB (r = −0.626, P < 0.01). C, H structures in PDB (r = −0.236, P = 0.321). D, E structures in PDB (r = −0.272, P = 0.251).

**Figure 5**
Relationships between purines in the three codon positions according to periodic and aperiodic structures from PDB. The sample size for each structure is n = 10,731. **Notes:** Panel A, r_A = 0.476**, *A: y* = 0.98x + 7.66, r_H = 0.308**, *H: y* = 0.95x + 9.17, r_E = 0.111**, and *E: y* = 0.42x + 11.98. Panel B, r_A = 0.549**, *A: y* = 2.16x − 39.05, r_H = 0.426**, *H: y* = 2.38x − 41.33, r_E = 0.400**, and *E: y* = 1.84x − 35.92. Panel C, r_A = 0.318**, *A: y* = 0.53x − 1.18, r_H = −0.017, *H: y* is not defined because P > 0.05, r_E = −0.001, and *E: y* = −0.07x + 16.56. Panel D, r_A = 0.255**, *A: y* = 3.01x − 86.68, r_H = 0.193**, *H: y* = 3.85x − 107.85, and r_E = 0.187**, *E: y* = 2.97x − 72.90 (**statistical significance at P < 0.001).

**Figure 6**
Relationships between A, G, and T in the second codon position according to periodic and aperiodic structures. The sample size for each structure is n = 10,731. **Notes:** Panel A, r_A = −0.141**, *A: y* = −6.61x + 167.03, r_H = −0.320**, *H: y* = −2.28x + 105.66, r_E = 0.506**, and *E: y* = −0.80x + 59.65. Panel B, r_A = −0.530**, *A: y* = −1.60x + 68.33, r_H = −0.422**, *H: y* = −2.31x + 65.95, and r_E = −0.189**, *E: y* = −2.91x + 65.62 (**statistical significance at P < 0.001).

**Figure 7**
Relationships between purines in the three codon positions according to periodic and aperiodic structures. The sample size for each structure is n = 10,731. **Notes:** Panel A, r_A = 0.289**, *A: y* = 0.71x + 18.48, r_H = −0.016, H: y is not defined because P > 0.05, and r_E = −0.257**, *E: y* = −0.81x + 89.15. Panel B, r_A = 0.164**, *A: y* = 1.85x − 73.87, r_H = 0.135**, *H: y* = 1.79x − 59.92, and r_E = 0.108**, *E: y* = 1.14x − 27.24 (**statistical significance at P < 0.001).

**Figure 8**
Relationships between GC2 and GC3 according to periodic and aperiodic structures. The sample size for each structure is n = 10,731. **Notes:** r_E = 0.223**, *E: y* = 9.79x − 250.91, r_H = 0.376**, *H: y* = 5.49x − 131.49, and r_A = 0.459**, *A: y* = 5.21x − 181.28 (**statistical significance at P < 0.001).

**Figure 9**
Scatter plot of G3 versus C3. The sample size for each structure is n = 10,731. **Notes:** Red is for H (r = 0.397**, y = 0.67x + 12.01), green is for E (r = 0.325**, y = 0.43x + 12.42), and black is for A (r = 0.550**, y = 0.49x + 9.88) (**statistical significance at P < 0.001).

**Figure 10**
Relationships between hydropathy, ASA, average MW, and the energy cost of amino acid synthesis in protein secondary structures. The sample size for each structure is n = 10,731. **Notes:** Panel A, ASA, r_A = −0.755**, *A: y* = −0.031x + 0.463, r_H = −0.821**, *H: y* = −0.03x + 0.446, and r_E = 0.840**, *E: y* = −0.029x + 0.435. Panel B, MW, r_A = −0.454**, *A: y* = −29.24x + 103.00, r_H = −0.575**, *H: y* = −18.89x + 129.52, and r_E = −0.523**, *E: y* = −16.78x + 144.62. Panel C, energy cost of amino acid synthesis, r = 0.605, y = 4x + 21 (**statistical significance at P < 0.001).

**Figure 11**
Relationships between the number of heteroatoms (NOS), hydropathy, and ASA. The sample size for each structure is n = 10,731. **Notes:** Panel A, r_A = −0.821**, *A: y* = −0.28x + 0.80, r_H = −0.862**, *H: y* = −0.28x + 0.88, and r_E = 0.893**, *E: y* = −0.29x + 0.90. Panel B, r_A = 0.406**, *A: y* = 20.4x − 8.94, r_H = 0.598**, *H: y* = 14.70x − 5.68, and r_E = 0.665**, *E: y* = 12.50x − 4.16 (dashed line of panel B: y = 3.0x − 1.25) (**statistical significance at P < 0.001).

**Figure 12**
Relationships between MW, the energy cost of amino acid synthesis, A2, and G1. The sample size for each structure is n = 10,731. **Notes:** Panel A, amino acid synthesis, r_A = 0.486**, *A: y* = 4.59x + 44.14, r_H = 0.527**, *H: y* = 3.35x + 63.28, and r_E = 0.602**, *E: y* = 2.52x + 71.80. Panel B, A2, r_A = 0.619**, *A: y* = 0.47x + 109.98, r_H = 0.560**, *H: y* = 0.43x + 117.53, and r_E = 0.564**, *E: y* = 0.50x + 119.62. Panel C, G1, r_A = −0.590**, *A: y* = −0.48x + 143.97, r_H = −0.526**, *H: y* = −0.41x + 147.21, and r_E = −0.586**, *E: y* = −0.44x + 146.39 (**statistical significance at P < 0.001).

**Figure 13**
Relationships between T2, ASA, hydropathy, and the energy cost of amino acid synthesis. The sample size for each structure is n = 10,731. **Notes:** Panel A, r = −0.903**, *A: y* = −345.45x + 190.0. Panel B, r_A = 0.597**, r_H = 0.670**, and r_E = 0.863**, *E: y* = 17.27x + 30.97. Panel C, r = 0.605, y = 4.23x − 55 (**statistical significance at P < 0.001).

Cited by

Moonlighting genes harbor antisense ORFs that encode potential membrane proteins.
Thomas KE, Gagniuc PA, Gagniuc E. Sci Rep. 2023 Aug 3;13(1):12591. doi: 10.1038/s41598-023-39869-x. PMID: 37537268 Free PMC article.
A Metagenomic Analysis of Bacterial Microbiota in the Digestive Tract of Triatomines.
Carels N, Gumiel M, da Mota FF, de Carvalho Moreira CJ, Azambuja P. Bioinform Biol Insights. 2017 Sep 27;11:1177932217733422. doi: 10.1177/1177932217733422. eCollection 2017. PMID: 28989277 Free PMC article.
Plant Tolerance to Drought Stress with Emphasis on Wheat.
Adel S, Carels N. Plants (Basel). 2023 May 30;12(11):2170. doi: 10.3390/plants12112170. PMID: 37299149 Free PMC article. Review.
An Interpretation of the Ancestral Codon from Miller's Amino Acids and Nucleotide Correlations in Modern Coding Sequences.
Carels N, Ponce de Leon M. Bioinform Biol Insights. 2015 Apr 15;9:37-47. doi: 10.4137/BBI.S24021. eCollection 2015. PMID: 25922573 Free PMC article.
Physicochemical Foundations of Life that Direct Evolution: Chance and Natural Selection are not Evolutionary Driving Forces.
Auboeuf D. Life (Basel). 2020 Jan 21;10(2):7. doi: 10.3390/life10020007. PMID: 31973071 Free PMC article.
