2015 May 13;7(6):1519-32.
doi: 10.1093/gbe/evv088.

Global Shifts in Genome and Proteome Composition Are Very Tightly Coupled

Maria Brbić et al. Genome Biol Evol. 2015.

Abstract

The amino acid composition (AAC) of proteomes differs greatly between microorganisms and is associated with the environmental niche they inhabit, suggesting that these changes may be adaptive. Similarly, the oligonucleotide composition of genomes varies and may confer advantages at the DNA/RNA level. These influences overlap in protein-coding sequences, making it difficult to gauge their relative contributions. We disentangle these effects by systematically evaluating the correspondence between intergenic nucleotide composition, where protein-level selection is absent, the AAC, and ecological parameters of 909 prokaryotes. We find that G + C content, the most frequently used measure of genomic composition, cannot capture diversity in AAC and across ecological contexts. However, di-/trinucleotide composition in intergenic DNA predicts amino acid frequencies of proteomes to the point where very little cross-species variability remains unexplained (91% of variance accounted for). Qualitatively similar results were obtained for 49 fungal genomes, where 80% of the variability in AAC could be explained by the composition of introns and intergenic regions. Upon factoring out oligonucleotide composition and phylogenetic inertia, the residual AAC is poorly predictive of the microbes' ecological preferences, in stark contrast with the original AAC. Moreover, highly expressed genes do not exhibit more prominent environment-related AAC signatures than lowly expressed genes, despite contributing more to the effective proteome. Thus, evolutionary shifts in overall AAC appear to occur almost exclusively through factors shaping the global oligonucleotide content of the genome. We discuss these results in light of contravening evidence from biophysical data and further reading frame-specific analyses that suggest that adaptation takes place at the protein level.
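The compositional features named in the abstract (G + C content, di-/trinucleotide frequencies, and AAC) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, k-mers are counted over overlapping windows on a single strand, and reverse-complement k-mers are not merged, all of which are assumptions.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq, k):
    """Relative frequency of every DNA k-mer over overlapping windows.

    Windows containing symbols other than A, C, G, T (for example N)
    are ignored. Assumption: single strand, no reverse-complement merging.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def amino_acid_composition(proteins):
    """Relative frequency of the 20 standard amino acids over a proteome."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    counts = Counter("".join(proteins).upper())
    total = sum(counts[a] for a in alphabet)
    return {a: (counts[a] / total if total else 0.0) for a in alphabet}
```

In such a setup, `kmer_frequencies(intergenic, 2)` and `kmer_frequencies(intergenic, 3)` would supply the predictor features and `amino_acid_composition(proteins)` the regression targets, one value per amino acid.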

Keywords: amino acid composition; ecological preferences; fungal genome; intergenic DNA; oligonucleotide composition; prokaryotic genome; support vector regression.

Figures

Fig. 1.—
The oligonucleotide frequencies in the noncoding DNA of prokaryotes are highly predictive of their proteome compositions. (A) Explained variance (as squared Pearson correlation coefficient, R²) in the amino acid usage of proteomes in a multiple regression against different sets of features; by considering only the G + C content (blue bars), and by progressively including also the dinucleotide frequencies (red), the trinucleotides (teal), and phylogenetic groups (purple). Error bars are standard deviations from ten runs of cross-validation. (B, C) The median variance explained using the same sets of features over all 20 amino acids (B) or only over the seven G + C balanced amino acids (THEVDQC) (C). The “bias estimate” is from bootstrapping (Materials and Methods).
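The explained variance in this caption is reported as a squared Pearson correlation between observed and predicted amino acid frequencies. A minimal sketch of that statistic, assuming the predictions come from held-out cross-validation folds; the function name is hypothetical:

```python
def squared_pearson(observed, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov * cov / (var_o * var_p)
```

Because the correlation is squared, this measure does not distinguish a positive from a negative association, and it is insensitive to a constant offset or scaling of the predictions.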
Fig. 2.—
Nonlinear SVM regression models that predict amino acid usage in proteomes from G + C and dinucleotide frequencies in noncoding DNA. Dependency of relative frequencies of Ala (A) and Met (B) in proteomes on the G + C content of DNA, as examples of a linear and nonlinear relationship, respectively. Each dot is a prokaryotic chromosome (>200 kb in size). Red curves show SVM predictions. Several examples which deviate strongly from the dominant trend are highlighted by the vertical lines that show residuals of the regression. SVM regression models that regress the relative frequency of Thr (C) and Val (D) in proteomes against a combination of the G + C content and the frequency of the ApC + GpT dinucleotide.
Fig. 3.—
Accuracy in classifying prokaryotes by environmental preference from the AAC of proteomes and from oligonucleotide frequencies in noncoding DNA. (A, B) Distributions of AACs (given as relative frequencies of each amino acid) across proteomes, as well as the residuals of the amino acid composition in SVM regression. Asterisks are Mann–Whitney tests (two-tailed) applied to distributions of residuals. *FDR < 25%; **FDR < 10%; ***FDR < 1%. ROC curves for discriminating thermophiles from mesophiles (C) and strict anaerobes from aerotolerant organisms (D). Orange curves show predictions from AAC in proteomes, green curves from noncoding DNA (G + C content, di- and trinucleotide frequencies) and phylogenetic descriptors (clade memberships), and blue curves from AAC after a normalization for oligonucleotide frequencies in noncoding DNA and for phylogenetic relatedness (residuals from regression of AAC on these features). AUROC scores are given in plot legends, where 1.0 indicates perfect performance, and 0.5 random guessing (shown as the diagonal line). Predictions in the ROC curves are from an SVM classifier, in 10-fold cross-validation. TPR, true positive rate; FPR, false positive rate. More environments shown in supplementary figure S4, Supplementary Material online.
Fig. 4.—
Composition of noncoding DNA in 49 fungal genomes is highly predictive of the corresponding proteome composition. (A) Explained variance (as squared Pearson correlation coefficient, R²) in amino acid usage of proteomes in a multiple regression against different sets of features; obtained by considering only the G + C content (blue bars), and by progressively including also the dinucleotide frequencies (red), the trinucleotides (teal), and phylogenetic groups (purple). Error bars are standard deviations from ten runs of cross-validation. (B) The median variance explained using the same sets of features over all 20 amino acids. (C) Cross-validation ROC curves describing the accuracy of discrimination of 13 thermophilic fungi by their AAC (orange) or by the genome composition-normalized AAC (the “AAC residuals,” blue). Inlaid numbers are AUROC scores.
Fig. 5.—
Lack of a particular environment-associated signal in the AAC of highly expressed proteins. (A) The RMSEs in predicting the frequencies of each amino acid from the composition of noncoding DNA (G + C, di- and trinucleotide content) and phylogenetic relatedness (clade membership) of organisms. RMSEs are compared for lowly versus highly expressed genes across all organisms. (B) Binned and pooled ROC curves for classifying the organisms by various environmental preferences from AAC, after having factored out the composition of noncoding DNA and phylogeny. ROC curves shown separately for classification only from highly expressed or only from lowly expressed genes. Full ROC curves for individual environments shown in supplementary figure S5, Supplementary Material online. Average and 95% CI of AUROC scores inlaid on plots.
Fig. 6.—
Differences in environment-specific trends in dinucleotide composition of 1st, 2nd, and 3rd codon sites in protein-coding genes. Shifts in G + C and dinucleotide frequencies between thermophilic and nonthermophilic (A), halophilic and nonhalophilic (B), strictly anaerobic and aerotolerant (C), and psychrophilic and nonpsychrophilic (D) organisms at different codon positions. Bars show AUROC scores, a measure of separability of two distributions by the given feature, where 0.5 signifies maximal overlap, and most extreme values (0 or 1) indicate complete separation of, for example, thermophiles and mesophiles by the frequency of given dinucleotide; values less than 0.5 and greater than 0.5 here indicate opposite directions of the shift. Error bars are 95% CI of the AUROC. Asterisks show significant differences in the environment-associated shifts between codon positions at less than 10% FDR.
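The use of AUROC as a separability measure for a single feature can be illustrated with a short sketch. It relies on the standard equivalence between AUROC and the normalized Mann-Whitney U statistic: the probability that a randomly chosen member of one group has a higher feature value than a randomly chosen member of the other, with ties counted as one half. The function name and the pairwise formulation are assumptions for illustration, not the authors' implementation.

```python
def auroc(positives, negatives):
    """AUROC of one feature separating two groups (pairwise formulation).

    Returns 1.0 when every positive exceeds every negative, 0.0 when the
    reverse holds, and values near 0.5 when the distributions overlap.
    """
    if not positives or not negatives:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))
```

Here `positives` would hold, for example, the frequency of one dinucleotide in thermophiles and `negatives` the same frequency in nonthermophiles. The direct pairwise loop is quadratic in group size; a rank-based computation gives the same value for larger data sets.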
Fig. 7.—
Distributions of selected dinucleotide frequencies at 1st, 2nd, and 3rd codon positions of protein-coding genes. (A–D) Ellipses show nine-number summaries of distributions, with borders indicating (in the increasing intensity of coloration) the minimum–maximum, 1st–7th octile, 2nd–6th octile, and 3rd–5th octile. Dinucleotide frequencies are normalized to the expected frequency given the G + C content. Plotted separately for thermophiles (A), halophiles (B), aerotolerant organisms (C), and psychrophiles (D). Letters in center of ellipse denote the environmental preference (t, thermophile; h, halophile; a, aerotolerant; p, psychrophile), and the number indicates the 1st, 2nd, or 3rd codon position.
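The two quantities in this caption, a G + C-normalized dinucleotide frequency and a nine-number summary, can be sketched as below. The expected frequency is modeled here as the product of base probabilities with G = C = gc/2 and A = T = (1 - gc)/2; that model, the function names, and the "inclusive" quantile method are assumptions, since the caption does not state how the authors computed them. The sequence is assumed to contain only A, C, G, and T.

```python
from statistics import quantiles


def gc_normalized_dinucleotide(seq, dinuc):
    """Observed/expected ratio of one dinucleotide given the G + C content.

    Assumed null model: bases are independent, with G = C = gc/2 and
    A = T = (1 - gc)/2.
    """
    seq = seq.upper()
    dinuc = dinuc.upper()
    windows = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not windows:
        raise ValueError("sequence must contain at least two bases")
    observed = windows.count(dinuc) / len(windows)
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    base_p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    expected = base_p[dinuc[0]] * base_p[dinuc[1]]
    if expected == 0:
        raise ValueError("expected frequency is zero for this composition")
    return observed / expected


def nine_number_summary(values):
    """Minimum, the seven octiles, and maximum of a sample."""
    octiles = quantiles(values, n=8, method="inclusive")
    return [min(values)] + octiles + [max(values)]
```

A ratio above 1 means the dinucleotide occurs more often than the G + C content alone would predict, and a ratio below 1 means it is depleted.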
