Proc Natl Acad Sci U S A. 2025 Jan 7;122(1):e2410003121.
doi: 10.1073/pnas.2410003121. Epub 2024 Dec 31.

Predicting gene sequences with AI to study codon usage patterns

Tomer Sidi et al. Proc Natl Acad Sci U S A. 2025.

Abstract

Selective pressure acts on codon usage, optimizing multiple, overlapping signals that are only partially understood. We trained AI models to predict a protein's codons from its amino acid sequence in the eukaryotes Saccharomyces cerevisiae and Schizosaccharomyces pombe and the bacteria Escherichia coli and Bacillus subtilis, to study the extent to which patterns in naturally occurring codons can be learned to improve predictions. We trained our models on a subset of the proteins and evaluated their predictions on large, separate sets of proteins of varying lengths and expression levels. Our models significantly outperformed naïve frequency-based approaches, demonstrating that there are learnable dependencies in evolutionarily selected codon usage. The prediction accuracy advantage of our models is greater for highly expressed genes and greater in bacteria than in eukaryotes, supporting the hypothesis of a monotonic relationship between the selective pressure for complex codon patterns and effective population size. In S. cerevisiae and the bacteria, our models were more accurate for longer proteins, suggesting that the learned patterns may be related to cotranslational folding. Gene functionality and conservation were also important determinants of model performance. Finally, we showed that using information encoded in homologous proteins has only a minor effect on prediction accuracy, perhaps due to complex codon-usage codes in genes undergoing rapid evolution. Our study, employing contemporary AI methods, offers a unique perspective and a deep-learning-based prediction tool for evolutionarily selected codons. We hope these will be useful for optimizing codon usage in endogenous and heterologous proteins.
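
As a concrete reference point, the naïve frequency-based baseline the trained models are compared against can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes a standard genetic-code lookup (codon_to_aa), and all names are ours.

    from collections import Counter, defaultdict

    def build_frequency_table(coding_sequences, codon_to_aa):
        # Count, for each amino acid, how often each synonymous codon is used.
        counts = defaultdict(Counter)
        for seq in coding_sequences:
            for i in range(0, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                counts[codon_to_aa[codon]][codon] += 1
        return counts

    def predict_codons(protein, counts):
        # Naive baseline: always emit the most frequent codon for each amino acid.
        return [counts[aa].most_common(1)[0][0] for aa in protein]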

Keywords: codon AI model; codons prediction; mimicking codons.


Conflict of interest statement

Competing interests statement: The authors declare no competing interest.

Figures

Fig. 1.
Strategy to learn codon usage patterns in S. cerevisiae, S. pombe, E. coli, and B. subtilis. (A) Our dataset includes protein amino acid sequence data from the four organisms and BLAST-identified protein alignments. Following standard practices in AI, we split our data into training and test sets (after clustering the proteins with CD-HIT and ensuring that closely related proteins were in either the training or the test set; see Methods for details). (B) To support both the masking mode and the mimicking mode, the input format has two sequences, with each sequence preceded by its source organism. In the cartoon representation, codons are shown by solid-colored boxes and their corresponding amino acids by hollow boxes. In masking mode, both sequences are the same, and the input is a sequence of codons in which either 30% or 100% of the positions are masked. In mimicking mode, the first sequence is the codons of an orthologous protein, and the second sequence is a sequence of codons in which 30% or 100% of the positions are masked. (C) The training set was used to train mBART models with varying window sizes (10, 30, 50, and 75 codons). First (1), we pre-trained the models with 30% masking and mimicking. Next (2), we fine-tuned the model with the 100% masked sequences to generate the fine-tuned masking model. Third (3), we further fine-tuned the model with 100% masking and mimicking data to generate the fine-tuned mimicking model. (D) During inference, there are pre- and post-processing steps: For each protein in the test set, all sliding windows corresponding to the model window size were considered, and each sequence of codons was fully masked. The cartoon example shows predictions in masking mode for a sliding window of 10 codons in an S. cerevisiae protein. The network softmax output predicts, for each amino acid, a distribution over its possible codons. These predictions are combined to yield the final codon prediction for the sequence, and we measured the accuracy of the prediction with respect to the evolutionarily selected codon sequence.
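
The pre- and post-processing described in (D) suggests an inference loop of roughly the following shape. This is a sketch under assumptions, not the published implementation: model_predict stands in for the mBART forward pass and is hypothetical, and averaging the overlapping softmax outputs is our reading of "these predictions are combined."

    import numpy as np

    def predict_sequence(protein, window, model_predict, n_codons=64):
        # Slide a fully masked window over the protein, accumulate the model's
        # per-position codon distributions, and average the overlaps.
        L = len(protein)
        scores = np.zeros((L, n_codons))
        hits = np.zeros(L)
        for start in range(max(L - window, 0) + 1):
            segment = protein[start:start + window]
            probs = model_predict(segment)  # shape: (len(segment), n_codons)
            scores[start:start + len(segment)] += probs
            hits[start:start + len(segment)] += 1
        # Average over overlapping windows, then take the most probable codon.
        return (scores / hits[:, None]).argmax(axis=1)
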
Fig. 2.
The codon prediction accuracies for the test set proteins with inference in masking mode show that mBART-trained models predict codons better than the frequency-based models; the model with a 30-codon window is generally the top performer. Prediction accuracies for proteins in the test sets of S. cerevisiae, S. pombe, E. coli, and B. subtilis are plotted vs. the percentile ranking of expression. The average accuracies for proteins for which expression has not been measured are shown as solid horizontal lines. Data were smoothed with a Gaussian filter and a window size of 50 proteins. The mBART masking models with 10, 30, 50, and 75 codon windows are shown in red, blue, yellow, and green, respectively. The frequency-based model accuracies calculated on all data are shown in cyan, frequency-based model accuracies calculated on the top 10% of proteins by expression level are shown in magenta, and the bigram frequency-based model accuracies are shown in black. In all four organisms, predictions improve for more highly expressed proteins, and the improvement in accuracy is most pronounced for the bacterial proteins. That our models outperform the frequency-based baselines demonstrates an evolutionary pattern of long-range dependencies among codons that is pronounced enough for the AI models considered here to learn.
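
The smoothing used for these curves (and those of Figs. 3-5) can be reproduced along the following lines; the mapping from the caption's 50-protein window to a Gaussian sigma is our assumption.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def smooth_curve(per_protein_accuracy, window=50):
        # Values are sorted by expression percentile; sigma is chosen so that
        # most of the Gaussian mass falls within the ~50-protein window
        # (an assumption; the caption does not give the exact sigma).
        return gaussian_filter1d(np.asarray(per_protein_accuracy, dtype=float),
                                 sigma=window / 4)
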
Fig. 3.
Calculated perplexities for masking-mode mBART predictions are lower than those of the frequency-based models. Perplexity, the exponentiated average of the cross-entropy loss, is plotted vs. the percentile ranking of expression. The average perplexities for proteins with no measured expression levels are shown by solid horizontal lines. The data were smoothed with a Gaussian filter and a window of 50 proteins. These graphs complement the accuracies shown in Fig. 2: The mBART models perform better (lower perplexity) than the frequency-based models, the model with a window size of 30 codons is a top performer, and the perplexities are lower for more highly expressed proteins. This provides further support for an evolutionary pattern of long-range dependencies among codons that is pronounced enough for the AI models considered here to learn.
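
In terms of the per-position codon probabilities produced by the model, the perplexity defined in the caption can be computed as below (a sketch; array names are ours).

    import numpy as np

    def perplexity(true_codon_ids, probs):
        # probs has shape (n_positions, n_codons); perplexity is the
        # exponentiated mean cross-entropy of the true codons.
        log_p = np.log(probs[np.arange(len(true_codon_ids)), true_codon_ids])
        return float(np.exp(-log_p.mean()))
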
Fig. 4.
Codon prediction accuracies for the test set proteins with inference in masking mode do not have a clear dependency on length. Prediction accuracies are plotted vs. length percentile ranking, with the data smoothed with a Gaussian filter (window size of 50 proteins). There is no simple relationship between the accuracy of any of the models and protein length: In S. cerevisiae and E. coli, the mBART models are more accurate for longer proteins; in B. subtilis, the models are more accurate for shorter proteins; and in S. pombe, there is no clear dependency between the two.
Fig. 5.
mBART mimicking-mode inference accuracy is on par with masking-mode inference. Prediction accuracies are plotted vs. amino acid percent identity to the orthologous protein. For the dataset of homologous pairs, our goal is to predict the codon encoding of one of the homologues. There are two modes of using the models. In the first mode, denoted "masking," the input is (two identical copies of) the masked amino acid sequence of the target protein. In the second mode, denoted "mimicking," the input is the masked amino acids of the target protein and the aligned codons of its homologue. We show masking-mode predictions by two models that were fine-tuned on the masking task (window sizes of 30 and 50, in blue and yellow, respectively). These are the same models and inference mode shown in Fig. 2, only here they are evaluated on the protein segments found in the alignment dataset. We also further fine-tuned these models on the masking and mimicking task, and the accuracies of these FT models in masking-mode inference are shown in green and red, respectively. Alternatively, we used these same models in mimicking-mode inference, and the accuracies of these predictions are shown in maroon and magenta. For comparison, we show the frequency-based model on this dataset in cyan and the prediction accuracy of a frequency-based mimicking model in black. Data were smoothed with a Gaussian filter (window size of 50 proteins). The signal from codons in the orthologous proteins is not strong enough for our AI models to exploit it to improve their predictions, and the performance of mimicking-mode inference (in maroon/magenta) is on par with that of masking-mode inference (in green and red), with a slight advantage for proteins with very close orthologs in the bacterial organisms.
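
The caption does not spell out the rule behind the frequency-based mimicking baseline, so the following is only one plausible reading, stated here as an assumption: copy the aligned ortholog's codon when it is synonymous with the target amino acid, otherwise fall back to the per-amino-acid frequency prediction (using a counts table such as the one sketched after the Abstract).

    def mimic_baseline(target_aas, ortholog_codons, codon_to_aa, counts):
        # Assumed rule (not confirmed by the caption): mimic the ortholog's
        # codon when it encodes the same amino acid; otherwise use the most
        # frequent codon for that amino acid. Gapped positions pass None.
        predicted = []
        for aa, codon in zip(target_aas, ortholog_codons):
            if codon is not None and codon_to_aa.get(codon) == aa:
                predicted.append(codon)
            else:
                predicted.append(counts[aa].most_common(1)[0][0])
        return predicted
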
Fig. 6.
Prediction accuracy is high for proteins in certain GO functional groups. Differences between the masking-task-FT mBART model (30-codon window) predictions and the frequency-based baseline are shown for S. cerevisiae test-set proteins grouped by GO molecular function terms. Terms were sorted in descending order by the mean difference. Terms for which the P-value is <0.05 in a Mann-Whitney rank sum test of the functionally grouped set vs. the rest of the test-set proteins are highlighted in red. The mBART model outperforms the naïve approach most markedly for proteins with the function "structural constituent of ribosome," nucleic acid binding functions, and sub-functions of catalytic activity.
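
The per-term significance test described here is a standard rank-sum comparison; a minimal sketch with SciPy (variable names are ours):

    from scipy.stats import mannwhitneyu

    def go_term_test(diffs_in_term, diffs_in_rest):
        # Compare the per-protein accuracy differences (mBART minus the
        # frequency baseline) for one GO term against the remaining test set.
        statistic, p_value = mannwhitneyu(diffs_in_term, diffs_in_rest,
                                          alternative="two-sided")
        return statistic, p_value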

References

    1. Ikemura T., Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: A proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151, 389–409 (1981). - PubMed
    1. Frumkin I., et al. , Codon usage of highly expressed genes affects proteome-wide translation efficiency. Proc. Natl. Acad. Sci. U.S.A. 115, E4940–E4949 (2018). - PMC - PubMed
    1. Samatova E., et al. , Translational control by ribosome pausing in bacteria: How a non-uniform pace of translation affects protein production and folding. Front. Microbiol. 11, 619430 (2020), 10.3389/fmicb.2020.619430. - DOI - PMC - PubMed
    1. Rodnina M. V., The ribosome in action: Tuning of translational efficiency and protein folding. Protein Sci. 25, 1390–1406 (2016). - PMC - PubMed
    1. Hanson G., Coller J., Codon optimality, bias and usage in translation and mRNA decay. Nat. Rev. Mol. Cell Biol. 19, 20–30 (2018). - PMC - PubMed
