2024 Jul 30:23:3050-3064.
doi: 10.1016/j.csbj.2024.07.020. eCollection 2024 Dec.

A statistical-physics approach for codon usage optimisation

David Luna-Cerralbo et al. Comput Struct Biotechnol J. 2024.

Abstract

The concept of "codon optimisation" involves adjusting the coding sequence of a target protein to account for the inherent codon preferences of a host species and maximise protein expression in that species. However, there is still a lack of consensus on the most effective approach to achieve optimal results. Existing methods typically depend on heuristic combinations of different variables, leaving the user with the final choice of the sequence hit. In this study, we propose a new statistical-physics model for codon optimisation. This model, called the Nearest-Neighbour interaction (NN) model, links the probability of any given codon sequence to the "interactions" between neighbouring codons. We used the model to design codon sequences for different proteins of interest, and we compared our sequences with the predictions of some commercial tools. In order to assess the importance of the pair interactions, we additionally compared the NN model with a simpler method (Ind) that disregards interactions. It was observed that the NN method yielded similar Codon Adaptation Index (CAI) values to those obtained by other commercial algorithms, despite the fact that CAI was not explicitly considered in the algorithm. By utilising both the NN and Ind methods to optimise the reporter protein luciferase, and then analysing the translation performance in human cell lines and in a mouse model, we found that the NN approach yielded the highest protein expression in vivo. Consequently, we propose that the NN model may prove advantageous in biotechnological applications, such as heterologous protein expression or mRNA-based therapies.

Keywords: Codon-optimisation; Nearest-neighbour interaction; Protein expression; Statistical-physics model; mRNA-vaccine.

Conflict of interest statement

Juan Martínez-Oliván, Esther Broset, Susana Adame-Pérez, Verónica Lampaya, Ana Larraga, Teresa Alejo and Irene Blasco-Machín are employees at Certest Pharma Department, Certest Biotec S.L.

Figures

Fig. 1
Scheme of the construction of the models. Workflow of the Nearest-Neighbour interaction (NN) model compared to the Individual codons (Ind) model. Reference nucleotide coding sequences (CDS) from the human genome were extracted and matched to their corresponding amino acid sequences. For the Individual model (left panel), the frequency of each codon was used to fit its energy, with β = 1, from Eq. (10). For the NN model (right panel), the likelihood of the natural codon sequence database was maximised at β = 1 to determine the model parameters. For the case of luciferase as a model protein, codon sequences are generated with the Ind and NN models at different values of β. Such sequences are subjected to a final filtering process based on various criteria, such as sequence restriction and the absence of long loops. The aim of this refinement step is to obtain the most suitable optimised sequence for the target organism.
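The Individual model described in this caption can be illustrated with a minimal Python sketch. It assumes that the energy of a codon is the negative logarithm of its frequency among synonymous codons, so that β = 1 reproduces the observed frequencies; the caption refers to Eq. (10), which is not reproduced here, so this functional form is an assumption. The codon counts and the names `CODON_COUNTS`, `fit_energies`, `codon_probabilities` and `sample_sequence` are hypothetical.

```python
import math
import random

# Hypothetical codon counts for two amino acids (illustrative numbers only,
# not taken from the paper or from any real codon usage table).
CODON_COUNTS = {
    "K": {"AAA": 40, "AAG": 60},
    "F": {"TTT": 45, "TTC": 55},
}


def fit_energies(counts):
    """Assumed fit: E(codon) = -ln(frequency among synonymous codons)."""
    energies = {}
    for aa, table in counts.items():
        total = sum(table.values())
        energies[aa] = {codon: -math.log(n / total) for codon, n in table.items()}
    return energies


def codon_probabilities(energies_for_aa, beta):
    """Boltzmann weights exp(-beta * E), normalised over synonymous codons."""
    e_min = min(energies_for_aa.values())  # shift energies for numerical stability
    weights = {c: math.exp(-beta * (e - e_min)) for c, e in energies_for_aa.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


def sample_sequence(protein, energies, beta, rng):
    """Draw one codon per amino acid, independently of its neighbours."""
    codons = []
    for aa in protein:
        probs = codon_probabilities(energies[aa], beta)
        names = list(probs)
        codons.append(rng.choices(names, weights=[probs[c] for c in names])[0])
    return "".join(codons)


energies = fit_energies(CODON_COUNTS)
p_beta_1 = codon_probabilities(energies["K"], beta=1.0)
p_beta_600 = codon_probabilities(energies["K"], beta=600.0)
sampled = sample_sequence("KFK", energies, beta=3.0, rng=random.Random(0))
```

Under this assumption, β = 1 returns the input frequencies, while a large β concentrates nearly all probability on the most frequent codon, which is the limit the later captions associate with EMBOSS.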
Fig. 2
Codon usage: NCBI vs Ind and NN algorithms with different β values. Human-observed (NCBI) codon usage, and predictions with the NN and Ind models, at β = 1 (the “learning” inverse temperature, see Methods 5.2.3) and at β = ∞ (minimum energy solution). The NCBI codon usage line is derived by extracting codon usage information from the 116,487 sequences available in the NCBI database. For the NN and Ind designs at β = 1, codon usage is determined by considering a randomly selected dataset of 10,000 protein sequences from NCBI and selecting one codon sequence for each of them from a thermalised Monte Carlo run, according to the probabilities in Eqs. (2) and (8), respectively. In the case of β = ∞, another set of 10,000 sequences is randomly selected from NCBI, and the sequences with the minimum energy are determined for both the NN and Ind models, see Methods 5.3.
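For a chain with nearest-neighbour interactions, a minimum-energy (β = ∞) sequence can be found exactly by dynamic programming. The Python sketch below is one standard way to do this and is not necessarily the procedure of Methods 5.3. It assumes an energy of the form "sum of single-codon terms plus sum of pair terms between adjacent codons"; the names `h`, `J`, `SYNONYMOUS` and `min_energy_sequence`, and all numerical values, are hypothetical.

```python
def min_energy_sequence(protein, synonymous, h, J):
    """Return (energy, codons) minimising sum_i h[c_i] + sum_i J[(c_i, c_{i+1})].

    Pairs missing from J are treated as having zero interaction energy.
    """
    # best[c] = (lowest energy of any prefix ending in codon c, that prefix)
    best = {c: (h[c], [c]) for c in synonymous[protein[0]]}
    for aa in protein[1:]:
        new_best = {}
        for c in synonymous[aa]:
            # Pick the best predecessor codon, including the pair term.
            prev_energy, prev_path = min(
                ((e + J.get((p, c), 0.0), path) for p, (e, path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[c] = (prev_energy + h[c], prev_path + [c])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


# Hypothetical toy parameters (not fitted to any database).
SYNONYMOUS = {"K": ["AAA", "AAG"], "F": ["TTT", "TTC"]}
h = {"AAA": 0.9, "AAG": 0.5, "TTT": 0.8, "TTC": 0.6}
J = {("AAG", "TTC"): 1.0, ("AAA", "TTC"): -0.5}

energy, codons = min_energy_sequence("KF", SYNONYMOUS, h, J)
```

In this toy case the codons with the lowest single-codon energies are AAG and TTC, but the pair term penalises that combination, so the chain minimum is AAA followed by TTC. This is the kind of difference between the Ind and NN designs that the pair interactions are meant to capture.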
Fig. 3
Codon Adaptation Index (CAI, left panel) and Codon Pair Bias (CPB, right panel) for the codon sequences proposed by different commercial tools for the 28 proteins reported in Table 1, as well as for the wild type protein and for our NN predictions (at β = 1 and β = ∞). For comparison, the average values (dashed lines) and standard deviations (grey areas) for codon sequences in the human NCBI database are also presented. These correspond to proteins of the same length (±2 codons) as those considered in Table 1, because both indicators depend on sequence length. In the case of β = 1, a database of 1000 codon sequences is generated for each protein, and the mean and standard deviation (represented by an error bar) are reported. The colour code is the same in the two panels.
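As a reminder of what the CAI measures, the sketch below uses the commonly cited definition: the geometric mean, over the codons of a sequence, of each codon's frequency relative to the most frequent synonymous codon. The paper's own definition is its Eq. (13), which is not reproduced here, so details may differ. The reference counts and the names `REFERENCE_COUNTS`, `relative_adaptiveness` and `cai` are hypothetical.

```python
import math

# Hypothetical reference counts (illustrative only); all counts must be positive.
REFERENCE_COUNTS = {
    "K": {"AAA": 40, "AAG": 60},
    "F": {"TTT": 45, "TTC": 55},
}


def relative_adaptiveness(counts):
    """w(codon) = count / count of the most frequent synonymous codon."""
    w = {}
    for table in counts.values():
        top = max(table.values())
        for codon, n in table.items():
            w[codon] = n / top
    return w


def cai(sequence, w):
    """Geometric mean of w over the codons of `sequence` that appear in w."""
    usable = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    scored = [c for c in codons if c in w]
    if not scored:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(math.log(w[c]) for c in scored) / len(scored))


w = relative_adaptiveness(REFERENCE_COUNTS)
cai_best = cai("AAGTTC", w)   # only most-frequent codons
cai_mixed = cai("AAATTC", w)  # one less frequent codon
```

A sequence built only from the most frequent codons scores 1 under this definition, and any less frequent codon lowers the score.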
Fig. 4
Relative Codon Bias (RCB, left panel) and Percentage of GC (Guanine and Cytosine) (GC, right panel) for the same sequences as in Fig. 3. The colour codes and abbreviations are the same as in Fig. 3.
Fig. 5
Temperature dependence of the different indicators for the protein luciferase, calculated with the two models. The averages of Codon Adaptation Index (CAI; Eq. (13)), Codon Pair Bias (CPB; Eq. (14)), Relative Codon Bias (RCB; Eq. (15)) and Relative Codon Pair Bias (RCPB; Eq. (16)) were calculated across databases consisting of 1000 sequences each, obtained as explained in Sec. 5.5. Error bars denote standard deviations. In each panel, a black dot corresponds to the value for the sequence proposed by the EMBOSS web server, and the vertical dashed lines indicate the β values finally chosen, with β = 600 representing β → ∞; see text.
Fig. 6
In vitro luciferase production from mRNA sequences optimised by the Ind and the NN models. Luciferase production in the HepG2 (upper panel) or HeLa (lower panel) cell line was measured in Relative Luminescence Units (RLU). The transfection reagent was a commercial cationic lipid in A) and a Lipid Nanoparticle in B). In all cases a final quantity of 100 ng of mRNA per well was used. Each bar represents the mean of at least 2 independent experiments with triplicates in each experiment, where each point represents the result for an independent well. Ind_1 and Ind_3 represent individual codon optimisations using β = 1 and β = 3, respectively, and EMBOSS corresponds to always using the most frequent codon. NN_3 represents the nearest-neighbour interaction model at β = 3, and NN_∞ is the sequence obtained with the NN model at β = 600.
Fig. 7
In vivo luciferase production from mRNA sequences optimised by the Ind and the NN models. Lipid Nanoparticles encapsulating the optimised mRNAs were used to inoculate 1 μg of total mRNA intramuscularly into each mouse. The luminescence yield was measured by total flux quantification in photons per second (p/s) at A) 4 hours and B) 24 hours post inoculation. Each bar represents the mean of two independent experiments, one using two mice per group and the other three mice per group. Ind_1 and Ind_3 represent individual codon optimisations using β = 1 and β = 3, respectively, and EMBOSS corresponds to always using the most frequent codon. NN_3 represents the nearest-neighbour interaction model using β = 3, and NN_∞ uses β = 600, which is treated as equivalent to β → ∞.
