Wellcome Open Res. 2025 Jul 9:10:173.
doi: 10.12688/wellcomeopenres.23817.2. eCollection 2025.

Decoding post-transcriptional gene expression controls in trypanosomatids using machine learning

Michele Tinti et al. Wellcome Open Res. 2025.

Abstract

Background: We recently described a pervasive cis-regulatory role for sequences in Trypanosoma brucei mRNA untranslated regions (UTRs). Specifically, increased translation efficiency (TE) was associated with the dosage and density of A-rich tracts. This finding raised three related questions: (1) What relative contributions do UTRs and codon usage bias make to TE in T. brucei? (2) What relative contributions do these sequences make to mRNA steady-state levels in T. brucei? (3) Do these sequences make substantial contributions to TE and/or mRNA steady-state levels in the related parasitic trypanosomatids, T. cruzi and Leishmania?

Methods: To address these questions, we applied machine learning to analyze existing transcriptome, TE, and proteomics data.

Results: Our predictions indicate that both UTRs and codon usage bias impact gene expression in all three trypanosomatids, but with substantial differences. In T. brucei, TE is primarily correlated with longer A-rich and C-poor UTRs. The situation is similar in T. cruzi, but codon usage bias makes a greater contribution to TE. In Leishmania, median TE is higher and is more strongly correlated with longer (A)U-rich UTRs and with codon usage bias. Codon usage bias has a major impact on mRNA abundance in all three trypanosomatids, while analysis of T. brucei proteomics data yielded results consistent with the view that this is due to differential translation elongation rates.

Conclusions: Taken together, our findings indicate that gene expression control in trypanosomatids operates primarily at the point of translation, which is impacted by both UTRs and codon usage. We suggest a model whereby UTRs control the rate of translation initiation, while favoured codons increase the rate of translation elongation, thereby reducing mRNA turnover.

Keywords: Codon Bias; Leishmania; Machine Learning; Translation Efficiency; Trypanosoma; UTRs.

Plain language summary

We study how three parasites (Trypanosoma brucei, Trypanosoma cruzi, and Leishmania) control gene expression. Using computer analyses, we looked at two key factors: alternative codons, which are translated to incorporate the same amino acid in a protein, and untranslated regions (UTRs). Both can affect messenger RNA stability or the rate at which messenger RNA is translated to produce protein. We found that the impact of codons and UTRs is primarily at the point of translation. Codon usage bias likely impacts mRNA stability by increasing the rate of translation. Understanding these regulatory processes will reveal how these parasites and related cells function, in terms of expressing thousands of different proteins at appropriate levels.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. Differences in expression and base-composition profiles.
A The violin plot shows TE value distributions. TE was calculated as the ratio between ribosome footprint TPM and total TPM (ribosome footprint TPM + mRNA TPM), resulting in values between 0 and 1. T. brucei (green, n = 6923), T. cruzi (orange, n = 6800), and L. donovani (blue, n = 6026). B The violin plot shows the distribution of mRNA abundance values. C The violin plot shows the distribution of mRNA lengths. D The violin plot shows the frequency distribution of each nucleotide (A, T, G, C) in 3'-UTRs of T. brucei, T. cruzi, and L. donovani. E The violin plot shows the nucleotide frequencies at the third codon position in coding sequences of T. brucei, T. cruzi, and L. donovani. For all violin plots, the internal black bars indicate median and interquartile range.
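The TE definition in the Figure 1 caption can be written out as a short calculation. This is a minimal sketch, not the authors' code: the function name `translation_efficiency` and the values in `example_tpm` are hypothetical and chosen only for illustration.

```python
def translation_efficiency(footprint_tpm: float, mrna_tpm: float) -> float:
    """Return footprint TPM / (footprint TPM + mRNA TPM), a value in [0, 1]."""
    if footprint_tpm < 0 or mrna_tpm < 0:
        raise ValueError("TPM values must be non-negative")
    total_tpm = footprint_tpm + mrna_tpm
    if total_tpm == 0:
        raise ValueError("TE is undefined when both TPM values are zero")
    return footprint_tpm / total_tpm


# Hypothetical (ribosome footprint TPM, mRNA TPM) pairs, not data from the paper.
example_tpm = {"gene_a": (30.0, 10.0), "gene_b": (5.0, 5.0)}

te_values = {
    gene: translation_efficiency(footprint, mrna)
    for gene, (footprint, mrna) in example_tpm.items()
}
```

Because the denominator includes the footprint signal itself, a transcript with equal footprint and mRNA TPM has a TE of 0.5 rather than 1, which is what keeps the values bounded between 0 and 1.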
Figure 2. Machine learning models reveal determinants of translation efficiency.
Model performance and feature importance analysis for A T. brucei. B T. cruzi. C L. donovani. The scatter plots on the left show predicted versus measured TE values. The Spearman's rank correlation coefficient (Sp) is reported for each model. The beeswarm plots on the right show SHapley Additive exPlanations (SHAP) values; red indicates high feature value and blue indicates low feature value, with the magnitude of SHAP values indicating the strength and direction of each feature's impact on model predictions. Features are ordered by their absolute SHAP values (the sum of each point's absolute value), with the most important features at the top. Dots are jittered along the y-axis to illustrate the distribution of SHAP values. The lengths of CDS sequences and predicted 3' untranslated regions (UTRs) are captured by the 'CDS_seq_len' and '3utr_seq_len' features, respectively. Features extracted from UTR regions are colour-coded in orange, with '3utr_' indicating 3' UTR features and '5utr_' indicating 5' UTR features. Simple base frequencies are denoted by the prefix 'c_' while tract frequencies use 'ct_'. Tract features are suffixed with '_m0', '_m1', or '_m2' to indicate the number of allowed mismatches (0, 1, or 2, respectively) between consecutive nucleotide stretches. Individual nucleotide frequencies at the third position of codons in coding sequences are denoted by the prefix 'third_base'. The frequency of non-optimal codons in coding sequences is represented by the 'non_opt_codon' feature.
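One of the feature families named in the Figure 2 caption, nucleotide frequency at the third codon position, can be sketched as follows. This is an illustrative assumption about how such a feature might be computed: the function name `third_base_frequencies` is hypothetical, and the handling of stop codons, partial trailing codons, and ambiguous bases is a choice made here, not something the caption specifies.

```python
from collections import Counter


def third_base_frequencies(cds: str) -> dict[str, float]:
    """Return the fraction of codons whose third position is A, C, G, or T."""
    sequence = cds.upper().replace("U", "T")  # accept RNA or DNA input
    usable_length = len(sequence) - len(sequence) % 3  # drop a partial last codon
    third_positions = sequence[2:usable_length:3]
    if not third_positions:
        raise ValueError("sequence must contain at least one complete codon")
    counts = Counter(third_positions)
    # Ambiguous bases such as N stay in the denominator but are not reported.
    return {base: counts.get(base, 0) / len(third_positions) for base in "ACGT"}


# Toy coding sequence (ATG GCC AAA TAA), not taken from any of the genomes.
toy_frequencies = third_base_frequencies("ATGGCCAAATAA")
```

For the toy sequence the third positions are G, C, A, and A, so the A frequency is 0.5 and the T frequency is 0.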
Figure 3. Machine learning prediction and feature importance analysis for the insect stage of T. brucei.
Model performance (left) and feature importance analysis (right) for insect stage T. brucei. A TE. The scatter plot on the left shows predicted versus measured TE values. B mRNA abundance. The scatter plot on the left shows predicted versus measured mRNA abundance values (log10 transformed). Other details as in Figure 2.
Figure 4. Interaction between 3'-UTR sequence length and A-rich tracts in T. brucei.
Feature interaction analysis. The SHapley Additive exPlanations (SHAP) interaction plots illustrate the relationship between 3'-UTR sequence length (3utr_seq_len) and A tract frequency (3utr_ct_A_m2). A The y-axis represents the SHAP values for 3'-UTR sequence length, indicating its contribution to model predictions. The x-axis represents 3'-UTR sequence length in log scale. Points are coloured according to the value of the top interacting feature (3utr_ct_A_m2), with red indicating high values and blue indicating low values. B The y-axis represents the SHAP values for A tract frequency (3utr_ct_A_m2), indicating the contribution to model predictions. The x-axis represents A tract frequency values. Points are coloured according to the value of the top interacting feature (3utr_seq_len), with red indicating high values and blue indicating low values. Other details as in Figure 2.
Figure 5. Machine learning models reveal determinants of mRNA abundance.
Model performance and feature importance analysis for A T. brucei. B T. cruzi. C L. donovani. The scatter plots on the left show predicted versus measured TPM mRNA abundance values (log10 transformed). Other details as in Figure 2.
Figure 6. The relative contributions of UTRs and codons to expression control.
A The violin plot shows the distribution of Spearman's rank correlation coefficients between predicted and observed translation efficiency (TE) in T. brucei, T. cruzi, and L. donovani. Each distribution represents 100 iterations of machine learning models, where each iteration used a random 70/30 train-test split. Models were trained using three distinct feature sets: 3' UTR-derived features only (UTRs), codon-derived features only (Codons), or a combination of both UTR and codon features (Combined). B A similar analysis for mRNA abundance predictions, showing the distribution of Spearman's rank correlation coefficients between predicted and observed mRNA levels across 100 iterations with the same feature sets and train-test split methodology as in A.
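The evaluation scheme described in the Figure 6 caption (100 random 70/30 train-test splits, each scored by Spearman's rank correlation between predicted and observed values) can be sketched with the standard library alone. The caption does not name the model type, so a single-feature least-squares line is used here purely as a stand-in; every function name and the toy data are hypothetical.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sum((a - mx) ** 2 for a in x)
    spread_y = sum((b - my) ** 2 for b in y)
    return covariance / (spread_x * spread_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fit_line(x, y):
    """Least-squares slope and intercept (stand-in for the real model)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def repeated_split_spearman(x, y, iterations=100, test_fraction=0.3, seed=0):
    """Score a freshly fitted model on a new random held-out split each iteration."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    n_test = round(len(x) * test_fraction)
    scores = []
    for _ in range(iterations):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        predicted = [slope * x[i] + intercept for i in test]
        scores.append(spearman(predicted, [y[i] for i in test]))
    return scores


# Toy data: one synthetic feature and a noisy linear response.
toy_rng = random.Random(1)
toy_feature = [toy_rng.random() for _ in range(200)]
toy_response = [2 * value + toy_rng.gauss(0, 0.1) for value in toy_feature]
toy_scores = repeated_split_spearman(toy_feature, toy_response)
```

Refitting on a new random split in every iteration yields a distribution of coefficients rather than a single value, which is the quantity each violin in the figure summarises. Running the same loop with different feature sets would give the per-set distributions the caption compares.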
Figure 7. Machine learning models reveal determinants of protein abundance.
Model performance (left) and feature importance analysis (right) for bloodstream stage T. brucei. The scatter plot on the left shows predicted versus measured intensity-based absolute quantification (iBAQ) values (log10 transformed). Other details as in Figure 2.
