Diverse intrinsic properties shape transcript stability and stabilization in Mycolicibacterium smegmatis

Huaming Sun¹, Diego A Vargas-Blanco², Ying Zhou², Catherine S Masiello², Jessica M Kelly¹,², Justin K Moy¹, Dmitry Korkin¹, Scarlet S Shell¹,²

Affiliations

¹ Program in Bioinformatics and Computational Biology, Worcester Polytechnic Institute, Worcester, MA 01609, USA.
² Department of Biology and Biotechnology, Worcester Polytechnic Institute, Worcester, MA 01609, USA.

PMID: 39498432
PMCID: PMC11532794
DOI: 10.1093/nargab/lqae147

Huaming Sun et al. NAR Genom Bioinform. 2024 Nov 4;6(4):lqae147. doi: 10.1093/nargab/lqae147. eCollection 2024 Sep.

Abstract

Mycobacteria regulate transcript degradation to facilitate adaptation to environmental stress. However, the mechanisms underlying this regulation are unknown. Here we sought to gain understanding of the mechanisms controlling mRNA stability by investigating the transcript properties associated with variance in transcript stability and stress-induced transcript stabilization. We measured mRNA half-lives transcriptome-wide in Mycolicibacterium smegmatis in log phase growth and hypoxia-induced growth arrest. The transcriptome was globally stabilized in response to hypoxia, but transcripts of essential genes were generally stabilized more than those of non-essential genes. We then developed machine learning models that enabled us to identify the non-linear collective effect of a compendium of transcript properties on transcript stability and stabilization. We identified properties that were more predictive of half-life in log phase as well as properties that were more predictive in hypoxia, and many of these varied between leadered and leaderless transcripts. In summary, we found that transcript properties are differentially associated with transcript stability depending on both the transcript type and the growth condition. Our results reveal the complex interplay between transcript features and microenvironment that shapes transcript stability in mycobacteria.

Figures

**Figure 1.**
Schematic of the framework to identify transcript properties that impact transcript stability in *M. smegmatis*. The framework was designed to reveal the transcript properties that were differentially associated with transcript stability depending on the transcript type and condition. Stage 1: Transcriptome-wide mRNA degradation profiles were collected after inhibition of transcription initiation with rifampicin (RIF) in log phase and hypoxia using RNAseq. Stage 2: In each condition, transcripts were classified into four groups according to their half-lives. Stages 3 and 4: A series of random forest classifiers were trained to classify transcripts into their assigned half-life class based on the values of a set of transcript properties (features), and identify the features important for these classifications.
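
Stages 1 and 2 can be illustrated with a minimal sketch. It assumes simple exponential decay after RIF addition, fits the decay rate by least squares on log abundance, and then bins transcripts into four classes at the half-life quartiles. The fitting approach, the function names and the toy values are illustrative assumptions, not the authors' procedure.

```python
import math
from statistics import quantiles

def half_life(times, abundances):
    """Half-life under an assumed exponential decay model.

    Fits ln(abundance) = a - k * t by ordinary least squares
    and returns ln(2) / k.
    """
    logs = [math.log(a) for a in abundances]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    sxx = sum((t - t_mean) ** 2 for t in times)
    decay_rate = -sxy / sxx
    return math.log(2) / decay_rate

def quartile_classes(half_lives):
    """Map each gene to class 1 (shortest half-lives) .. 4 (longest)."""
    cuts = quantiles(list(half_lives.values()), n=4)
    return {gene: 1 + sum(hl > c for c in cuts)
            for gene, hl in half_lives.items()}

# Hypothetical time course (minutes after RIF) with a true half-life of 2 min.
times = [0, 1, 2, 4]
abundances = [100 * 0.5 ** (t / 2.0) for t in times]
estimate = half_life(times, abundances)

# Hypothetical half-lives for eight genes.
toy_half_lives = {"g1": 0.5, "g2": 1.0, "g3": 1.5, "g4": 2.0,
                  "g5": 3.0, "g6": 4.0, "g7": 6.0, "g8": 8.0}
classes = quartile_classes(toy_half_lives)
```

The resulting class labels would then serve as the targets for the classifiers described in Stages 3 and 4.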

**Figure 2.**
Transcriptome-wide mRNA degradation profiles in *M. smegmatis*. (A) UMAP projection showing condition differences and temporal changes in global degradation profiles. Each dot represents an RNAseq library from which normalized mRNA abundance values for each gene were obtained. The same UMAP projection is shown in the two panels with different coloration of the dots. In the left panel the dots are colored according to condition (dark blue, log phase; light blue, hypoxia). In the right panel the dots are colored according to timepoint after adding RIF (black, early timepoints; successively lighter shades of gray, successively later timepoints). (B) Distributions of transcript half-lives in log phase and hypoxia. Distribution plots were made in R v4.3.2 using package ggbreak v0.1.2 (79). (C, D) Half-life distributions with classes defined by half-life quartiles in log phase and hypoxia. (E) Comparison of half-life class membership between log phase and hypoxia. (F) Distribution of fold changes in half-life, with classes defined by fold-change quartile. (G) Frequency of essential genes in each half-life class. Significance of enrichment and depletion of essential genes within each class was tested using a hypergeometric test with FDR correction (Materials and methods). *P.adjust* < 0.05 *, *P.adjust* < 0.01 **, *P.adjust* < 0.001 ***.
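
The enrichment test in panel G can be sketched as follows. This is a minimal illustration of an upper-tail hypergeometric test followed by Benjamini-Hochberg adjustment; the choice of Benjamini-Hochberg as the FDR procedure, the function names and the gene counts are assumptions, since the caption only states that a hypergeometric test with FDR correction was used. A depletion test would use the lower tail instead.

```python
from math import comb

def hypergeom_enrichment_p(k, N, K, n):
    """P(X >= k) when n genes are drawn from N genes, K of them essential."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return hits / total

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical data: 20 genes, 5 essential, four half-life classes of 5 genes.
essential_per_class = [4, 1, 0, 0]
pvals = [hypergeom_enrichment_p(k, N=20, K=5, n=5)
         for k in essential_per_class]
padj = benjamini_hochberg(pvals)
```

In this toy example only the first class would be called enriched for essential genes at an adjusted threshold of 0.05.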

**Figure 3.**
Non-linear combinations of diverse transcript properties specify half-life in *M. smegmatis*. (A) Summary of transcript features used for random forest classifiers. The features were grouped into six types and quantified for specific transcript regions. Numbers in square brackets indicate the number of features of each type selected by our feature selection process. Numbers in the colored boxes indicate the number of features of each type in each transcript region. Asterisks indicate cases where the total number of unique features is less than the sum of the numbers above because some features are classified as 5′ UTR-type features for leadered transcripts and translation-type features for leaderless transcripts (see Supplementary Table S2). (B) Comparisons of classifier performance to random prediction models. Random forest classifiers were trained separately for leadered and leaderless transcripts to predict stability class in three conditions using various feature sets. The combined feature sets were selected by the log phase model for each transcript type (see Materials and methods) and were used to train models in all three conditions. The 5′ UTR feature set includes both translation-related and non-translation-related features. The translation feature set includes log phase ribosome profiling. ΔF-score represents the difference in averaged F-score between random forest classifiers and random prediction estimators. Dots and bars represent mean and standard deviation of ΔF-scores for 10 repetitions of each model. The significance of the performance differences between random forest classifiers and random prediction estimators was tested using Nadeau and Bengio's corrected paired t-test (Materials and methods). (C–E) Comparisons of ΔF-scores between leadered and leaderless transcript random forest models in log phase, hypoxia, and fold change in hypoxia relative to log phase. For each condition, the combined feature sets were selected by the leaderless model and were used to train models of both transcript types. (F, G) Comparisons of ΔF-scores between log phase and hypoxia random forest models for leadered and leaderless transcripts. For each transcript type, the combined feature sets were selected by the log phase model and were used to train models of both log phase and hypoxia. The translation feature set excludes log phase ribosome profiling. The significance of the differences in model performance in (C–G) was tested using the Wilcoxon rank-sum test (Materials and methods). For all panels, P < 0.05 *, P < 0.01 **, P < 0.001 ***. Feature types with significantly different model performance between leadered and leaderless transcripts (C–E) or between log phase and hypoxia (F, G) are highlighted in red.
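
Nadeau and Bengio's corrected paired t-test inflates the variance of the per-repetition score differences to account for overlap between resampled training sets. The sketch below computes only the test statistic, using the standard form of the correction; the 80/20 split, the sample sizes and the ΔF-scores are hypothetical, and the p-value (from a t distribution with k − 1 degrees of freedom) is omitted because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

def corrected_resampled_t(diffs, n_train, n_test):
    """Nadeau-Bengio corrected t statistic for k repeated train/test splits.

    diffs: per-repetition score differences (for example ΔF-scores).
    The usual 1/k variance factor is replaced by 1/k + n_test/n_train.
    """
    k = len(diffs)
    return mean(diffs) / sqrt((1.0 / k + n_test / n_train) * variance(diffs))

def naive_paired_t(diffs):
    """Uncorrected paired t statistic, shown only for comparison."""
    return mean(diffs) / sqrt(variance(diffs) / len(diffs))

# Hypothetical ΔF-scores from 10 repetitions with an assumed 800/200 split.
delta_f = [0.12, 0.10, 0.15, 0.11, 0.09, 0.13, 0.14, 0.10, 0.12, 0.11]
t_corrected = corrected_resampled_t(delta_f, n_train=800, n_test=200)
t_naive = naive_paired_t(delta_f)
```

The corrected statistic is always smaller in magnitude than the naive one, which makes the test more conservative when repetitions share training data.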

**Figure 4.**
Transcript features differentially predict half-life for leadered and leaderless transcripts in log phase. (A) Summary of the most important features for the leadered and leaderless half-life class prediction models in log phase. Random forest classifiers were trained using the same set of features, selected by the leaderless model, for leadered and leaderless transcripts. The 20 features with the highest Gini importance rankings for each model were combined and their relative importance rankings indicated by intensity of coloration in the heatmap. See Supplementary Table S2 for feature definitions and details. (B) Feature value distributions within each half-life class for selected features that differentially predicted half-life class for leadered and leaderless transcripts. Dimmed plots indicate that the feature was less important for that gene type. Boxes around the plots indicate that the feature was more important for that gene type. Dots and bars represent median and interquartile range. (C) Comparisons of leadered gene models using only 5′ UTR features in three conditions. Models were trained and compared using the complete set of 5′ UTR features, translation-related 5′ UTR features only, or non-translation-related features only. See Supplementary Table S2 for the specific features in each category. The performance differences between random forest classifiers and random prediction estimators were tested using Nadeau and Bengio's corrected paired t-test. The log phase and hypoxia models using translation-related features were compared to each other with the Wilcoxon rank-sum test. (D) Comparison of the importance of Shine-Dalgarno sequence features and secondary structure features in the ribosome binding regions of 5′ UTRs in the log phase model for leadered transcripts. Each dot is the average Gini importance value of a feature from 10 repetitions of the model. The difference in Gini importance was tested using the Wilcoxon rank-sum test. (E) Examples of 5′ UTR features that differentially predicted half-life class between log phase and hypoxia. For all panels, P < 0.05 *, P < 0.01 **, P < 0.001 ***.
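
Gini importance in a random forest is commonly computed as the decrease in Gini impurity contributed by splits on a feature, averaged over trees. The sketch below shows that quantity for a single split only; the feature values, class labels and threshold are hypothetical, and the source does not state which random forest implementation was used.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(values, labels, threshold):
    """Impurity reduction from splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children

# Hypothetical half-life classes for four transcripts.
half_life_class = [1, 1, 2, 2]
informative = [10, 20, 30, 40]    # feature that separates the classes
uninformative = [10, 30, 20, 40]  # feature unrelated to the classes

gain_informative = impurity_decrease(informative, half_life_class, threshold=25)
gain_uninformative = impurity_decrease(uninformative, half_life_class, threshold=25)
```

A feature that separates half-life classes cleanly yields a large decrease, while an unrelated feature yields none, which is why features can be ranked by this score.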

**Figure 5.**
Transcript features differentially predict half-life in log phase and hypoxia. (A) Summary of the most important features for the log phase, hypoxia and log-to-hypoxia-fold-change models for leadered and leaderless transcripts. For each transcript type, random forest classifiers were trained for all three conditions using the set of features selected by the log phase model. For each transcript type, the 20 features with the highest Gini importance scores in each condition were then combined and their relative importance rankings indicated by intensity of coloration in the heatmap. See Supplementary Table S2 for feature definitions and details. (B–D) Feature value distributions within each half-life class for selected features that differentially predicted half-life class in different models. Dimmed plots indicate that the feature was less important for that condition and/or gene type. Boxes around the plots indicate that the feature was more important for that condition and/or gene type. Dots and bars represent median and interquartile range. (B) Selected features that were differentially important for log phase and hypoxia models for leadered genes. (C) Selected features that were more important for log phase leaderless transcript models and are expected to impact the secondary structure of the 5′ ends of coding sequences. Plots for leadered genes are shown for comparison even though these features were not highly ranked for any leadered transcript models. (D) Selected secondary-structure-related features that were relatively highly ranked for leaderless genes in both log phase and hypoxia models but showed different patterns of distributions across half-life classes for the two conditions. (E) Log phase ribosome occupancy was quantified separately for each third of the CDS of each leaderless transcript. The x axes denote abundance of reads from ribosome-bound RNAs mapping to the indicated transcript regions. (F) For leaderless genes, the log phase ribosome occupancy for the first third of each CDS was plotted as a function of the ΔG_MFE of the first third of the CDS. r_s denotes Spearman correlation, with the statistical significance in square brackets. P < 0.01 **.

**Figure 6.**
Steady-state transcript abundance is negatively correlated with half-life, while transcript length is positively correlated with mRNA half-life in hypoxia. (A) Distributions of steady-state transcript abundance within each half-life class in log phase and hypoxia for leadered and leaderless transcripts. (B) Correlations between steady-state abundance and transcript half-life. While abundance was highly ranked in all models (see Figure 5A), its negative correlation with half-life was stronger in hypoxia. (C) Distributions of 5′ UTR lengths within half-life classes for leadered transcripts in log phase and hypoxia. This feature had a high importance ranking in hypoxia only. (D) Distributions of CDS length within each half-life class in log phase and hypoxia. This feature was highly ranked in all models. (E, F) Half-lives were calculated for only the first 300 nt of each CDS and genes were selected that had similar half-lives for the 5′ 300 nt and the whole CDS (see Supplementary Figure S10). For these subsets of genes in log phase and hypoxia, the correlations between CDS length and 5′ 300 nt half-life are shown. In (B, E, F), r_s denotes Spearman correlation, with the statistical significance in square brackets. P < 0.05 *, P < 0.01 **, P < 0.001 ***.
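
The Spearman correlation r_s reported in these panels is the Pearson correlation of ranks, with tied values given their average rank. A minimal sketch is shown below; the abundance and half-life values are invented for illustration, and significance testing is not included.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values: higher abundance paired with shorter half-life.
abundance = [5.0, 20.0, 80.0, 300.0, 1200.0]
half_life_min = [9.0, 7.5, 4.0, 3.2, 1.1]
r_s = spearman(abundance, half_life_min)
```

Because it depends only on ranks, r_s captures monotonic relationships such as the negative abundance to half-life trend without assuming linearity.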

References

1. World Health Organization. Global Tuberculosis Report 2023. Geneva: World Health Organization; 2023.
2. Rustad T.R., Minch K.J., Brabant W., Winkler J.K., Reiss D.J., Baliga N.S., Sherman D.R. Global analysis of mRNA stability in Mycobacterium tuberculosis. Nucleic Acids Res. 2013; 41:509–517.
3. Via L.E., Lin P.L., Ray S.M., Carrillo J., Allen S.S., Eum S.Y., Taylor K., Klein E., Manjunatha U., Gonzales J. et al. Tuberculous granulomas are hypoxic in guinea pigs, rabbits, and nonhuman primates. Infect. Immun. 2008; 76:2333–2340.
4. Belton M., Brilha S., Manavaki R., Mauri F., Nijran K., Hong Y.T., Patel N.H., Dembek M., Tezera L., Green J. et al. Hypoxia and tissue destruction in pulmonary TB. Thorax. 2016; 71:1145–1153.
5. Cortes T., Schubert O.T., Rose G., Arnvig K.B., Comas I., Aebersold R., Young D.B. Genome-wide mapping of transcriptional start sites defines an extensive leaderless transcriptome in Mycobacterium tuberculosis. Cell Rep. 2013; 5:1121–1131.
