mBio. 2024 Oct 16;15(10):e0235924.
doi: 10.1128/mbio.02359-24. Epub 2024 Sep 17.

Deciphering transcript architectural complexity in bacteria and archaea


John S A Mattick et al. mBio.

Abstract

RNA transcripts are potential therapeutic targets, yet bacterial transcripts have uncharacterized biodiversity. We developed an algorithm for transcript prediction called tp.py and used it to predict transcripts (mRNA and other RNAs) in Escherichia coli strains K12 and E2348/69 (Bacteria:gamma-Proteobacteria), Listeria monocytogenes strains Scott A and RO15 (Bacteria:Firmicutes), Pseudomonas aeruginosa strains SG17M and NN2 (Bacteria:gamma-Proteobacteria), and Haloferax volcanii (Archaea:Halobacteria). From >5 million E. coli K12 and >3 million E. coli E2348/69 newly generated Oxford Nanopore Technologies direct RNA sequencing reads, 2,487 K12 mRNAs and 1,844 E2348/69 mRNAs were predicted, with the K12 mRNAs encoding more than half of the predicted E. coli K12 proteins. While the number of predicted transcripts varied by strain based on the amount of sequence data used, across all strains examined the average predicted mRNA size was 1.6-1.7 kbp, and the median sizes of the 5'- and 3'-untranslated regions (UTRs) were 30-90 bp. Given the lack of bacterial and archaeal transcript annotation, most predictions were of novel transcripts, but we also predicted many previously characterized mRNAs and ncRNAs, including post-transcriptionally generated transcripts and small RNAs associated with pathogenesis in the E. coli E2348/69 LEE pathogenicity islands. We predicted small transcripts in the 100-200 bp range as well as >10 kbp transcripts for all strains. The longest transcript was the nuo operon transcript for two of the seven strains and a phage/prophage transcript for another two. This quick, easy, and reproducible method will facilitate the presentation of transcript and UTR predictions alongside coding sequences and protein predictions in bacterial genome annotation as important resources for the research community.

IMPORTANCE

Our understanding of bacterial and archaeal genes and genomes is largely focused on proteins since there have only been limited efforts to describe bacterial/archaeal RNA diversity. This contrasts with studies on the human genome, where transcripts were sequenced prior to the release of the human genome over two decades ago. We developed software for the quick, easy, and reproducible prediction of bacterial and archaeal transcripts from Oxford Nanopore Technologies direct RNA sequencing data. These predictions are urgently needed for more accurate studies examining bacterial/archaeal gene regulation, including regulation of virulence factors, and for the development of novel RNA-based therapeutics and diagnostics to combat bacterial pathogens, like those with extreme antimicrobial resistance.

Keywords: archaeal transcripts; bacterial transcripts; direct RNA sequencing; non-coding RNA (ncRNA); small RNAs; transcriptomics.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig 1
Overview of transcript, operon, and UTR definitions used. The interrelationship of genomic features described in this manuscript is illustrated, including the relationship of operon, CT region, CDS, mRNA, ncRNA, and proteins for monocistronic/polycistronic transcripts with/without transcript isoforms. The genes and genome are fictitious and used merely to illustrate the definitions of key terms.
Fig 2
Overview of the experimental/analysis workflow. Plus-strand ONT direct RNA sequencing reads (shown as lines) are mapped from 1 bp to 6 kbp in the E. coli K12 genome (NC_000913.3), which corresponds to the thr operon, and sorted by their transcription stop site for E. coli K12 grown in rich LB media [left sorted (A); right sorted (C)] and DMEM media [left sorted (B); right sorted (D)]. Our algorithm predicts three transcripts (E), and four CDSs in the annotation file are illustrated (F). The transcript for the leader peptide thrL is recovered in both growth conditions. (G) RNA was isolated from E. coli K12 grown at 37°C with aeration in LB and DMEM media. (H) Squiggle plot for two sequencing reads in tandem. In this case, the open pore state was missed by the software, resulting in a chimeric read. In both reads, the DNA adapter can be observed with lower current followed by a relatively flat plateau that corresponds to the polyA tail. This is followed by the electrical current changes associated with the RNA moving through the pore. (I) Plots show the electrical current for DNA and RNA of the same length, highlighting that the signal-to-base ratio is different for RNA and DNA. (J) The standard ONT direct RNA sequencing library preparation was used on bacterial RNA that was in vitro polyadenylated following RNA isolation, and (K) the library was loaded on an ONT MinION device for nanopore sequencing.
Fig 3
Characteristics of transcript predictions. The distribution of the number of instances of CDS by transcripts/CDS (A) and the distribution of the number of instances of transcripts by CDSs/transcript (B) are shown for E. coli K12, E. coli E2348/69, L. monocytogenes Scott A, L. monocytogenes RO15, P. aeruginosa SG17M, P. aeruginosa NN2, and H. volcanii. The data points in these discrete distributions are connected by lines for visualization purposes. The inset in each illustrates how transcripts/CDS and CDSs/transcript are defined. The size distributions of predicted 5′-UTRs (C) and 3′-UTRs (D) are plotted for each of the seven strains examined with an inset that zooms in on 0–350 bp to better illustrate the distribution of the majority of the data.
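The two ratios in panels A and B can be illustrated with a minimal Python sketch. It is not taken from tp.py. It assumes that a CDS belongs to a transcript when the CDS lies entirely within the transcript on the same strand, and the `(name, start, end, strand)` tuples and the function names are hypothetical.

```python
from collections import Counter

def count_memberships(transcripts, cdss):
    """Count CDSs per transcript and transcripts per CDS.

    Both inputs are lists of (name, start, end, strand) tuples with
    1-based inclusive coordinates. Assumption: a CDS is assigned to a
    transcript only if it is fully contained in it on the same strand.
    """
    cds_per_transcript = {t[0]: 0 for t in transcripts}
    transcripts_per_cds = {c[0]: 0 for c in cdss}
    for t_name, t_start, t_end, t_strand in transcripts:
        for c_name, c_start, c_end, c_strand in cdss:
            if c_strand == t_strand and t_start <= c_start and c_end <= t_end:
                cds_per_transcript[t_name] += 1
                transcripts_per_cds[c_name] += 1
    return cds_per_transcript, transcripts_per_cds

def distribution(counts):
    """Turn per-feature counts into {ratio value: number of features}."""
    return dict(sorted(Counter(counts.values()).items()))

# Fictitious example: two nested plus-strand transcripts and three CDSs.
transcripts = [("t1", 1, 100, "+"), ("t2", 1, 300, "+"), ("t3", 400, 450, "-")]
cdss = [("a", 10, 90, "+"), ("b", 120, 280, "+"), ("c", 500, 600, "-")]
cds_per_transcript, transcripts_per_cds = count_memberships(transcripts, cdss)
print(distribution(cds_per_transcript))   # instances of transcripts by CDSs/transcript
print(distribution(transcripts_per_cds))  # instances of CDSs by transcripts/CDS
```

The nested loop is quadratic, which is acceptable for a small illustration; an interval index would be a better fit for whole-genome annotations.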
Fig 4
fdoGHI-fdhE transcripts. Reads mapping to the minus strand of the E. coli K12 genome (NC_000913.3) grown in LB (A, C) and DMEM (B, D) are shown for a region from 4,080 to 4,088 kbp. To facilitate the visualization of the starts and stops of transcripts, reads were sorted by either their leftmost (A, B) or rightmost (C, D) position and plotted from top to bottom accordingly. Transcript predictions from our algorithm (E) and the predicted CDSs in the reference annotation file (F) are shown with arrows indicating the direction of transcription and with transcripts/CDSs on the different strands having different shading [light for the (+)-strand and dark for the (−)-strand].
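The read ordering used in panels A–D can be sketched in a few lines of Python. This is only an illustration of sorting by leftmost or rightmost mapped position. The `(start, end)` tuples and the function name `sort_reads` are hypothetical, and breaking ties on the other coordinate is an assumption the caption does not specify.

```python
def sort_reads(reads, by="left"):
    """Order mapped reads for top-to-bottom plotting.

    reads: list of (start, end) genome coordinates with start <= end.
    by="left" sorts on the leftmost position, by="right" on the rightmost.
    Ties are broken on the other coordinate (an assumption).
    """
    if by == "left":
        return sorted(reads, key=lambda r: (r[0], r[1]))
    if by == "right":
        return sorted(reads, key=lambda r: (r[1], r[0]))
    raise ValueError("by must be 'left' or 'right'")

# Fictitious reads: the row index after sorting is the plotting row.
reads = [(5, 50), (1, 80), (20, 50)]
for row, (start, end) in enumerate(sort_reads(reads, by="right")):
    print(row, start, end)
```

Sorting on the rightmost position stacks reads that share an end point next to each other, and sorting on the leftmost position does the same for a shared start point.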
Fig 5
LEE4 operon. Reads are illustrated that map to the plus strand (A, C) and minus strand (B, D) of the E. coli E2348/69 genome (GCF_014117345.2) grown in LB or DMEM for a region from 72 to 78 kbp. There are no reads from the LB conditions on the (+)-strand. To facilitate the visualization of the starts and stops of transcripts, reads were sorted by either their leftmost (A, B) or rightmost (C, D) position and plotted from top to bottom accordingly. Transcript predictions from our algorithm (E) and the predicted CDSs in the reference annotation file (F) are shown with arrows indicating the direction of transcription and with transcripts/CDSs on the different strands having different shading [light for the (+)-strand and dark for the (−)-strand].
Fig 6
Differential expression of predicted transcripts. Reads are illustrated mapping to the plus strand of the E. coli E2348/69 genome (GCF_014117345.2) grown in LB (A, C) or DMEM (B, D) from 4.730 to 4.735 Mbp sorted by either their leftmost (A, B) or rightmost (C, D) position. Transcript predictions from our algorithm (E) and the predicted CDSs in the reference annotation file (F) are shown with arrows indicating the direction of transcription. Table of transcripts per million (TPM) values calculated with Salmon (31) for transcripts and FADU (18) for CDSs (G) for the same region shown in panels A–F. For ONT reads, only Salmon was used. Plot of the log2(TPM) for all CDSs and all corresponding transcripts for ERR393285 showing the discordance between TPMs calculated based on transcripts and CDSs for the same Illumina data (H). Heatmap clustered by genes for the log2(TPM) for all CDSs calculated with FADU (18) and all corresponding transcripts calculated with Salmon (31) for Illumina and ONT reads generated from LB and DMEM (I). Differences observed between a transcript-based differential expression analysis and a CDS-based differential expression analysis with FADU (18) are summarized, showing the differences in upregulated and downregulated genes (J).
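For reference, the textbook definition of transcripts per million (TPM) and the log2 transform used for plotting can be sketched as follows. Salmon and FADU apply their own counting and length models, so this is not how either tool computes its values. The function names and the pseudocount of 1 are assumptions for illustration.

```python
import math

def tpm(counts, lengths):
    """Textbook TPM: length-normalized counts rescaled to sum to one million.

    counts: {feature: read count}; lengths: {feature: length in bp}.
    """
    rates = {name: counts[name] / lengths[name] for name in counts}
    total = sum(rates.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: rate / total * 1e6 for name, rate in rates.items()}

def log2_tpm(values, pseudocount=1.0):
    """log2 transform with a pseudocount (assumed) so that zero TPM is defined."""
    return {name: math.log2(v + pseudocount) for name, v in values.items()}

# Fictitious example: equal read counts on features of different lengths.
counts = {"x": 100, "y": 100}
lengths = {"x": 1000, "y": 3000}
values = tpm(counts, lengths)
print(values)
print(log2_tpm(values))
```

Because the normalization divides by feature length, a polycistronic transcript and the individual CDSs it contains can receive different TPM values from the same reads, which is one way the discordance described in panel H can arise.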
Fig 7
ONT sequencing characteristics that informed algorithm development. Size distribution of all of the E. coli K12 ONT sequencing reads aligning outside the rRNA reads compared to the distribution of predicted operons (A). For the 285,619 reads that are longer than the operon they map to, the length of reads is plotted relative to the size of the operon they map to (B). Normalized sequencing depth from the 3′-end to the 5′-end for E. coli K12 16S rRNA, E. coli K12 23S rRNA, and IVT RNA (SRR23886069), all thought to be complete, showing the 3′-bias in sequencing (C). Distribution of read lengths for the 1.3 kbp yeast enolase ONT spike-in (D) and an 11.7 kbp IVT RNA (E) from SRR23886069 where only reads ending at the far right position are shown. The log-transformed ratios of Illumina (SRR3111494) and ONT (SRR23886071) TPM values for RNA isolated from adult female Brugia malayi, a filarial nematode and invertebrate animal, are compared to the transcript length, illustrating how shorter transcripts have more Illumina reads relative to ONT reads than longer transcripts (F). Our interpretation is that ONT sequencing is biased toward shorter transcripts. The inset uses the heat function to show the intensity of the points in the region that contains most of the data.

