Nucleic Acids Res. 2023 Oct 27;51(19):10176-10193.

doi: 10.1093/nar/gkad750.

A multi-scale expression and regulation knowledge base for Escherichia coli

Cameron R Lamoureux¹, Katherine T Decker¹, Anand V Sastry¹, Kevin Rychel¹, Ye Gao¹, John Luke McConn¹, Daniel C Zielinski¹, Bernhard O Palsson¹,²

Affiliations

¹ Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA.
² Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Building 220, 2800 Kgs. Lyngby, Denmark.

PMID: 37713610
PMCID: PMC10602906
DOI: 10.1093/nar/gkad750

Cameron R Lamoureux et al. Nucleic Acids Res. 2023.

Abstract

Transcriptomic data is accumulating rapidly; thus, scalable methods for extracting knowledge from this data are critical. Here, we assembled a top-down expression and regulation knowledge base for Escherichia coli. The expression component is a 1035-sample, high-quality RNA-seq compendium consisting of data generated in our lab using a single experimental protocol. The compendium contains diverse growth conditions, including: 9 media; 39 supplements, including antibiotics; 42 heterologous proteins; and 76 gene knockouts. Using this resource, we elucidated global expression patterns. We used machine learning to extract 201 modules that account for 86% of known regulatory interactions, creating the regulatory component. With these modules, we identified two novel regulons and quantified systems-level regulatory responses. We also integrated 1675 curated, publicly-available transcriptomes into the resource. We demonstrated workflows for analyzing new data against this knowledge base via deconstruction of regulation during aerobic transition. This resource illuminates the E. coli transcriptome at scale and provides a blueprint for top-down transcriptomic analysis of non-model organisms.

Figures

**Figure 1.**
PRECISE-1K, a 1035-sample high-precision expression compendium, reveals expression trends in the *E. coli* transcriptome. (A) Overview of construction of PRECISE-1K compendium. Values indicate the number of *unique* categories for each condition (except evo strains). abx = antibiotics. (B) The growth in single-protocol transcriptomics samples contained in the PRECISE to PRECISE-1K databases. (C) Histogram of Pearson's r for both all replicate pairs and all non-replicate pairs (pairwise combinations of samples across projects that are not direct biological replicates). Samples included in PRECISE-1K are required to have replicate correlations of at least 0.95. (D) 2D histogram of median expression level against median absolute deviation (MAD) of expression for all 4257 genes in PRECISE-1K. Table defines expression categories as per corresponding box color/location in histogram. For each axis, category splits are defined at median ± 1 standard deviation. (E) 2D histogram of median-to-min expression difference against median-to-max expression difference for all 4257 genes in PRECISE-1K. Table defines regulatory categories as per corresponding box color/location in histogram. For each axis, low-to-medium split defined at 3 log₂[TPM] units (8-fold change from median expression); medium-to-high split defined at 6 log₂[TPM] units (64-fold change). (F) Median vs MAD expression 2D histogram, separated by availability of proteomics data in two large recent datasets (46,88). Blue = proteomics data available; red = no proteomics data available. (G) Histogram of the number of differentially expressed genes (DEGs) computed between condition pairs within the same project (n = 6103 pairs). GSH = glutathione, Met = methionine.
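
The binning described for panel D can be sketched in a few lines. This is a minimal illustration only, assuming expression values are log₂[TPM] per gene, that the splits use the population standard deviation taken across genes, and that the labels `low`/`medium`/`high` stand in for the category names in the figure's table (which this legend does not list). The function names and toy data are hypothetical.

```python
from statistics import median, pstdev


def mad(values):
    """Median absolute deviation of one gene's expression values."""
    center = median(values)
    return median([abs(v - center) for v in values])


def level(value, center, spread):
    """Place a value relative to the band center +/- spread."""
    if value < center - spread:
        return "low"
    if value > center + spread:
        return "high"
    return "medium"


def categorize(expression):
    """Map gene -> (median-expression level, MAD level).

    expression: dict of gene name -> list of log2[TPM] values.
    Splits sit at the across-gene median +/- 1 SD on each axis
    (population SD is an assumption of this sketch).
    """
    medians = {g: median(v) for g, v in expression.items()}
    mads = {g: mad(v) for g, v in expression.items()}
    med_vals, mad_vals = list(medians.values()), list(mads.values())
    med_center, med_sd = median(med_vals), pstdev(med_vals)
    mad_center, mad_sd = median(mad_vals), pstdev(mad_vals)
    return {
        g: (level(medians[g], med_center, med_sd),
            level(mads[g], mad_center, mad_sd))
        for g in expression
    }


# Toy data with made-up gene names, not values from PRECISE-1K.
toy = {
    "geneA": [1.0, 1.0, 1.0],
    "geneB": [5.0, 5.0, 5.0],
    "geneC": [5.0, 5.5, 4.5],
    "geneD": [9.0, 9.0, 9.0],
    "geneE": [5.0, 2.0, 8.0],
}
categories = categorize(toy)
print(categories)
```

Using the median rather than the mean as the band center keeps the splits from being dragged by a few extreme genes, which matches the legend's wording.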

**Figure 2.**
iModulons extracted from PRECISE-1K capture the transcriptional regulatory network. (A) A breakdown of PRECISE-1K iModulons by their annotation category: ‘Regulatory’ denotes significant enrichment of one or more known regulators; ‘Technical’ includes a single gene or technical artifact iModulon; ‘Genomic’ includes iModulons related to known genomic interventions (i.e. knockouts or segmental amplifications due to adaptive laboratory evolution); and ‘Biological’ includes iModulons containing genes of related function without significant regulator enrichment, or pointing to potential new regulons. Pie chart denotes iModulon annotation categories by percentage of variance explained. Gray wedge indicates variance unexplained by iModulons. (B) Summary of precision and recall for 117 regulatory iModulons. RegulonDB (http://regulondb.ccg.unam.mx) (31) regulons used as reference. (C) 2D histograms of median gene expression and median absolute deviation in gene expression by iModulon membership. (D) Comparison of regulators and regulatory interactions recovered by PRECISE-1K iModulons and available in RegulonDB. All = all evidence levels; Strong = only strong evidence interactions per RegulonDB; P1K+ = all interactions for which the corresponding regulator is captured by an iModulon. (E) Histogram of RegulonDB regulon sizes, colored depending on whether each RegulonDB regulon is or is not captured by at least one PRECISE-1K iModulon. (F) Histogram of the number of differential iModulon activities (DiMAs) computed between condition pairs within the same project (n = 6103; same as Figure 1G). (G) Comparison of number of DEGs and DiMAs for the same condition pairs. Linear best fit curve is shown in red, and indicates a ∼20-fold dimensionality reduction from DEGs to DiMAs. n = 4483 comparisons with non-zero DiMAs.
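
As an aid to reading panel B, precision and recall for a single iModulon against a reference regulon can be computed from gene sets. The legend does not spell out the formulas, so this sketch assumes the standard set-based definitions with the RegulonDB regulon as the reference; the function name and gene labels are hypothetical.

```python
def precision_recall(imodulon_genes, regulon_genes):
    """Return (precision, recall) of an iModulon against a reference regulon.

    precision = shared genes / genes in the iModulon
    recall    = shared genes / genes in the regulon
    """
    imodulon_genes = set(imodulon_genes)
    regulon_genes = set(regulon_genes)
    shared = imodulon_genes & regulon_genes
    precision = len(shared) / len(imodulon_genes) if imodulon_genes else 0.0
    recall = len(shared) / len(regulon_genes) if regulon_genes else 0.0
    return precision, recall


# Placeholder gene labels, not real E. coli gene names.
imodulon = {"g1", "g2", "g3", "g4"}
regulon = {"g2", "g3", "g4", "g5", "g6", "g7"}
precision, recall = precision_recall(imodulon, regulon)
print(precision, recall)
```

In this toy case three of the four iModulon genes belong to the regulon, while the iModulon covers three of the regulon's six genes.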

**Figure 3.**
iModulons discover new regulons. (A) iModulon gene weights for the putative YgeV iModulon versus median log₂[TPM]. (B) Activity of the YgeV iModulon in different media conditions. Each colored bar is the mean of two biological replicates (shown as individual black points). (C) iModulon gene weights for the putative YmfT iModulon vs. median log₂[TPM]. (D) Activity of the YmfT iModulon in different media conditions. Each colored bar is the mean of two biological replicates (shown as individual black points).

**Figure 4.**
iModulons stratify existing regulons by mode of binding. (A) Diagram of Class I and Class II CRP promoters. Arrow indicates transcription start site. σ = RNA polymerase (RNAP) sigma factor; σ_N and σ_C = sigma factor N- and C-terminal regions; β, β’, ω = RNAP core subunits; Ar1-3 = CRP activating regions (RNAP interaction sites). (B) iModulon phase plane between Crp-1 and Crp-2 iModulons. Colored points from samples involving partial and total CRP deletions. Ar regions correspond to panel A. Glyc = glycerol carbon source; fru = fructose; glc = glucose. (C) Histogram of CRP binding site locations for Crp-1 and Crp-2 iModulons. TSS = transcription start site of transcription unit for each gene. Data from RegulonDB. (D) Simulated binding curve for CRP Class I and Class II promoters. Each point indicates a particular CRP concentration. Binding modeled as 10× tighter at Class II versus Class I promoters.

**Figure 5.**
Adding public K-12 data to PRECISE-1K highlights PRECISE-1K’s stability. K-12 is a combined dataset composed of PRECISE-1K (1035 samples) plus all publicly-available high-quality RNA-seq data for *E. coli* K-12 (1675 samples). (A) The accumulation of high-quality RNA-seq data for K-12 over time. (B) K-12 iModulons by their annotation category (see Figure 2A legend). Pie chart denotes iModulon annotation categories by percentage of variance explained. The 194 annotated iModulons together explain 81% of the variance. Gray wedge indicates variance unexplained by iModulons. (C) Comparison of regulators and regulatory interactions recovered by K-12 and available in RegulonDB. All = all evidence levels; Strong = only strong evidence interactions per RegulonDB; K-12+ = all interactions for which the corresponding regulator is captured by the K-12 Dataset. P1K values from Figure 2D included for comparison. (D) Comparison of iModulons from three RNA-seq datasets: PRECISE (1); PRECISE-1K (this paper); and public K-12. Each small rectangle represents an iModulon for the corresponding dataset. Pairwise Pearson correlations were performed between PRECISE and P1K iModulons, and between P1K and K-12 iModulons; iModulons with correlations over 0.3 were considered to be the same iModulon (median correlation between PRECISE and P1K iModulons is 0.68; between P1K and K-12 is 0.70). Blue = iModulon exists in all three datasets; pink = iModulon only exists in PRECISE/PRECISE-1K; red = iModulon in PRECISE-1K/K-12 only; purple = iModulon unique to dataset. Explained variance is within each dataset (i.e. PRECISE iModulons explain ∼70% of variance in PRECISE, P1K iModulons explain ∼83% of variance in PRECISE-1K, etc.). iModulons are ordered by which dataset(s) they appear in, and sorted in decreasing order of explained variance within each dataset appearance category. (E) Overlap between the CsrA regulon per RegulonDB and the CsrA iModulon. (F) Activity of the CsrA iModulon after arrest of transcription initiation via addition of rifampicin (data from Potts *et al.* (66)).
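
The cross-dataset matching in panel D can be illustrated with a short sketch. It assumes that the quantities being correlated are iModulon gene-weight vectors over a shared gene order, and that each iModulon is paired with its single best-correlated counterpart when that correlation exceeds 0.3; the legend states only the 0.3 threshold, so the best-match rule, the function names, and the toy vectors are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / (sd_x * sd_y)


def match_imodulons(weights_a, weights_b, threshold=0.3):
    """Pair each iModulon in A with its best-correlated iModulon in B.

    weights_a, weights_b: dict of iModulon name -> gene-weight list,
    with genes in the same order in both datasets.
    Only pairs whose correlation exceeds `threshold` are returned.
    """
    matches = {}
    for name_a, vec_a in weights_a.items():
        best_name, best_r = None, threshold
        for name_b, vec_b in weights_b.items():
            r = pearson(vec_a, vec_b)
            if r > best_r:
                best_name, best_r = name_b, r
        if best_name is not None:
            matches[name_a] = best_name
    return matches


# Toy gene-weight vectors over four hypothetical genes.
dataset_a = {"im1": [1, 0, 0, 2], "im2": [0, 1, 0, 0], "im3": [0, 1, -1, 0]}
dataset_b = {"x": [2, 0, 0, 4], "y": [0, 1, 1, 0]}
matches = match_imodulons(dataset_a, dataset_b)
print(matches)
```

Here `im3` is uncorrelated with both candidates and therefore stays unmatched, which corresponds to an iModulon that is unique to its dataset.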

**Figure 6.**
PRECISE-1K and iModulons provide key insight for assessing systems-level transcriptome changes for new data. For all graphs in this figure, the example new data comes from the public K-12 Dataset AAT (67) (anaerobic-aerobic transition) (not in PRECISE-1K, but in public K-12 metadata) which took 6 time-point samples of *E. coli* from 0 to 10 min after aeration of a previously anaerobic chemostat culture. (A) Schematic highlighting selected iModulons and systems involved in aerobic transition. (B) Top 10 regulatory iModulons by maximum activity difference between within-aat and PRECISE-1K activity (z-scored). For example, z-score of 5 for ‘Microaerobic’ iModulon indicates that the maximum activity of this iModulon amongst aat samples was 5 standard deviations from the mean activity of this iModulon in PRECISE-1K. (C) Histogram of iModulon activity across all PRECISE-1K samples and in new aat project (ArcA as example). (D) Differential iModulon activity (DiMA) plot comparing iModulon activities at aeration onset and 10 min after aeration. iModulons with significant activity differences between the two time points are in blue and labeled (see Methods for DiMA details). (E) iModulon activity by time from aeration for Fnr-2 and SoxS iModulons. (F) Phase plane comparing activities of Fur iModulons for all PRECISE-1K samples (gray) and aat samples (colored). Black dots indicate PRECISE-1K samples with *fur* knocked out. (G) Phase plane comparing activities of Fnr-2 and ArcA iModulons. aat color scheme same as (F).
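
The z-score in panel B can be reproduced with simple arithmetic. This sketch assumes that the "maximum" is the new-project activity lying farthest from the reference mean, and that the reference spread is the population standard deviation of the PRECISE-1K activities; both choices, the function name, and the toy numbers are assumptions rather than the authors' stated procedure.

```python
from statistics import mean, pstdev


def max_activity_z(new_activities, reference_activities):
    """Z-score of the most extreme new activity against a reference set.

    new_activities: iModulon activities in the new project's samples.
    reference_activities: the same iModulon's activities in the
    reference compendium.
    """
    ref_mean = mean(reference_activities)
    ref_sd = pstdev(reference_activities)
    if ref_sd == 0:
        raise ValueError("reference activities have zero variance")
    # Pick the new activity farthest from the reference mean.
    extreme = max(new_activities, key=lambda a: abs(a - ref_mean))
    return (extreme - ref_mean) / ref_sd


# Toy numbers chosen so the reference has mean 0 and SD 1.
reference = [-1.0, 1.0, -1.0, 1.0]
new_project = [0.5, 5.0, -2.0]
z = max_activity_z(new_project, reference)
print(z)
```

With these toy inputs the most extreme new activity sits five reference standard deviations above the reference mean, the same situation the legend describes for the 'Microaerobic' iModulon.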

Cited by

Machine learning uncovers the Pseudomonas syringae transcriptome in microbial communities and during infection.
Bajpe H, Rychel K, Lamoureux CR, Sastry AV, Palsson BO. mSystems. 2023 Oct 26;8(5):e0043723. doi: 10.1128/msystems.00437-23. Epub 2023 Aug 28. PMID: 37638727.
A Causal Regulation Modeling Algorithm for Temporal Events with Application to Escherichia coli's Aerobic to Anaerobic Transition.
Chen Y, Mao R, Xu J, Huang Y, Xu J, Cui S, Zhu Z, Ji X, Huang S, Huang Y, Huang HY, Yen SC, Lin YC, Huang HD. Int J Mol Sci. 2024 May 22;25(11):5654. doi: 10.3390/ijms25115654. PMID: 38891842.
PGBTR: a powerful and general method for inferring bacterial transcriptional regulatory networks.
Gu WC, Ma BG. BMC Genomics. 2025 Aug 1;26(1):712. doi: 10.1186/s12864-025-11863-9. PMID: 40750847.
Deciphering the proteome of Escherichia coli K-12: Integrating transcriptomics and machine learning to annotate hypothetical proteins.
Chakraborty S, Ardern Z, Aliyu H, Kaster AK. Comput Struct Biotechnol J. 2025 Jul 24;27:3565-3578. doi: 10.1016/j.csbj.2025.07.036. eCollection 2025. PMID: 40821719.
The Environment-Dependent Regulatory Landscape of the E. coli Genome.
Röschinger T, Lee HJ, Pan RW, Solini G, Faizi K, Quan B, Chou TF, Mani M, Quake S, Phillips R. ArXiv [Preprint]. 2025 May 13:arXiv:2505.08764v1. PMID: 40463697. Preprint.

References

1. Sastry A.V., Gao Y., Szubin R., Hefner Y., Xu S., Kim D., Choudhary K.S., Yang L., King Z.A., Palsson B.O. The Escherichia coli transcriptome mostly consists of independently regulated modules. Nat. Commun. 2019; 10:5536.
2. Ziemann M., Kaspi A., El-Osta A. Digital expression explorer 2: a repository of uniformly processed RNA sequencing data. Gigascience. 2019; 8:giz022.
3. Leader D.P., Krause S.A., Pandit A., Davies S.A., Dow J.A.T. FlyAtlas 2: a new version of the Drosophila melanogaster expression atlas with RNA-seq, miRNA-seq and sex-specific data. Nucleic Acids Res. 2018; 46:D809–D815.
4. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57–74.
5. GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015; 348:648–660.
