iScience. 2023 Dec 15;27(1):108747.
doi: 10.1016/j.isci.2023.108747. eCollection 2024 Jan 19.

Deep molecular learning of transcriptional control of a synthetic CRE enhancer and its variants

Chan-Koo Kang et al. iScience.

Abstract

A massively parallel reporter assay (MPRA) measures the transcriptional activities of various cis-regulatory modules (CRMs) in a single experiment. We developed a thermodynamic computational model framework that calculates quantitative levels of gene expression directly from regulatory DNA sequences. Using the framework, we investigated the molecular mechanisms by which cis-regulatory mutations of a synthetic enhancer cause abnormal gene expression. We found that, in a human cell line, accounting for competitive binding between transcription factors (TFs) of the same family with slightly different binding preferences significantly increases the accuracy of recapitulating the transcriptional effects of thousands of single or multiple mutations. We also discovered that even when various harmful mutations occur in an activator binding site, a CRM can stably maintain or even increase gene expression through a certain form of competitive binding between family TFs. These findings enhance understanding of the effects of SNPs and indels on CRMs and would help in building robust custom-designed CRMs for biologics production and gene therapy.
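The idea of competitive binding between family TFs can be illustrated with a minimal sketch. It assumes the textbook single-site equilibrium form, in which each TF's statistical weight is its concentration times its affinity for the site; this is not necessarily the exact formulation used in the paper. The names `TF_A` and `TF_B` and all numbers are hypothetical.

```python
def competitive_occupancy(concentrations, affinities):
    """Fractional occupancy of one binding site by each competing TF.

    Assumed textbook form: Z = 1 + sum_j c_j * K_j, occupancy_i = c_i * K_i / Z,
    where the leading 1 is the weight of the unbound site.
    """
    weights = {tf: concentrations[tf] * affinities[tf] for tf in concentrations}
    partition = 1.0 + sum(weights.values())
    return {tf: weight / partition for tf, weight in weights.items()}


# Hypothetical affinities: a mutation weakens TF_A binding but favors TF_B.
equal_conc = {"TF_A": 1.0, "TF_B": 1.0}
wild_type = competitive_occupancy(equal_conc, {"TF_A": 9.0, "TF_B": 1.0})
mutant = competitive_occupancy(equal_conc, {"TF_A": 1.0, "TF_B": 4.0})

print(wild_type)  # TF_A dominates the site
print(mutant)     # TF_B takes over most of the lost occupancy
```

In this toy case the total occupancy of the site drops only from 10/11 to 5/6 even though TF_A loses most of its affinity, which is the kind of buffering the abstract attributes to competition between family members.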

Keywords: Artificial intelligence; Experimental models in systems biology; Molecular mechanism of gene regulation.

Conflict of interest statement

A patent application related to this work has been filed by Handong Global University (Korea Patent application number 10-2023-0049847).

Figures

Graphical abstract
Figure 1
Computational model framework of this study. A flow diagram of the thermodynamic computational model of a synthetic CRE enhancer and its variants. The model calculates gene expression rates from the fractional occupancies of TFBSs and 8 to 32 parameters. Transcription factor concentrations and PWMs were used to calculate fractional occupancy on the WT and variant enhancers. During training (top panels), the model parameters are fitted to minimize the sum of squared errors between the MPRA single-hit experimental data and the model estimates. We validated model reliability using three methods (middle panel). First, we conducted 5-fold cross-validation, in which 4/5 of the single-hit sequences were used for training and the remaining 1/5 were used to test predictive power. Second, we used models trained on single-hit data to predict MPRA multi-hit sequences, which were not used in training, and compared the predictions against the measurements. Third, we tested whether our models could explain simple biological phenomena, such as reversal and rearrangement of enhancer sequences. During analysis (bottom panels), we identified the most reliable models by altering some mechanisms in the model, and then analyzed the molecular mechanisms of the synthetic enhancers. We analyzed changes in the fractional occupancy and arrangement of TFBSs, and compared the contribution of each TF to gene expression in WT and variant enhancer sequences.
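The 5-fold cross-validation step described for Figure 1 can be sketched as follows. This is only an illustration of the validation scheme: a one-parameter linear model fitted by least squares stands in for the thermodynamic model, and the data are synthetic. All function names are hypothetical.

```python
import random


def fit_slope(xs, ys):
    # Closed-form minimizer of the sum of squared errors for y = slope * x.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def sum_squared_error(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def pearson_r(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def k_fold_indices(n_items, k=5, seed=0):
    # Shuffle once, then deal indices into k disjoint folds.
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k=5):
    """Train on k-1 folds, then score the held-out fold (SSE and Pearson's R)."""
    results = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        slope = fit_slope([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        predicted = [slope * xs[i] for i in test_idx]
        observed = [ys[i] for i in test_idx]
        results.append((sum_squared_error(predicted, observed),
                        pearson_r(predicted, observed)))
    return results


# Synthetic stand-in data: 50 "sequences" with a noisy linear response.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
results = cross_validate(xs, ys)
for fold, (sse, r) in enumerate(results, start=1):
    print(f"fold {fold}: held-out SSE = {sse:.3f}, Pearson's R = {r:.3f}")
```

Each sequence is held out exactly once, so the five held-out scores together cover the whole single-hit set without ever scoring a sequence the model was trained on.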
Figure 2
A CRE enhancer and its mutational activities. (A) Synthetic cAMP response enhancer sequences. The red box indicates the CRE sites and the blue box represents the cryptic region. (B and C) The x axis shows the enhancer position at which variants were introduced. The substituted base is shown at the top left of each panel. The y axis shows Δactivity. (B) MPRA experimental results. (C) Fitting results of the model with 7 ATF/CREB family TFs. Fitting results of the baseline model (B.M) are shown at the bottom left of each panel. (D) Motif logos for the PWMs used in the 7TF models.
Figure 3
Inclusion of family and non-family TFs. (A) Correlation between experimental data and estimates from models with the indicated TFs added. The dashed line separates the CREB1 self-competition model (CREB1_self model) from 2TFs models that include non-ATF/CREB family TFs (non-family TFs) or ATF/CREB family TFs. The red line shows the Pearson's R of the best CREB1_self model. Ten models were trained for each group. The boxes show the first and third quartiles, and the horizontal line inside each box marks the median. The vertical lines extending above and below the boxes cover a range of 1.5 times the interquartile range (IQR). Black dots outside the boxes represent outliers. (B–E) Correlation coefficient as a function of the number of ATF/CREB family TFs (CREB1, CREB3, CREB5, CREM, ATF1, ATF4, ATF7). Models are grouped by the number of ATF/CREB family TFs (n), with each group having 7Cn combinations of TFs (e.g., 1TF models: 7C1 = 7, 2TFs models: 7C2 = 21, 3TFs models: 7C3 = 35). Eight models were trained for each TF combination. Black dots on the plot represent the mean of each group, and outliers are highlighted with colored dots. (B and C) Models without the self-competition mechanism: (B) single-hit fitting and (C) multi-hit prediction. (D and E) Models with the self-competition mechanism: (D) single-hit fitting and (E) multi-hit prediction.
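The 7Cn group sizes quoted in the Figure 3 caption can be checked by enumerating the TF combinations directly. The sketch below uses only the seven TF names listed in the caption; the helper name `tf_combinations` is hypothetical.

```python
from itertools import combinations
from math import comb

FAMILY_TFS = ["CREB1", "CREB3", "CREB5", "CREM", "ATF1", "ATF4", "ATF7"]


def tf_combinations(n):
    """All unordered sets of n distinct ATF/CREB family TFs."""
    return list(combinations(FAMILY_TFS, n))


# Number of TF combinations in each model group (n = 1..7).
counts = {n: len(tf_combinations(n)) for n in range(1, len(FAMILY_TFS) + 1)}
print(counts)
```

With eight models trained per combination, as stated in the caption, the 2TFs group alone corresponds to 21 × 8 = 168 trained models.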
Figure 4
A comparison between our model and the QSAM model using a hypothetical experiment involving the billboard enhancer's feature. (A–C) Sequence schemes. (A) Case 1: reversed sequence. (B) Case 2: rearranged sequence. (C) Case 3: reversed and rearranged sequence. (D–I) Comparison between normal synthetic enhancer activity (x axis) and the expression rates predicted by each model (y axis). Model estimates for the intent sequences are shown at the top left of each panel. (D–F) Thermodynamic model. Expression rates of (D) Case 1, (E) Case 2, and (F) Case 3 calculated with the best 4 TF model. (G–I) QSAM model. (G) Case 1, (H) Case 2, and (I) Case 3 sequences calculated with linear QSAM. (J) Multi-hit prediction with the 4 TF model. (K) Multi-hit prediction with linear QSAM. (L and M) 5-fold cross-validation results of the best 5 models for each number of TFs. The x axis represents the number of TFs, and the y axis shows the mean Pearson's correlation for (L) the training set and (M) the validation set. Data are represented as mean ± standard error.
Figure 5
Functional binding site analysis of A/T substitutions. (A and G) Δactivity after A (A) or T (G) substitution: MPRA experimental results (top) and model calculations (bottom). (B and H) TFBSs in the WT sequence. Each box represents a TFBS, and its transparency indicates the fractional occupancy of that TFBS. The red box at the bottom represents the CRE sequences and the blue box represents the cryptic region. (C, D, I, and J) TFBSs in the CRE1 region (C and I) and the CRE4 region (D and J) after variant introduction. The wild-type base and position, as well as the substituted base, are shown in the upper right corner of each panel. For example, T11A indicates that the T at the 11th position was substituted with an A. The yellow box at the bottom shows the position where the variant was introduced. (E) Activation coefficients of the TFs. (F and K) Cumulative bar plots of the ΔΔA of CRE1/CRE4-binding TFs after variant introduction. ΔΔA can be interpreted as a contribution to transcription initiation. Contributions from the same TFBSs are connected by two lines.
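The variant notation used in the Figure 5 caption (for example, T11A) can be applied to a sequence with a small helper. This sketch assumes 1-based positions, as the caption's "11th position" suggests. The function name and the example sequence are hypothetical; the sequence is not the actual enhancer.

```python
import re


def apply_substitution(sequence, variant):
    """Apply a single-base substitution such as 'T11A' (1-based position)."""
    match = re.fullmatch(r"([ACGT])(\d+)([ACGT])", variant)
    if match is None:
        raise ValueError(f"unrecognized variant notation: {variant!r}")
    wild_type_base, position, new_base = match.group(1), int(match.group(2)), match.group(3)
    index = position - 1  # convert to 0-based indexing
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != wild_type_base:
        raise ValueError(
            f"expected {wild_type_base} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + new_base + sequence[index + 1:]


# Hypothetical 13-base sequence with a T at position 11.
example = "ACGTGACGTCTGA"
print(apply_substitution(example, "T11A"))  # ACGTGACGTCAGA
```

Checking the stated wild-type base against the sequence catches off-by-one errors in the position before any downstream occupancy calculation is run on the variant.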
