Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2023 Mar 6:2023.03.05.531189.
doi: 10.1101/2023.03.05.531189.

Massively parallel characterization of transcriptional regulatory elements in three diverse human cell types

Affiliations

Massively parallel characterization of transcriptional regulatory elements in three diverse human cell types

Vikram Agarwal et al. bioRxiv. .

Update in

  • Massively parallel characterization of transcriptional regulatory elements.
    Agarwal V, Inoue F, Schubach M, Penzar D, Martin BK, Dash PM, Keukeleire P, Zhang Z, Sohota A, Zhao J, Georgakopoulos-Soares I, Noble WS, Yardımcı GG, Kulakovskiy IV, Kircher M, Shendure J, Ahituv N. Agarwal V, et al. Nature. 2025 Mar;639(8054):411-420. doi: 10.1038/s41586-024-08430-9. Epub 2025 Jan 15. Nature. 2025. PMID: 39814889 Free PMC article.

Abstract

The human genome contains millions of candidate cis-regulatory elements (CREs) with cell-type-specific activities that shape both health and myriad disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these CREs. Here, we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of over 680,000 sequences, representing a nearly comprehensive set of all annotated CREs among three cell types (HepG2, K562, and WTC11), finding 41.7% to be functional. By testing sequences in both orientations, we find promoters to have significant strand orientation effects. We also observe that their 200 nucleotide cores function as non-cell-type-specific 'on switches' providing similar expression levels to their associated gene. In contrast, enhancers have weaker orientation effects, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we develop sequence-based models to predict CRE function with high accuracy and delineate regulatory motifs. Testing an additional lentiMPRA library encompassing 60,000 CREs in all three cell types, we further identified factors that determine cell-type specificity. Collectively, our work provides an exhaustive catalog of functional CREs in three widely used cell lines, and showcases how large-scale functional measurements can be used to dissect regulatory grammar.

PubMed Disclaimer

Conflict of interest statement

Competing interests. V.A. is an employee of Sanofi Pasteur Inc. J.S. is a scientific advisory board member, consultant and/or co-founder of Cajal Neuroscience, Guardant Health, Maze Therapeutics, Camp4 Therapeutics, Phase Genomics, Adaptive Biotechnologies, Scale Biosciences, Sixth Street Capital, and Pacific Biosciences. N.A. is the cofounder and on the scientific advisory board of Regel Therapeutics and receives funding from BioMarin Pharmaceutical Incorporated. All other authors declare no competing interests.

Figures

Extended Data Fig. 1:
Extended Data Fig. 1:. A next-generation lentiviral massively parallel reporter assay (lentiMPRA) strategy to measure the transcriptional regulatory activity of >6,000–240,000 enhancers simultaneously.
a, Designed 230nt oligos corresponding to thousands of CREs are synthesized on an Agilent array. The 1st round of PCR adds on a minimal promoter, while the 2nd round of PCR adds random barcodes to these sequences. The library is then cloned into a pLS-SceI vector harboring an EGFP reporter to generate the final element library. b, The element-barcode fragments within the library are amplified by PCR and sequenced using an Illumina NextSeq instrument. This enables the reconstruction of which barcodes correspond to which enhancers. c, The enhancer library is packaged into lentiviruses and transduced into HepG2, K562, or WTC11 cells in a series of three replicates. Cells are grown in cultured medium for three days prior to the harvesting of RNA and DNA. Each RNA and DNA sample from each replicate is extracted, and barcodes are sequenced on an Illumina NextSeq instrument. Finally, DNA and RNA-derived barcodes are counted to compute a normalized activity score for each element in each replicate.
Extended Data Fig. 2:
Extended Data Fig. 2:. Design and quality control characteristics of two pilot MPRA libraries.
a, Composition of the HepG2 and K562 pilot libraries. Thousands of putative enhancers, negative controls (dinucleotide shuffled sequences or elements lacking a signal from prior studies), and positive controls (elements with reported activity from prior studies) are included in each library,–. To maintain consistency with Fig. 1b, bars are colored according to orientation tested, with accompanying numbers indicating the number of elements tested in the category. Numbers are colored according to element type. b, Shown are histograms indicating the number of observed barcodes per element, for each of the three replicates and two pilot MPRA libraries. Shown with a vertical red line is the median number of barcodes per element. c-d, Shown are scatter plots displaying the relationship between observed DNA counts (blue), RNA counts (green), and RNA/DNA ratios (red) for all pairwise comparisons among replicates, for both the (c) HepG2 and (d) K562 pilot MPRA libraries. Also indicated is the Pearson (r) and Spearman (rho) correlation values. Candidate enhancers supported by fewer than 10 barcodes were filtered out prior to this analysis to reduce the impact of technical noise. e, Violin plots of element activity [measured as log2(RNA/DNA)] for putative enhancers, negative controls, and positive controls for each library.
Extended Data Fig. 2:
Extended Data Fig. 2:. Design and quality control characteristics of two pilot MPRA libraries.
a, Composition of the HepG2 and K562 pilot libraries. Thousands of putative enhancers, negative controls (dinucleotide shuffled sequences or elements lacking a signal from prior studies), and positive controls (elements with reported activity from prior studies) are included in each library,–. To maintain consistency with Fig. 1b, bars are colored according to orientation tested, with accompanying numbers indicating the number of elements tested in the category. Numbers are colored according to element type. b, Shown are histograms indicating the number of observed barcodes per element, for each of the three replicates and two pilot MPRA libraries. Shown with a vertical red line is the median number of barcodes per element. c-d, Shown are scatter plots displaying the relationship between observed DNA counts (blue), RNA counts (green), and RNA/DNA ratios (red) for all pairwise comparisons among replicates, for both the (c) HepG2 and (d) K562 pilot MPRA libraries. Also indicated is the Pearson (r) and Spearman (rho) correlation values. Candidate enhancers supported by fewer than 10 barcodes were filtered out prior to this analysis to reduce the impact of technical noise. e, Violin plots of element activity [measured as log2(RNA/DNA)] for putative enhancers, negative controls, and positive controls for each library.
Extended Data Fig. 3:
Extended Data Fig. 3:. Quality control characteristics of the three large-scale MPRA libraries.
a, Shown are histograms indicating the number of observed barcodes per element, for each of the three replicates and three large-scale MPRA libraries. Shown with a vertical red line is the median number of barcodes per element. b-d, Shown are scatter plots displaying the relationship between observed DNA counts (blue), RNA counts (green), and RNA/DNA ratios (red) for all pairwise comparisons among replicates, for the (b) HepG2, (c) K562, and (d) WTC11 large-scale MPRA libraries. Candidate elements supported by fewer than 10 barcodes were filtered out prior to this analysis to reduce the impact of technical noise. e, Scatter plots displaying the relationships between activity scores for the subset of elements common to both the pilot and large-scale libraries tested in HepG2 and K562 cells. Also indicated is the Pearson (r) and Spearman (rho) correlation values.
Extended Data Fig. 3:
Extended Data Fig. 3:. Quality control characteristics of the three large-scale MPRA libraries.
a, Shown are histograms indicating the number of observed barcodes per element, for each of the three replicates and three large-scale MPRA libraries. Shown with a vertical red line is the median number of barcodes per element. b-d, Shown are scatter plots displaying the relationship between observed DNA counts (blue), RNA counts (green), and RNA/DNA ratios (red) for all pairwise comparisons among replicates, for the (b) HepG2, (c) K562, and (d) WTC11 large-scale MPRA libraries. Candidate elements supported by fewer than 10 barcodes were filtered out prior to this analysis to reduce the impact of technical noise. e, Scatter plots displaying the relationships between activity scores for the subset of elements common to both the pilot and large-scale libraries tested in HepG2 and K562 cells. Also indicated is the Pearson (r) and Spearman (rho) correlation values.
Extended Data Fig. 4:
Extended Data Fig. 4:. Properties of promoter activity in three cell types.
a-b, Scatter plots of activity scores for sense-oriented promoters tested in the MPRA and endogenous gene expression levels for (a) K562 and (b) WTC11 cells. Expression levels follow a bimodal distribution. Also indicated are the Pearson (r) and Spearman (rho) correlation values. c, Upper triangular heatmap indicating the correlation between the sense (s) and antisense (as) orientations of promoters tested in the MPRA as well as endogenous gene expression levels measured in transcripts per million (TPM) using RNA-seq, filtered for the set of genes with detectable expression (i.e., >0 TPM). The sizes of the circles are proportional to the Pearson correlations. d-f, Scatter plots of activity scores for sense-oriented promoters tested in the MPRA and endogenous gene expression levels for (d) HepG2, (e) K562, and (f) WTC11 cells, filtered for the set of genes with detectable expression (i.e., >0 TPM). g, Set of motifs enriched in the top 1,000 most active vs. bottom 1,000 least active promoters (i.e., as measured by large-scale MPRAs). Motifs were discovered by STREME for each of the three cell types evaluated, and annotated by Tomtom (i.e., other than the set of CpG-rich motifs).
Extended Data Fig. 4:
Extended Data Fig. 4:. Properties of promoter activity in three cell types.
a-b, Scatter plots of activity scores for sense-oriented promoters tested in the MPRA and endogenous gene expression levels for (a) K562 and (b) WTC11 cells. Expression levels follow a bimodal distribution. Also indicated are the Pearson (r) and Spearman (rho) correlation values. c, Upper triangular heatmap indicating the correlation between the sense (s) and antisense (as) orientations of promoters tested in the MPRA as well as endogenous gene expression levels measured in transcripts per million (TPM) using RNA-seq, filtered for the set of genes with detectable expression (i.e., >0 TPM). The sizes of the circles are proportional to the Pearson correlations. d-f, Scatter plots of activity scores for sense-oriented promoters tested in the MPRA and endogenous gene expression levels for (d) HepG2, (e) K562, and (f) WTC11 cells, filtered for the set of genes with detectable expression (i.e., >0 TPM). g, Set of motifs enriched in the top 1,000 most active vs. bottom 1,000 least active promoters (i.e., as measured by large-scale MPRAs). Motifs were discovered by STREME for each of the three cell types evaluated, and annotated by Tomtom (i.e., other than the set of CpG-rich motifs).
Extended Data Fig. 5:
Extended Data Fig. 5:. Alternative biochemical features that could explain element activity in large-scale MPRA libraries.
a-c, Pearson correlation matrix between the top 30 features from Fig. 3b, shown as rows, and other features sharing a Pearson correlation either ≤ −0.8 or ≥ 0.8, shown as columns, for (a) HepG2, (b) K562, and (c) WTC11 cells. Hierarchical clustering was used to group features exhibiting similar correlation patterns.
Extended Data Fig. 5:
Extended Data Fig. 5:. Alternative biochemical features that could explain element activity in large-scale MPRA libraries.
a-c, Pearson correlation matrix between the top 30 features from Fig. 3b, shown as rows, and other features sharing a Pearson correlation either ≤ −0.8 or ≥ 0.8, shown as columns, for (a) HepG2, (b) K562, and (c) WTC11 cells. Hierarchical clustering was used to group features exhibiting similar correlation patterns.
Extended Data Fig. 6:
Extended Data Fig. 6:. Architecture, interpretation, and performance of MPRAnn.
a, Complete architecture of the MPRAnn model. Indicated for each layer is the layer name and dimensionality of the input and output matrices. b, Set of enriched motifs discovered by TF-MoDISco-lite for each of the three cell types evaluated. TFBSs associated with transcriptional inhibition (e.g., REST) are oriented upside down. TFBSs detected in at least two cell types (i.e., likely bound to housekeeping TFs) are shown in bold. c-f, Scatter plots showing the correlation between predicted genetic variant effects by MPRAnn and observed variant effects, as detected in a saturation mutagenesis MPRA experiment testing the (c) SORT1 enhancer, (d) PKLR promoter, (e) LDLR promoter, and (f) F9 promoter.
Extended Data Fig. 6:
Extended Data Fig. 6:. Architecture, interpretation, and performance of MPRAnn.
a, Complete architecture of the MPRAnn model. Indicated for each layer is the layer name and dimensionality of the input and output matrices. b, Set of enriched motifs discovered by TF-MoDISco-lite for each of the three cell types evaluated. TFBSs associated with transcriptional inhibition (e.g., REST) are oriented upside down. TFBSs detected in at least two cell types (i.e., likely bound to housekeeping TFs) are shown in bold. c-f, Scatter plots showing the correlation between predicted genetic variant effects by MPRAnn and observed variant effects, as detected in a saturation mutagenesis MPRA experiment testing the (c) SORT1 enhancer, (d) PKLR promoter, (e) LDLR promoter, and (f) F9 promoter.
Extended Data Fig. 6:
Extended Data Fig. 6:. Architecture, interpretation, and performance of MPRAnn.
a, Complete architecture of the MPRAnn model. Indicated for each layer is the layer name and dimensionality of the input and output matrices. b, Set of enriched motifs discovered by TF-MoDISco-lite for each of the three cell types evaluated. TFBSs associated with transcriptional inhibition (e.g., REST) are oriented upside down. TFBSs detected in at least two cell types (i.e., likely bound to housekeeping TFs) are shown in bold. c-f, Scatter plots showing the correlation between predicted genetic variant effects by MPRAnn and observed variant effects, as detected in a saturation mutagenesis MPRA experiment testing the (c) SORT1 enhancer, (d) PKLR promoter, (e) LDLR promoter, and (f) F9 promoter.
Extended Data Fig. 7:
Extended Data Fig. 7:. Quality control characteristics of the joint MPRA library.
a, Histograms indicating the number of observed barcodes per element, for each of the three replicates and three cell types tested using the joint MPRA library. Shown with a vertical red line is the median number of barcodes per element. b-d, Scatter plots displaying the relationship between observed DNA counts (blue), RNA counts (green), and RNA/DNA ratios (red) for all pairwise comparisons among replicates, for the joint MPRA library tested in (b) HepG2, (c) K562, and (d) WTC11 cells. Candidate elements supported by fewer than 10 barcodes were filtered out prior to this analysis to reduce the impact of technical noise. e, Scatter plots displaying the relationships between activity scores for the subset of elements common to both the joint and large-scale MPRA libraries tested in HepG2, K562, and WTC11 cells. Also indicated are the Pearson (r) and Spearman (rho) correlation values.
Extended Data Fig. 7:
Extended Data Fig. 7:. Quality control characteristics of the joint MPRA library.
a, Histograms indicating the number of observed barcodes per element, for each of the three replicates and three cell types tested using the joint MPRA library. Shown with a vertical red line is the median number of barcodes per element. b-d, Scatter plots displaying the relationship between observed DNA counts (blue), RNA counts (green), and RNA/DNA ratios (red) for all pairwise comparisons among replicates, for the joint MPRA library tested in (b) HepG2, (c) K562, and (d) WTC11 cells. Candidate elements supported by fewer than 10 barcodes were filtered out prior to this analysis to reduce the impact of technical noise. e, Scatter plots displaying the relationships between activity scores for the subset of elements common to both the joint and large-scale MPRA libraries tested in HepG2, K562, and WTC11 cells. Also indicated are the Pearson (r) and Spearman (rho) correlation values.
Extended Data Fig. 8:
Extended Data Fig. 8:
Violin plots of element activity [measured as log2(RNA/DNA)] in each of the three cell types for different element categories represented in the joint MPRA library shown in Fig. 5a. The difference between each pair of distributions tested was evaluated with a one-sided Wilcoxon rank-sum test, adjusted with a Bonferroni correction to account for the total number of hypothesis tests.
Extended Data Fig. 9:
Extended Data Fig. 9:. Comparison of promoter and enhancer activities from a joint MPRA library tested in three cell types.
Scatter matrix displaying scatter plots corresponding to each of the three pairs of possible inter-cell-type comparisons (lower diagonal elements), for each of four element categories: i) protein-coding gene promoters, ii) putative enhancers selected from HepG2 cells, iii) putative enhancers selected from K562 cells, and iv) putative enhancers selected from WTC11 cells. Shown on the diagonal is a histogram of element activity scores [measured as log2(RNA/DNA)]. Also shown are Pearson correlation values among each pair of comparisons, with the size of the text proportional to the magnitude of the correlation coefficient (upper diagonal elements).
Extended Data Fig. 10:
Extended Data Fig. 10:. Comparison of element specificity scores derived from a joint MPRA library tested in three cell types.
Heatmap of element specificity scores (i.e., computed as the deviation of an element’s activity from its mean activity in all cell types). The heatmap shown on the right is a zoomed version of the heatmap on the left for elements other than putative enhancers. Elements are colored according to their category in the key provided.
Extended Data Fig. 11:
Extended Data Fig. 11:. Interpretation of cell-type-specific motifs detected by MPRAnn.
Set of enriched cell-type-specific motifs discovered by TF-MoDISco-lite for each of the three cell types evaluated on the joint MPRA library.
Fig. 1:
Fig. 1:. Comprehensive functional validation of CREs in three ENCODE cell types using lentiMPRA.
a, UCSC genome browser tracks annotating, from top to bottom: i) five enhancers (HS1-HS5) of the globin locus tested in the pilot K562 MPRA library; ii) DNase hypersensitivity signal in K562 cells; iii) base conservation among 100 vertebrate species; iv) gene models in the locus shown; and v) enhancer activity scores from the pilot K562 MPRA library for each of the five enhancers tested. b, Schematic of lentiMPRA strategy for large-scale libraries. Thousands of CREs including putative enhancers (i.e., marked by DNase peaks in the respective cell-type) and promoters [i.e., centered at the transcriptional start site (TSS) of protein-coding genes] are inserted in a reporter plasmid in both orientations together with barcodes. The libraries are infected into HepG2, K562, and WTC11 cells using lentivirus, and the integrated DNA and transcribed RNA barcodes are sequenced to quantify CRE activity. c, Composition of the HepG2, K562, and WTC11 libraries. Thousands of putative enhancers and promoters, negative controls (dinucleotide shuffled sequences or elements lacking a signal from prior studies), and positive controls (elements with reported activity from prior studies) are included in each library,,. Bars are colored according to orientation tested, with accompanying numbers indicating the number of elements tested in each category. Numbers are colored according to element type. d, Violin plots of element activity, measured as log2(RNA/DNA) ratios, for CREs, negative controls, and positive controls for each library.
Fig. 2:
Fig. 2:. Orientation-dependent properties of human enhancers and promoters.
a, Beeswarm plot of the Pearson correlation values corresponding to each of the three pairwise comparisons among the three replicates. The correlations are computed between observed CRE activity values for elements positioned either in the same [Forward (F) vs. F and Reverse (R) vs. R)] or opposite (F vs. R and R vs. F) orientations. b, Scatter plots of the average activity score for each CRE in the F vs. R orientation. Regions are colored according to the density of data from light blue (low density) to yellow (high density). Also indicated are the Pearson (r) and Spearman (rho) correlation values. c, Boxplots showing the distribution of strand asymmetries for promoters and putative enhancers for each cell type. Indicated is the median residual value (bar) and 25th and 75th percentiles (box). The difference between each pair of distributions was evaluated with a one-sided Wilcoxon rank-sum test. d, Upper triangular heatmap indicates the correlation between the sense (s) and antisense (as) orientations of promoters tested in the MPRA as well as endogenous gene expression levels measured in transcripts per million (TPM) using RNA-seq. The sizes of the circles are proportional to the Pearson correlations. e, Scatter plots of activity scores for sense-oriented promoters tested in the MPRA and endogenous gene expression levels for HepG2 cells. Expression levels follow a bimodal distribution.
Fig. 3:
Fig. 3:. Prediction of sequence activity based on biochemical features.
a, Scatter plot indicating relationship between model predictions and observed element activity scores for each cell type. Pearson (r) and Spearman (rho) correlation values are shown after concatenating the observations for all 10 cross-validation folds of held-out data. b, The top 30 coefficients derived from lasso regression models trained independently on each cell type.
Fig. 4:
Fig. 4:. Sequence-based models predict regulatory element activity.
a, MPRAnn is a deep convolutional neural network architecture trained to predict CRE activity from an input sequence of the tested element. b, Violin plots showing the performances of the trained MPRAnn and EnformerMPRA models on each of the ten cross-validation folds of held-out data, relative to the corresponding performances from our biochemical lasso regression models. An improvement relative to another model was evaluated with a one-sided, paired t-test. c, Scatter plot indicating relationship between EnformerMPRA model predictions and observed element activity scores for each cell type. Also indicated are the Pearson (r) and Spearman (rho) correlation values after concatenating the observations for all ten folds of held-out data. d, Set of enriched motifs discovered by TF-MoDISco-lite; shown on the left are the top three motifs detected across multiple cell types (i.e., bound by housekeeping TFs) and on the right the top motif detected for each cell type. See also Extended Data Fig. 6b for a comprehensive list of detected motifs. e, Saturation mutagenesis data from the SORT1 enhancer. Shown in the top row is the reference sequence scaled to the mean effect size among all alternative mutations, annotated by seven out of thirteen significant TFBSs that match known motifs. Measured effect sizes of individual variants are displayed in the second row. The bottom row shows MPRAnn predictions as well as corresponding Pearson (r) and Spearman (rho) correlation values to the observed data.
Fig. 5:
Fig. 5:. Assessment of CRE cell-type-specific activity.
a, Composition of the joint library tested in HepG2, K562, and WTC11 cells. A similar proportion of putative enhancers sampled from each cell type were tested as well as a smaller number of promoters, negative controls (dinucleotide shuffled sequences or elements lacking a signal from prior studies), and positive controls (elements with reported activity from prior studies) are included in each library,. b, PCA biplot indicating the second (PC2) and third (PC3) principal components, with a random sample of up to 1,000 data points sampled from each element category tested. The loading vectors (i.e., corresponding to each cell type) as well as ellipses fitting the regions of highest density for each element category are also shown. c, Violin plots showing the distribution of element specificity scores for each element category, alongside information about which distributions shows a median significantly greater than zero (one-sided Wilcoxon signed-rank test, *p < 0.05). d, Performance of trained MPRAnn and EnformerMPRA models on each of ten cross-validation folds of held-out data, relative to the corresponding performance of lasso regression models trained on biochemical features. An improvement relative to another model was evaluated with a two-sided, paired t-test. e, Set of top three motifs enriched in each cell type discovered by TF-MoDISco-lite. See Extended Data Fig. 11 for a complete list of detected motifs.

References

    1. Rickels R. & Shilatifard A. Enhancer Logic and Mechanics in Development and Disease. Trends Cell Biol. 28, 608–630 (2018). - PubMed
    1. Chatterjee S. & Ahituv N. Gene Regulatory Elements, Major Drivers of Human Disease. Annu. Rev. Genomics Hum. Genet. 18:45–63., 10.1146/annurev-genom-091416-035537. Epub 2017 Apr 7. (2017). - DOI - PubMed
    1. Lee T. I. & Young R. A. Transcriptional regulation and its misregulation in disease. Cell 152, 1237–1251 (2013). - PMC - PubMed
    1. Maurano M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012). - PMC - PubMed
    1. Manolio T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–53. doi: 10.1038/nature08494. (2009). - DOI - PMC - PubMed

Publication types