Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Mar;639(8054):411-420.
doi: 10.1038/s41586-024-08430-9. Epub 2025 Jan 15.

Massively parallel characterization of transcriptional regulatory elements

Affiliations

Massively parallel characterization of transcriptional regulatory elements

Vikram Agarwal et al. Nature. 2025 Mar.

Abstract

The human genome contains millions of candidate cis-regulatory elements (cCREs) with cell-type-specific activities that shape both health and many disease states1. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these cCREs. Here we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of more than 680,000 sequences, representing an extensive set of annotated cCREs among three cell types (HepG2, K562 and WTC11), and found that 41.7% of these sequences were active. By testing sequences in both orientations, we find promoters to have strand-orientation biases and their 200-nucleotide cores to function as non-cell-type-specific 'on switches' that provide similar expression levels to their associated gene. By contrast, enhancers have weaker orientation biases, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we develop sequence-based models to predict cCRE function and variant effects with high accuracy, delineate regulatory motifs and model their combinatorial effects. Testing a lentiMPRA library encompassing 60,000 cCREs in all three cell types further identified factors that determine cell-type specificity. Collectively, our work provides an extensive catalogue of functional CREs in three widely used cell lines and showcases how large-scale functional measurements can be used to dissect regulatory grammar.

PubMed Disclaimer

Conflict of interest statement

Competing interests: V.A. is an employee of Sanofi, but performed this work independently of Sanofi. J.S. is a scientific advisory board member, consultant and/or co-founder of Cajal Neuroscience, Guardant Health, Maze Therapeutics, Camp4 Therapeutics, Phase Genomics, Adaptive Biotechnologies, Scale Biosciences, Somite Therapeutics, Sixth Street Capital and Pacific Biosciences. N.A. is a co-founder and on the scientific advisory board of Regel Therapeutics and receives funding from BioMarin Pharmaceutical Incorporated. All other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. lentiMPRA in three cell types.
a, lentiMPRA strategy for large-scale libraries. Thousands of cCREs including potential enhancers (marked by DNase peaks) and promoters (centred at the transcription start site (TSS)) are inserted into a reporter plasmid in both orientations together with barcodes. The libraries are infected into cells using lentivirus and the integrated DNA and transcribed RNA barcodes are sequenced to quantify cCRE activity. b, Composition of the HepG2, K562 and WTC11 libraries. Thousands of potential enhancers and promoters, negative controls and positive controls are included in each library,,. Bars are coloured according to orientation tested and the number of tested elements is shown above the bars and coloured according to element type. c, Violin plots of element activity, measured as log2-transformed RNA/DNA ratios for cCREs and negative and positive controls. Promoter and enhancer distributions were compared against the shuffled category using a one-sided Wilcoxon rank-sum test, followed by Bonferroni correction (*P < 10−8).
Fig. 2
Fig. 2. Properties and orientation dependence of promoters and potential enhancers.
a, Beeswarm plot of the Pearson correlation values corresponding to each of the three pairwise comparisons among the three replicates. The correlations are computed between observed cCRE activity values for elements positioned either in the same (forward versus forward (FvF) and reverse versus reverse (RvR)) or opposite (forward versus reverse (FvR) and reverse versus forward (RvF)) orientations. b, Scatter plot of the average activity score for each cCRE in the forward versus reverse orientation. Regions are coloured according to the density of data from light blue (low) to yellow (high). Pearson (r) and Spearman (ρ) correlation values are shown. c, Box plots showing the distribution of strand asymmetries for promoters and potential enhancers for each cell type. The centre line is the median residual value and box edges delineate 25th and 75th percentiles, evaluated with a one-sided Wilcoxon rank-sum test. d, The heat map indicates the correlation between the sense (s) and antisense (as) orientations of promoters and endogenous gene expression levels measured in transcripts per million (TPM) using RNA sequencing. The sizes of the circles are proportional to the Pearson correlations. e, Scatter plot of activity scores for sense-oriented promoters and endogenous gene expression levels for HepG2 cells. f,g, Volcano plots indicating the enrichment of HOCOMOCO v.12 transcription factor families in the top 1,000 versus bottom 1,000 promoters (f) and potential enhancers (g), as ranked by MPRA activity. Enrichment (measured as an odds ratio) and Benjamini–Hochberg corrected q values computed using Fisher’s exact test. Significant families above the P value acceptance threshold (dashed horizontal line) are labelled with a representative transcription factor family member with the general transcription factor family in parentheses.
Fig. 3
Fig. 3. Sequence-based models predict regulatory element activity.
a, MPRALegNet is a deep CNN trained to predict cCRE activity from an input sequence of the tested element. b, Violin plots showing the performances of sequence-based and biochemical models on ten cross-validation (CV) folds, with improvement relative to another model evaluated using a one-sided, paired t-test. B, biochemical; E, EnformerMPRA; L, MPRALegNet; N, MPRAnn; S, SeiMPRA; NS, not significant. c, Scatter plots indicating relationship between MPRALegNet predictions and observed element activity scores for each cell type. d, Set of enriched motifs discovered by TF-MoDISco-lite. Left, top three motifs detected across multiple cell types. Right, top motif detected for each cell type. e, Heat map indicating the relationship between homotypic TFBS dosage (n = 1 to 5 TFBSs) and the observed MPRALegNet-predicted response in K562 cells. f, Box plots showing the full dosage-dependent distributions for the STAT1/4/5A/5B transcription factor family, along with the expected effect in the scenario of either a multiplicative or additive model. The number of elements represented in each group is indicated above the plot. g, Heat map indicating interaction term coefficients reflecting super-multiplicative (red) and sub-multiplicative effects (blue), for elements possessing the indicated pair of heterotypic TFBSs in HepG2 cells. Coefficients fit to observed and predicted values are shown in the top right and bottom left halves of the heat map, respectively. h, Relationship of the distributions of observed and MPRALegNet-predicted activity scores in HepG2 cells for the subset of elements possessing either zero or one TFBSs corresponding to the NFYA/C or FOXD2 families, or one TFBS for both transcription factor families, adjusted for potential confounding effects induced by the presence of other transcription factors (Methods). The red horizontal line is the expected activity score under a multiplicative model. Box centre lines and edges as in Fig. 2. Whiskers show the most extreme data point up to a maximum of 1.5× the interquartile range.
Fig. 4
Fig. 4. MPRALegNet variant effect prediction.
a, Scatter plots of predicted variant effects and observed allele-specific differences detected in ChIP–seq, ATAC-seq and DNase-seq data in K562 and HepG2 cells. Indicated are the number of cases in which the predictions and observations are concordant (C; blue), discordant (D; red), or not considered (grey) because either the ASV FDR > 0.05 or the model predictions are too uncertain (P value > 0.05; Supplementary Methods). The corresponding odds ratio (OR) is also indicated (Fisher’s exact test using the 2 × 2 contingency tables provided in Supplementary Table 9). b, Saturation mutagenesis data from the PKLR enhancer. Top, the reference sequence scaled to the mean effect size among all alternative mutations, annotated by 6 out of 8 significant TFBSs that match known motifs. Middle, measured effect sizes of individual variants. Bottom, MPRALegNet predictions with corresponding Pearson (r) and Spearman (ρ) correlation values.
Fig. 5
Fig. 5. Assessment of cCRE cell-type-specific activity.
a, Composition of the joint library tested in all three cells. b, Detection of active elements in K562 cells lacking DNase signal. We evaluated the DNase signal for the subset of elements with low K562 DNase accessibility (left of vertical red line, indicated by red bracket) in the other two cell types and then quantified a smoothed kernel density estimate of MPRA activity in K562 cells. c, Principal components analysis biplot indicating the second (PC2) and third (PC3) principal components, with a random sample of up to 1,000 data points from each element category plotted. The loading vectors (corresponding to each cell type) as well as ellipses fitting the regions of highest density for each element category are also shown. d, Violin plots showing the distribution of element specificity scores for each element category, alongside information about which distributions show a median significantly greater than zero (one-sided Wilcoxon signed-rank test, *P < 0.05). Inner boxplots are plotted as described for Fig. 3f,h. e, Performance of trained MPRAnn, MPRALegNet, SeiMPRA and EnformerMPRA models on each of ten cross-validation folds of held-out data, relative to the corresponding performance of lasso regression models trained on biochemical features, with improvement relative to another model evaluated with a one-sided, paired t-test.
Extended Data Fig. 1
Extended Data Fig. 1. A next-generation lentiviral massively parallel reporter assay (lentiMPRA) strategy to measure the transcriptional regulatory activity of > 6,000–240,000 enhancers simultaneously.
a, Designed 230nt oligos corresponding to thousands of cCREs are synthesized on an Agilent array. The 1st round of PCR adds on a minimal promoter, while the 2nd round of PCR adds random barcodes to these sequences. The library is then cloned into a pLS-SceI vector harboring an EGFP reporter to generate the final element library. b, The element-barcode fragments within the library are amplified by PCR and sequenced using an Illumina NextSeq instrument. This enables reconstruction of element-barcode pairings. c, The element library is packaged into lentiviruses and transduced into HepG2, K562, or WTC11 cells in a series of three replicates. Cells are grown in cultured medium for three days prior to the harvesting of RNA and DNA. Each RNA and DNA sample from each replicate is extracted, and barcodes are sequenced on an Illumina NextSeq instrument. Finally, DNA and RNA-derived barcodes are counted to compute a normalized activity score for each element in each replicate. d, UCSC genome browser tracks annotating, from top to bottom: i) Lead single nucleotide polymorphisms (SNPs) from published Genome-wide Association Studies (GWAS); ii) Common SNPs from the 1000 Genomes Phase 3 dataset; iii) GENCODE gene track; iv) MPRA activity scores from the pilot K562 MPRA library for each of the five enhancers tested, with stronger red indicative of higher activity; v) MPRA scores corresponding to the large-scale K562 MPRA library, tested in both orientations; vi) H3K27Ac; vii) DNase I hypersensitivity signal in K562 cells; viii) base conservation among 100 vertebrate species; and ix) the five enhancers (HS1-HS5) of the globin locus tested in the pilot and large-scale K562 MPRA libraries. Image of DNA sequencer created with BioRender.com.
Extended Data Fig. 2
Extended Data Fig. 2. MPRA activity in selected disease loci.
a-f, UCSC genome browser tracks annotating, from top to bottom: i) Lead single nucleotide polymorphisms (SNPs) from published Genome-wide Association Studies (GWAS); ii) Common SNPs from the 1000 Genomes Phase 3 dataset; iii) GENCODE gene track; iv) MPRA activity scores from the pilot K562 MPRA library for each of the five enhancers tested, with stronger red indicative of higher activity; v) MPRA scores corresponding to the large-scale K562 MPRA library, tested in both orientations; vi) H3K27Ac and vii) DNase I hypersensitivity signal in K562 cells; viii) base conservation among 100 vertebrate species. Snapshots provided for BCL11A (a), GATA1 (b), and HBA2 (c). Additional loci are shown in Supplementary Fig. 3.
Extended Data Fig. 3
Extended Data Fig. 3. Properties of promoter activity in three cell types.
a-b, Scatter plots of activity scores for sense-oriented promoters tested in the MPRA and endogenous gene expression levels for (a) K562 and (b) WTC11 cells. Expression levels follow a bimodal distribution. Also indicated are the Pearson (r) and Spearman (rho) correlation values. c, Upper triangular heatmap indicating the correlation between the sense (s) and antisense (as) orientations of promoters tested in the MPRA as well as endogenous gene expression levels measured in transcripts per million (TPM) using RNA-seq, filtered for the set of genes with detectable expression (i.e., > 0 TPM). The sizes of the circles are proportional to the Pearson correlations. d, Alternative representation of the data shown in Fig. 2d and panel (c), showing the Pearson correlation between each pair of measurements indicated below the horizontal line. Black points represent all genes (i.e., akin to Fig. 2d) and red points represent the expressed subset of genes [i.e., akin to panel (c)]. e-g, Scatter plots of activity scores for sense-oriented promoters tested in the MPRA and endogenous gene expression levels for (e) HepG2, (f) K562, and (g) WTC11 cells, filtered for the set of genes with detectable expression (i.e., > 0 TPM). h-j, Scatter plots of activity scores for sense-oriented promoters tested in the MPRA and endogenous gene expression levels for HepG2 (h), K562 (i), and WTC11 (j) cells, as measured by CAGE-seq signal in the precise promoter tested. Due to lack of availability of processed CAGE-seq for WTC11, we instead used H1-ESCs, a transcriptionally similar embryonic stem cell line.
Extended Data Fig. 4
Extended Data Fig. 4. Enriched motifs detected in three cell types.
a-b, Set of motifs enriched in the top 1,000 most active vs. bottom 1,000 least active promoters (a) or potential enhancers (b) (i.e., as measured by large-scale MPRAs). Motifs were discovered by STREME67 for each of the three cell types evaluated, and matched against the JASPAR 2022 CORE vertebrate non-redundant database using Tomtom (i.e., other than the set of CpG-rich motifs). Motifs above the horizontal line in each panel are those associated with gene activation, while motifs below the line are those associated with repression.
Extended Data Fig. 5
Extended Data Fig. 5. Architecture and performance of MPRALegNet.
a, Violin plots showing the performances of different variations of MPRALegNet on each of the ten cross-validation folds of held-out data, for different types of augmentations. “-” and “+” indicate removal or usage, respectively, of the following augmentations: i) “test-time”, whereby the mean prediction is computed for various augmentations of the test sequence; ii) “shift”, whereby a sequence was randomly shifted by 0 to +21 bp; iii) “RevComp”, whereby a sequence was randomly reverse complemented; iv) “Orientation”, whereby measured element activity scores were considered for each orientation tested, instead of the mean across both orientations; and v) “5th channel”, whereby a 5th channel was considered alongside the one-hot encoded sequence (i.e., the first 4 channels) to indicate the sequence’s orientation. b, Complete architecture of the MPRALegNet model. Indicated for each layer is the layer name and dimensionality of the input and output matrices. ‘None’ refers to the batch size used during model training. c, Impact of the size of the training set on model performance. Data from each cell type were downsampled to every 10th percentile (i.e., from 10 to 100%). Error bars represent the standard deviation of the Pearson correlations across 90 models (10 held-out folds of data x 9 trained models varying by the choice of validation set).
Extended Data Fig. 6
Extended Data Fig. 6. Motifs detected by MPRALegNet.
Set of enriched motifs discovered by TF-MoDISco-lite for each of the three cell types evaluated. Motifs shown are rank-ordered according to their “seqlet” count. TFBSs associated with transcriptional inhibition (e.g., REST) are oriented upside down and shown below the horizontal lines. TFBSs detected in at least two cell types (i.e., likely bound to housekeeping TFs) are shown in bold.
Extended Data Fig. 7
Extended Data Fig. 7. Combinatorial TFBS effects learned by MPRALegNet.
a, Effect of TF knockdown on loss of regulatory element activity, from a reanalysis of a prior MPRA. Shown are cumulative density plots for the subset of elements possessing TF binding sites to the corresponding knocked down TF, relative to a control (i.e., non-targeting guide RNA) shown in black. The x-axis indicates the minimum (i.e., most negative) fold change among three guide RNAs targeting the TF. Data shown is from the high MOI condition sampled at day 10. P-values indicate a shift in the distribution as assessed by a one-sided Kolmogorov-Smirnov (K-S) test, followed by a Bonferroni multiple hypothesis testing correction. b-e, These panels are arranged in the same scheme as Fig. 3e,f, except display results for homotypic TFBSs in HepG2 (b-c) and WTC11 (d-e) cells for the indicated TF families. f, Scatter plot of interaction terms fit to predicted and observed values for TFBS pairs in HepG2 cells. The data is the same as that presented in Fig. 3g, but also includes Pearson (r) and Spearman (rho) correlation values. g, This panel is similar to that shown in Fig. 3h, but shows an example of a pair of heterotypic TFBSs that exhibit a sub-multiplicative effect when co-occurring. h-k, These panels are arranged in the same scheme as Fig. 3g and panel (f), except display results for K562 (h-i) and WTC11 (j-k) cells for the indicated TF families. l, This panel is similar to that shown in Fig. 3h, but shows an example of a pair of heterotypic TFBSs that exhibit a super-multiplicative effect when co-occurring in WTC11 cells.
Extended Data Fig. 8
Extended Data Fig. 8. Variant effect predictions in the RBM38 and LMO2 loci.
a, UCSC genome browser snapshot of the RBM38 locus, showing from top to bottom: i) Lead SNPs from published GWAS; ii) variant effect predictions derived from MPRALegNet for LD variants with GWAS lead SNPs (R2 ≥ 0.8); iii) MPRA activity scores from the pilot K562 MPRA library for each of the five enhancers tested, with stronger red indicative of higher activity; iv) MPRA scores corresponding to the large-scale K562 MPRA library, tested in both orientations; v) H3K27Ac and vi) DNase I hypersensitivity signal in K562 cells; vii) base conservation among 100 vertebrate species; viii) ENCODE cCRE track. The bottom panel shows two zoomed-in regions of the implicated causal SNPs (i.e., expanded from the vertical dashed lines), several of which are located within a DNase I site. b, This panel follows the same scheme, except displays the LMO2 locus and one zoomed in region showing the implicated causal SNP located within a DNase I site having the strongest predicted effect size.
Extended Data Fig. 9
Extended Data Fig. 9. Performance of MPRALegNet in saturation mutagenesis prediction task.
a-c, Saturation mutagenesis data from the SORT1 enhancer (a), LDLR promoter (b), and F9 promoter (c). Shown in the top row is the reference sequence scaled to the mean effect size among all alternative mutations, annotated by significant TFBSs that match known motifs. Measured effect sizes of individual variants are displayed in the second row. The bottom row shows MPRALegNet predictions as well as corresponding Pearson (r) and Spearman (rho) correlation values to the observed data. d, Scatter plots showing the correlation between predicted genetic variant effects by MPRALegNet and observed variant effects, as detected in a saturation mutagenesis MPRA experiment testing the PKLR promoter, SORT1 enhancer, LDLR promoter, and F9 promoter.
Extended Data Fig. 10
Extended Data Fig. 10. Interpretation of cell-type-specific motifs detected by MPRALegNet.
Set of enriched cell-type-specific motifs discovered by TF-MoDISco-lite for each of the three cell types evaluated on the joint MPRA library. Motifs shown are rank-ordered according to their “seqlet” count.

Update of

References

    1. ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature583, 699–710 (2020). - PMC - PubMed
    1. Chatterjee, S. & Ahituv, N. Gene Regulatory Elements, Major Drivers of Human Disease. Annu. Rev. Genomics Hum. Genet.18, 45–63 (2017). - PubMed
    1. Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science337, 1190–1195 (2012). - PMC - PubMed
    1. Alexander, R. P., Fang, G., Rozowsky, J., Snyder, M. & Gerstein, M. B. Annotating non-coding regions of the genome. Nat. Rev. Genet.11, 559–571 (2010). - PubMed
    1. Inoue, F. & Ahituv, N. Decoding enhancers using massively parallel reporter assays. Genomics106, 159–164 (2015). - PMC - PubMed

MeSH terms