Genome Biol. 2019 Sep 2;20(1):183.
doi: 10.1186/s13059-019-1787-z.

MPRAnalyze: statistical framework for massively parallel reporter assays

Tal Ashuach et al. Genome Biol. 2019.

Abstract

Massively parallel reporter assays (MPRAs) can measure the regulatory function of thousands of DNA sequences in a single experiment. Despite growing popularity, MPRA studies are limited by a lack of a unified framework for analyzing the resulting data. Here we present MPRAnalyze: a statistical framework for analyzing MPRA count data. Our model leverages the unique structure of MPRA data to quantify the function of regulatory sequences, compare sequences' activity across different conditions, and provide necessary flexibility in an evolving field. We demonstrate the accuracy and applicability of MPRAnalyze on simulated and published data and compare it with existing methods.

Conflict of interest statement

All authors declare that they have no competing interests.

Figures

Fig. 1
MPRAnalyze model properties and fit. a Distribution of construct abundances (DNA barcodes) across datasets, computed as the observed barcode count + 1 for visualization purposes. b A graphical representation of the MPRAnalyze model. External covariates (e.g., conditions of interest, batch effects, barcode effects) are design-dependent. Latent construct and transcript counts are related by the transcription rate α. c Goodness-of-fit plots for both DNA and RNA libraries across datasets. Expected counts were extracted from the fitted GLMs. MPRAnalyze’s model fits MPRA data well, with $R^2 > 0.86$ across all datasets. Since the Kwasnieski data only has one replicate in the DNA library, the DNA model is able to reach a perfect fit, in which case the DNA estimates used in the RNA model are identical to the original DNA counts.
Fig. 2
Comparison of MPRAnalyze’s α estimate of transcription rate with the ratio-based estimates agg.ratio: $\frac{\frac{1}{n}\sum_{i}^{n} RNA_i}{\frac{1}{m}\sum_{j}^{m} DNA_j}$; mean.ratio: $\frac{1}{n}\sum_{i}^{n} \frac{RNA_i}{DNA_i}$. a The variance measured among estimates of negative-control sequences in each dataset (these are assumed to have an identical transcription rate). b–d Barcodes were sampled and quantification was recomputed based on the partial data to measure the effect of barcode number on estimate performance (see “Methods” for further subsampling details). Analyses were performed using the full-data estimate as the ground truth. e–g MPRA data was simulated to provide an actual ground truth. In each case we measured the bias, $\text{estimate} - \text{truth}$ (b, e); the standard deviation, $\sqrt{\mathrm{Var}(\text{estimate} - \text{truth})}$ (c, f); and the Spearman correlation between the estimates and the ground truth (d, g).
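The two ratio-based estimates named in the Fig. 2 caption, and the bias and standard deviation used to score an estimator, can be sketched in a few lines of Python. This is an illustrative sketch and not code from MPRAnalyze: the function names and the toy counts are hypothetical, the mean ratio assumes paired barcodes with non-zero DNA counts, and the use of the population (rather than sample) standard deviation is an assumption.

```python
from statistics import mean, pstdev


def agg_ratio(rna, dna):
    """Aggregated ratio: mean RNA count divided by mean DNA count."""
    return mean(rna) / mean(dna)


def mean_ratio(rna, dna):
    """Mean of per-barcode RNA/DNA ratios (paired barcodes, DNA > 0)."""
    return mean([r / d for r, d in zip(rna, dna)])


def bias_and_sd(estimates, truths):
    """Mean and standard deviation of (estimate - truth).

    Assumption: population standard deviation; the caption does not
    say which variance estimator was used.
    """
    diffs = [e - t for e, t in zip(estimates, truths)]
    return mean(diffs), pstdev(diffs)


# Hypothetical barcode counts for a single sequence.
rna = [10, 40, 30]
dna = [5, 10, 30]

print("agg.ratio :", agg_ratio(rna, dna))
print("mean.ratio:", mean_ratio(rna, dna))
```

The two estimates differ whenever the per-barcode ratios are not all equal: the aggregated ratio weights barcodes by their DNA abundance, whereas the mean ratio gives every barcode equal weight and is therefore more sensitive to low-count barcodes.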
Fig. 3
Classification analysis comparisons. a Fraction of sequences identified as significantly active (BH-corrected P < 0.05) by method and class of sequence. MPRAnalyze results are shown in both control-based (red) and no-controls (orange) modes; empirical p values are based on the mean ratio (blue) or aggregated ratio (green); DESeq2 results are shown in collapsed mode (barcodes are summed within each batch, purple) or full mode (full data, light blue). The absolute number of active sequences is displayed on the bars. b Precision-recall curve. Precision is based on performance on the negative controls; recall is based on the total population of sequences, assuming all candidates are active. Error bars are ± the standard deviation of these measures across datasets. c Fraction of active sequences detected after re-running the analyses on 685 sequences from the Inoue-Kreimer dataset that were identified as active by MPRAnalyze (regular mode) and both DESeq2 modes, and the 200 controls from the same dataset. MPRAnalyze recapitulates the same results, finding that 100% of the candidates are active, whereas DESeq2 full only identifies 161 (23.5%) and DESeq2 collapsed fails to identify any active sequences.
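The "BH-corrected" threshold in the Fig. 3 caption refers to the Benjamini-Hochberg procedure for controlling the false discovery rate. A minimal sketch of the adjustment is given below; the function name and the example p values are hypothetical, and this is not the implementation used by any of the compared tools.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values
    # never increase as the raw p value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for four sequences.
raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
active = [q < 0.05 for q in adj]
print(adj, active)
```

A sequence would be called active here when its adjusted value falls below 0.05, matching the threshold stated in the caption.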
Fig. 4
Results of the comparative analysis of timepoints 0 h and 72 h in the Inoue-Kreimer dataset. a p value distributions of candidates (top) and controls (bottom). QuASAR-MPRA is poorly calibrated, whereas MPRAnalyze and both mpralm modes follow the theoretical behavior (mixture of uniform and low values). b Direct comparison of MPRAnalyze to competing methods. Top panels show the biological effect size (log fold-change); bottom panels show the statistical significance (BH-corrected p; dotted lines mark the 0.05 threshold). c Venn diagram for MPRAnalyze and mpralm (both modes). The numbers in each area are (top) the total number of sequences in the area, and (bottom) the number of decreasing-activity sequences (left) and increasing-activity sequences (right). d Enrichment of transcription factor binding sites in differentially active sequences as determined by each method. The solid line represents the threshold of 0.05 (see “Methods” for further details).
Fig. 5
Performance evaluation in allelic comparison. a, b p value density of the three evaluated methods in both cell lines. c–f logFC values between methods in each cell type show that all methods find a similar biological signal. g–i logFC values between cell types for each method. Some differences are expected, but overall values are highly correlated. j Schematic of the enrichment analysis, testing cell-line-specific functional deletions for enrichment of motifs that were gained or lost by those deletions. k, l Results of motif enrichment analyses. Transcription factors with significant enrichment (FDR < 0.05) are labeled.
