Genome Biol. 2019 Sep 2;20(1):183.

doi: 10.1186/s13059-019-1787-z.

MPRAnalyze: statistical framework for massively parallel reporter assays

Tal Ashuach¹ ², David S Fischer³ ⁴, Anat Kreimer¹ ⁵ ⁶, Nadav Ahituv⁵ ⁶, Fabian J Theis³, Nir Yosef⁷ ⁸ ⁹ ¹⁰

Affiliations

¹ Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, California, USA.
² Center for Computational Biology, University of California Berkeley, Berkeley, California, USA.
³ Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany.
⁴ TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany.
⁵ Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, USA.
⁶ Institute for Human Genetics, University of California San Francisco, San Francisco, California, USA.
⁷ Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, California, USA. niryosef@berkeley.edu.
⁸ Center for Computational Biology, University of California Berkeley, Berkeley, California, USA. niryosef@berkeley.edu.
⁹ Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, USA. niryosef@berkeley.edu.
¹⁰ Chan Zuckerberg BioHub, San Francisco, California, USA. niryosef@berkeley.edu.

PMID: 31477158
PMCID: PMC6717970
DOI: 10.1186/s13059-019-1787-z

Tal Ashuach et al. Genome Biol. 2019.

Abstract

Massively parallel reporter assays (MPRAs) can measure the regulatory function of thousands of DNA sequences in a single experiment. Despite growing popularity, MPRA studies are limited by a lack of a unified framework for analyzing the resulting data. Here we present MPRAnalyze: a statistical framework for analyzing MPRA count data. Our model leverages the unique structure of MPRA data to quantify the function of regulatory sequences, compare sequences' activity across different conditions, and provide necessary flexibility in an evolving field. We demonstrate the accuracy and applicability of MPRAnalyze on simulated and published data and compare it with existing methods.

Conflict of interest statement

All authors declare that they have no competing interests.

Figures

**Fig. 1**
MPRAnalyze model properties and fit. **a** Distribution of construct abundances (DNA barcodes) across datasets, computed as the observed barcode count + 1 for visualization purposes. **b** A graphical representation of the MPRAnalyze model. External covariates (e.g., conditions of interest, batch effects, barcode effects) are design-dependent. Latent construct and transcript counts are related by the transcription rate α. **c** Goodness of fit plots for both DNA and RNA libraries across datasets. Expected counts were extracted from the fitted GLMs. MPRAnalyze’s model fits MPRA data well, with R² > 0.86 across all datasets. Since the Kwasnieski data only has one replicate in the DNA library, the DNA model is able to reach a perfect fit, in which case the DNA estimates used in the RNA model are identical to the original DNA counts

**Fig. 2**
Comparison of MPRAnalyze’s α estimate of transcription rate with the ratio-based estimates (agg.ratio: $\frac{\frac{1}{n} \sum_{i=1}^{n} RNA_{i}}{\frac{1}{m} \sum_{j=1}^{m} DNA_{j}}$; mean.ratio: $\frac{1}{n} \sum_{i=1}^{n} \frac{RNA_{i}}{DNA_{i}}$). **a** The variance measured among estimates of negative-control sequences in each dataset (these are assumed to have an identical transcription rate). **b–d** Barcodes were sampled and quantification was recomputed based on the partial data to measure the effect of barcode number on estimate performance (see “Methods” for further subsampling details). Analyses were performed using the full-data estimate as the ground truth. **e–g** MPRA data was simulated to provide an actual ground truth. In each case we measured the bias (estimate − truth) (**b,e**); the standard deviation ($\sqrt{\mathrm{Var}(\text{estimate} - \text{truth})}$) (**c,f**); and the Spearman correlation between the estimates and the ground truth (**d,g**)
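
The two ratio-based estimates and the error summaries named in this caption can be illustrated with a minimal NumPy sketch. It is not MPRAnalyze code. The function names (`agg_ratio`, `mean_ratio`, `bias_and_sd`) are hypothetical, the per-barcode ratio is assumed to be RNA over DNA (the same orientation as agg.ratio), bias is assumed to be the mean of (estimate − truth), and the standard deviation uses NumPy's default population form.

```python
import numpy as np


def agg_ratio(rna, dna):
    """Aggregated ratio: mean RNA count divided by mean DNA count."""
    rna = np.asarray(rna, dtype=float)
    dna = np.asarray(dna, dtype=float)
    return float(rna.mean() / dna.mean())


def mean_ratio(rna, dna):
    """Mean of per-barcode RNA/DNA ratios (assumed orientation).

    Barcodes must be paired, and every DNA count must be nonzero.
    """
    rna = np.asarray(rna, dtype=float)
    dna = np.asarray(dna, dtype=float)
    if rna.shape != dna.shape:
        raise ValueError("rna and dna must have one entry per barcode")
    if np.any(dna == 0):
        raise ValueError("per-barcode ratio is undefined for zero DNA counts")
    return float(np.mean(rna / dna))


def bias_and_sd(estimates, truth):
    """Mean and standard deviation of (estimate - truth) across sequences."""
    err = np.asarray(estimates, dtype=float) - np.asarray(truth, dtype=float)
    return float(err.mean()), float(err.std())  # std uses ddof=0


# Toy counts for one sequence with three barcodes (made-up numbers).
rna_counts = [10, 30, 20]
dna_counts = [5, 10, 20]
print(agg_ratio(rna_counts, dna_counts))
print(mean_ratio(rna_counts, dna_counts))
```

The two estimates differ on the same counts because the mean ratio weights every barcode equally, whereas the aggregated ratio is dominated by high-count barcodes.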

**Fig. 3**
Classification analysis comparisons. **a** Fraction of sequences identified as significantly active (BH-corrected P < 0.05) by method and class of sequence. MPRAnalyze results are shown in both control-based (red) and no-controls (orange) modes; empirical p values are based on the mean ratio (blue) or aggregated ratio (green); DESeq2 results are shown in collapsed mode (barcodes are summed within each batch, purple) or full mode (full data, light blue). The absolute number of active sequences is displayed on the bars. **b** Precision-Recall curve. Precision is based on performance on the negative controls; Recall is based on the total population of sequences, assuming all candidates are active. Error bars are ± the standard deviation of these measures across datasets. **c** Fraction of active sequences detected after re-running the analyses on 685 sequences from the Inoue-Kreimer dataset that were identified as active by MPRAnalyze (regular mode) and both DESeq2 modes, and the 200 controls from the same dataset. MPRAnalyze recapitulates the same results, finding that 100% of the candidates are active, whereas DESeq2 full only identifies 161 (23.5%) and DESeq2 collapsed completely fails to identify any active sequences
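
The "BH-corrected P < 0.05" threshold used in this caption refers to the Benjamini-Hochberg adjustment. As a hedged illustration of that standard procedure (not the code used by MPRAnalyze or DESeq2), a minimal NumPy sketch follows. The function name `bh_adjust` and the toy p values are assumptions for the example.

```python
import numpy as np


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)  # indices that sort p ascending
    # Scale the k-th smallest p value by n / k.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity: running minimum taken from the largest p downward.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.clip(adjusted, 0.0, 1.0)
    out = np.empty(n)
    out[order] = adjusted  # undo the sort
    return out


# Toy p values for four sequences (made-up numbers).
raw = [0.01, 0.04, 0.03, 0.5]
adj = bh_adjust(raw)
active = adj < 0.05  # sequences called significantly active
print(adj, active)
```

In this toy input, three raw p values fall below 0.05 but only the smallest remains below the threshold after adjustment.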

**Fig. 4**
Results of the comparative analysis of timepoint 0h versus 72h in the Inoue-Kreimer dataset. **a** p value distributions of candidates (top) and controls (bottom). QuASAR-MPRA is poorly calibrated, whereas MPRAnalyze and both mpralm modes follow the theoretical behavior (mixture of uniform and low values). **b** Direct comparison of MPRAnalyze to competing methods. Top panels show the biological effect size (log fold-change); bottom panels show the statistical significance (BH-corrected p; dotted lines mark the 0.05 threshold). **c** Venn diagram for MPRAnalyze and mpralm (both modes). The numbers in each area are (top) the total number of sequences in the area, and (bottom) the number of decreasing-activity sequences (left) + increasing-activity sequences (right). **d** Enrichment of transcription factor binding sites in differentially active sequences as determined by each method. The solid line represents the threshold of 0.05 (see “Methods” for further details)

**Fig. 5**
Performance evaluation in allelic comparison. **a,b** p value density of the three evaluated methods in both cell lines. **c–f** logFC values between methods in each cell type show that all methods find a similar biological signal. **g–i** logFC values between cell types for each method. Some differences are expected, but overall values are highly correlated. **j** Schematic of the enrichment analysis, testing cell-line specific functional deletions for enrichment of motifs that were gained or lost by those deletions. **k,l** Results of motif enrichment analyses. Transcription factors with significant enrichment (FDR < 0.05) are labeled

Grants and funding

U01HG007910/HG/NHGRI NIH HHS/United States
