Nucleic Acids Res. 2024 Feb 28;52(4):1613-1627.

doi: 10.1093/nar/gkae012.

Optimizing sequence design strategies for perturbation MPRAs: a computational evaluation framework

Jiayi Liu¹,²,³, Tal Ashuach⁴, Fumitaka Inoue⁵, Nadav Ahituv⁶,⁷, Nir Yosef⁸,⁹,¹⁰, Anat Kreimer²,³

Affiliations

¹ Graduate Program in Cell & Developmental Biology, Rutgers, The State University of New Jersey, 604 Allison Rd, Piscataway, NJ 08854, USA.
² Department of Biochemistry and Molecular Biology, Rutgers, The State University of New Jersey, 604 Allison Road, Piscataway, NJ 08854, USA.
³ Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, NJ 08854, USA.
⁴ Department of Electrical Engineering and Computer Sciences and Center for Computational Biology, University of California, Berkeley, 387 Soda Hall, Berkeley, CA 94720, USA.
⁵ Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Faculty of Medicine Building B, Yoshidatachibanacho, Sakyo Ward, Kyoto 606-8303, Japan.
⁶ Department of Bioengineering and Therapeutic Sciences, University of California, 1700 4th Street, San Francisco, CA 94158, USA.
⁷ Institute for Human Genetics, University of California, 513 Parnassus Ave, San Francisco, CA 94143, USA.
⁸ Department of Systems Immunology, Weizmann Institute of Science, 234 Herzl Street, Rehovot 7610001, Israel.
⁹ Chan-Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158, USA.
¹⁰ Department of Systems Immunology, Ragon Institute of MGH, MIT, and Harvard Institute of Science, 400 Technology Square, Cambridge, MA 02139, USA.

PMID: 38296821
PMCID: PMC10939410
DOI: 10.1093/nar/gkae012

Jiayi Liu et al. Nucleic Acids Res. 2024.

Abstract

Perturbation-based massively parallel reporter assays (MPRAs) have facilitated the delineation of the roles that non-coding regulatory elements play in orchestrating gene expression. However, few computational efforts have evaluated sequence design strategies for perturbation MPRAs or established guidelines for them. In this study, we propose a framework for evaluating and comparing perturbation strategies for MPRA experiments. Within this framework, we benchmark three perturbation approaches in terms of alterations in motif-based profiles, consistency of MPRA outputs, and robustness of models that predict the activities of putative regulatory motifs. While our analyses show very similar results across multiple benchmarking metrics, predictive modeling for the approach involving random nucleotide shuffling is significantly more robust than for the other two approaches. Thus, we recommend designing sequences for perturbation MPRAs by randomly shuffling the nucleotides of the perturbed site, followed by a coherence check to prevent the introduction of other variations of the target motifs. In summary, our evaluation framework and benchmarking findings provide a resource of computational pipelines and highlight the potential of perturbation MPRAs for predicting non-coding regulatory activities.
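The recommended design (random shuffling of the perturbed site followed by a coherence check) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the flank width, the retry limit, and the use of a forward-strand regex for the GATA consensus WGATAR as the "target motif" are all assumptions made for illustration.

```python
import random
import re

# Assumed target motif: the GATA consensus WGATAR as a regex
# (W = A or T, R = A or G). The lookahead lets overlapping matches count.
TARGET = re.compile(r"(?=([AT]GATA[AG]))")


def count_target_hits(seq):
    """Count (possibly overlapping) forward-strand target-motif matches."""
    return len(TARGET.findall(seq))


def shuffle_site(seq, start, end, flank=5, max_tries=100, seed=0):
    """Shuffle seq[start:end] and return the perturbed sequence.

    A candidate is accepted only if the site actually changed and the
    window around it (site plus `flank` bases on each side) contains no
    target motif. Returns None if no candidate passes within max_tries.
    """
    rng = random.Random(seed)
    site = list(seq[start:end])
    lo, hi = max(0, start - flank), min(len(seq), end + flank)
    for _ in range(max_tries):
        rng.shuffle(site)
        candidate = seq[:start] + "".join(site) + seq[end:]
        changed = candidate[start:end] != seq[start:end]
        # Coherence check: reject shuffles that recreate the target motif,
        # including matches that span the site boundary.
        if changed and count_target_hits(candidate[lo:hi]) == 0:
            return candidate
    return None


original = "CCGTCAGATAAGGCTC"  # toy sequence with AGATAA at positions 5-10
perturbed = shuffle_site(original, start=5, end=11)
```

The check runs on a window wider than the site because a shuffle such as AAGATA, followed by the flanking G, would recreate a WGATAR match across the site boundary. A fuller version would presumably also scan the reverse strand and use the motif models from the study rather than a single regex.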


Figures

**Figure 1.**
An outline of the framework for evaluating the perturbation-based massively parallel reporter assay (MPRA) technique. In the ‘Three sequence designing approaches’ box, we used the ‘GATA_known9’ motif as an example. In detail, the GATA motifs are a group of sequences conforming to the consensus WGATAR (W = A or T and R = A or G) (marked by the wavy underline), which can be recognized and bound by GATA-binding transcription factors (45).
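As an illustration of how a degenerate consensus such as WGATAR maps onto concrete sequences, the sketch below expands IUPAC codes into a regular expression and scans both strands. The helper names and the toy sequence are assumptions; the study's own motif scanning may rely on position weight matrices rather than a consensus string.

```python
import re

# Subset of IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "R": "[AG]", "Y": "[CT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a regex that finds overlapping hits."""
    body = "".join(IUPAC[code] for code in consensus)
    return re.compile(f"(?=({body}))")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, consensus):
    """Return sorted (start, strand, match) tuples.

    `start` is 0-based on the forward strand; `match` is read 5'->3'
    on the strand where the consensus was found.
    """
    pattern = consensus_to_regex(consensus)
    width = len(consensus)
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    hits += [
        (len(seq) - m.start() - width, "-", m.group(1))
        for m in pattern.finditer(rc)
    ]
    return sorted(hits)


sites = find_sites("CCAGATAAGGTTATCTCC", "WGATAR")
# One hit on each strand: AGATAA at position 2 (+) and its
# reverse complement TTATCT at position 10 (-).
```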

**Figure 2.**
Evaluations of perturbation-wise metrics. (A) Examples of ‘hit’ and ‘fail’ sequences. Please refer to the Supplementary Notes for the full perturbation sequences. (B) A comparison of hit rates among three perturbation approaches. The ‘N/A’ category represents the sequences that are excluded from this study because their barcodes failed the sequencing quality check (Supplementary Notes). (C) Examples of ‘perturbed’ and ‘non-perturbed’ sequences. (D) A comparison of perturbation rates among three perturbation approaches.

**Figure 3.**
Evaluations of motif-based metrics. (A) An example of calculating perturbation specificity. Refer to the Supplementary Notes for the full perturbation sequences. (B) A comparison of perturbation specificity among three perturbation approaches. Significant P values (P < 0.05) are shown in red. (C) An example of calculating ‘newly introduced target motifs per sequence’. Please refer to the Supplementary Notes for the full perturbation sequences. (D) A comparison of ‘newly introduced target motifs per sequence’ among three perturbation approaches.

**Figure 4.**
Evaluation of general alteration in the number of motifs. (A) Toy examples of calculating general alteration in the number of motifs. (B–D) The results for motif perturbations: (B) the number of gained motifs, (C) the number of lost motifs and (D) the net change in the number of motifs. Significant P values (P < 0.05) are shown in red. (E–G) The results for random perturbation sequences: (E) number of gained motifs, (F) number of lost motifs and (G) net change in the number of motifs. Significant P values (P < 0.05) are shown in red.
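A rough sketch of how gained, lost and net motif counts could be tallied for one reference/perturbed sequence pair is given below. The exact definition used in the paper is not stated in this caption, so the multiset-difference rule here is an assumption, and every motif name other than GATA_known9 is made up for the example.

```python
from collections import Counter


def motif_alteration(ref_motifs, pert_motifs):
    """Compare motif hits between a reference and a perturbed sequence.

    Inputs are lists of motif names, one entry per hit. Counter
    subtraction keeps only positive differences, so each direction
    counts hits present in one sequence but not the other.
    """
    ref, pert = Counter(ref_motifs), Counter(pert_motifs)
    gained = sum((pert - ref).values())
    lost = sum((ref - pert).values())
    return {"gained": gained, "lost": lost, "net": gained - lost}


# Toy example: the perturbation removes the GATA hit and one of two
# hypothetical SP1 hits, and creates one hypothetical AP1 hit.
result = motif_alteration(
    ["GATA_known9", "SP1_known1", "SP1_known1"],
    ["SP1_known1", "AP1_known3"],
)
# result == {"gained": 1, "lost": 2, "net": -1}
```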

**Figure 5.**
Assessment of the important features representing perturbation sequences. (A) The number of important features shared by three perturbation approaches. (B) Top 30 important features of each perturbation approach. The names of features that are shared by at least two perturbation approaches are marked in bold. (C) Gene ontology enrichment analysis of the top 2500 genes represented by the TF binding factors.

**Figure 6.**
Consistency of MPRA outputs among three perturbations. (A) Number of sequences that share the same FRS identities. The bars are colored by activators (red) and repressors (blue). In the ‘intersection type’ matrix, the percentages are row-normalized, indicating the proportion of sequences belonging to different intersection types within each perturbation approach. (B) The correlation of Log₂FC between motif_PERT1 and motif_PERT2. Each dot is a perturbation sequence and is colored by the time point. (C) The correlation of Log₂FC between motif_PERT1 and motif_PERT3. (D) The correlation of Log₂FC between motif_PERT2 and motif_PERT3.

**Figure 7.**
Comparison of the Log₂FC among three perturbation approaches. The Log₂FC values are separated by time point before being compared among three perturbation approaches.

**Figure 8.**
Performance of classification models. (A) The area under the receiver-operating characteristic curve (AUROC) of different classification models. Asterisks/ns indicate levels of statistical significance, calculated by pairwise Wilcoxon rank sum tests (P-value < 0.05*, < 0.01**, < 0.001***, < 0.0001****; ns, non-significant). (B) A summary of the mean ± standard deviation values for AUROCs of classification models.
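For readers unfamiliar with the metric, AUROC can be computed directly as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which equals the Mann-Whitney U statistic divided by the number of positive-negative pairs. The sketch below uses invented labels and scores, not data from the study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    labels: 1 for positive, 0 for negative. Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 3 positives, 3 negatives, one pair mis-ranked
# (positive scored 0.4 below negative scored 0.7).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
value = auroc(labels, scores)  # 8 of 9 pairs ranked correctly
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based formulation would be preferable for large test sets.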

**Figure 9.**
Performance of regression models. (A) The Pearson correlation coefficients of different regression models. Asterisks/ns indicate levels of statistical significance, calculated by pairwise Wilcoxon rank sum tests (P-value < 0.05*, < 0.01**, < 0.001***, < 0.0001****; ns, non-significant). (B) A summary of the mean ± standard deviation values for Pearson correlation coefficients of regression models.
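The Pearson correlation coefficient used to score the regression models is the covariance of predicted and observed values divided by the product of their standard deviations. A minimal standard-library sketch follows; the observed and predicted values are invented for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Invented observed vs. predicted log2 fold changes.
observed = [-1.2, -0.4, 0.1, 0.8, 1.5]
predicted = [-0.9, -0.5, 0.3, 0.6, 1.1]
r = pearson_r(observed, predicted)
```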


Update of

Best practices for perturbation MPRA-a computational evaluation framework of sequence design strategies.
Liu J, Ashuach T, Inoue F, Ahituv N, Yosef N, Kreimer A. bioRxiv [Preprint]. 2023 Sep 29:2023.09.27.559768. doi: 10.1101/2023.09.27.559768. Update in: Nucleic Acids Res. 2024 Feb 28;52(4):1613-1627. doi: 10.1093/nar/gkae012. PMID: 37808807. Free PMC article. Preprint.

Cited by

Comprehensive network modeling approaches unravel dynamic enhancer-promoter interactions across neural differentiation.
DeGroat W, Inoue F, Ashuach T, Yosef N, Ahituv N, Kreimer A. bioRxiv [Preprint]. 2024 May 23:2024.05.22.595375. doi: 10.1101/2024.05.22.595375. Update in: Genome Biol. 2024 Aug 14;25(1):221. doi: 10.1186/s13059-024-03365-w. PMID: 38826254. Free PMC article. Preprint.
Comprehensive network modeling approaches unravel dynamic enhancer-promoter interactions across neural differentiation.
DeGroat W, Inoue F, Ashuach T, Yosef N, Ahituv N, Kreimer A. Genome Biol. 2024 Aug 14;25(1):221. doi: 10.1186/s13059-024-03365-w. PMID: 39143563. Free PMC article.

References

1. Rheinbay E., Nielsen M.M., Abascal F., Wala J.A., Shapira O., Tiao G., Hornshøj H., Hess J.M., Juul R.I., Lin Z. et al. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature. 2020; 578:102–111.
2. Agarwal V., Inoue F., Schubach M., Martin B.K., Dash P.M., Zhang Z., Sohota A., Noble W.S., Yardimci G.G., Kircher M. et al. Massively parallel characterization of transcriptional regulatory elements in three diverse human cell types. 2023; bioRxiv doi: 10.1101/2023.03.05.531189, 06 March 2023, preprint: not peer reviewed.
3. Koesterich J., An J.-Y., Inoue F., Sohota A., Ahituv N., Sanders S.J., Kreimer A. Characterization of de novo promoter variants in autism spectrum disorder with massively parallel reporter assays. Int. J. Mol. Sci. 2023; 24:3509.
4. Deng C., Whalen S., Steyert M., Ziffra R., Przytycki P.F., Inoue F., Pereira D.A., Capauto D., Norton S., Vaccarino F.M. et al. Massively parallel characterization of psychiatric disorder-associated and cell-type-specific regulatory elements in the developing human cortex. 2023; bioRxiv doi: 10.1101/2023.02.15.528663, 16 February 2023, preprint: not peer reviewed.
5. Koh K.D., Bonser L.R., Eckalbar W.L., Yizhar-Barnea O., Shen J., Zeng X., Hargett K.L., Sun D.I., Zlock L.T., Finkbeiner W.E. et al. Genomic characterization and therapeutic utilization of IL-13-responsive sequences in asthma. Cell Genom. 2022; 3:100229.
