Nucleic Acids Res. 2024 Feb 28;52(4):1613-1627.
doi: 10.1093/nar/gkae012.

Optimizing sequence design strategies for perturbation MPRAs: a computational evaluation framework

Jiayi Liu et al. Nucleic Acids Res. 2024.

Abstract

The advent of perturbation-based massively parallel reporter assays (MPRAs) has facilitated the delineation of the roles of non-coding regulatory elements in orchestrating gene expression. However, computational efforts to evaluate and establish guidelines for sequence design strategies for perturbation MPRAs remain scant. In this study, we propose a framework for evaluating and comparing various perturbation strategies for MPRA experiments. Within this framework, we benchmark three different perturbation approaches from the perspectives of alteration in motif-based profiles, consistency of MPRA outputs, and robustness of models that predict the activities of putative regulatory motifs. While our analyses show very similar results across multiple benchmarking metrics, the predictive modeling for the approach involving random nucleotide shuffling shows significantly greater robustness than that of the other two approaches. Thus, we recommend designing sequences for perturbation MPRAs by randomly shuffling the nucleotides of the perturbed site, followed by a coherence check to prevent the introduction of other variations of the target motifs. In summary, our evaluation framework and benchmarking findings provide a resource of computational pipelines and highlight the potential of perturbation MPRAs in predicting non-coding regulatory activities.
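The recommended design (shuffle the perturbed site, then check that no variant of the target motif has been introduced) can be illustrated with a minimal sketch. It assumes the GATA consensus WGATAR as the target motif and uses a plain regular-expression scan on both strands in place of the motif scanning used in the study, which is not described here. The function names `count_gata_sites` and `shuffle_site`, the 5-nt flank window, and the retry limit are hypothetical choices made for illustration only.

```python
import random
import re

# Assumed target motif: WGATAR (W = A or T, R = A or G), forward strand.
GATA_FORWARD = re.compile(r"(?=[AT]GATA[AG])")
# Reverse complement of WGATAR is YTATCW (Y = C or T, W = A or T).
GATA_REVERSE = re.compile(r"(?=[CT]TATC[AT])")


def count_gata_sites(seq):
    """Count WGATAR matches on both strands, including overlapping ones."""
    seq = seq.upper()
    return len(GATA_FORWARD.findall(seq)) + len(GATA_REVERSE.findall(seq))


def shuffle_site(seq, start, end, flank=5, max_tries=1000, rng=None):
    """Shuffle seq[start:end] and keep the first shuffle that passes the check.

    The check (an assumption of this sketch) rejects a shuffle if the site
    plus `flank` nucleotides on each side still contains a WGATAR match,
    which also catches motifs recreated across the site boundaries.
    Returns the perturbed sequence, or None if no shuffle passes.
    """
    rng = rng or random.Random()
    original_site = seq[start:end]
    site = list(original_site)
    for _ in range(max_tries):
        rng.shuffle(site)
        shuffled = "".join(site)
        if shuffled == original_site:
            continue  # shuffle left the site unchanged
        candidate = seq[:start] + shuffled + seq[end:]
        window = candidate[max(0, start - flank):end + flank]
        if count_gata_sites(window) == 0:
            return candidate
    return None


# Hypothetical toy sequence with one GATA site (AGATAA) at positions 6-11.
toy = "CCGCGCAGATAAGGCGCC"
perturbed = shuffle_site(toy, 6, 12, rng=random.Random(0))
print(toy)
print(perturbed)
```

Checking a window that extends past the shuffled site matters in this toy case: a shuffle such as AAGATA followed by the flanking G would recreate AGATAG across the boundary, and the window check rejects it.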

Figures

Graphical Abstract
Figure 1.
An outline of the framework for evaluating perturbation-based massively parallel reporter assays. In the ‘Three sequence designing approaches’ box, we used the ‘GATA_known9’ motif as an example. In detail, the GATA motifs are a group of sequences conforming to the consensus WGATAR (W = A or T and R = A or G) (marked by the wavy underline) that can be recognized and bound by GATA-binding transcription factors (45).
Figure 2.
Evaluations of perturbation-wise metrics. (A) Examples of ‘hit’ and ‘fail’ sequences. Please refer to the Supplementary Notes for the full perturbation sequences. (B) A comparison of hit rates among three perturbation approaches. The ‘N/A’ category represents the sequences that are excluded from this study because their barcodes failed the sequencing quality check (Supplementary Notes). (C) Examples of ‘perturbed’ and ‘non-perturbed’ sequences. (D) A comparison of perturbation rates among three perturbation approaches.
Figure 3.
Evaluations of motif-based metrics. (A) An example of calculating perturbation specificity. Refer to the Supplementary Notes for the full perturbation sequences. (B) A comparison of perturbation specificity among three perturbation approaches. Significant P values (P < 0.05) are shown in red. (C) An example of calculating ‘newly introduced target motifs per sequence’. Please refer to the Supplementary Notes for the full perturbation sequences. (D) A comparison of ‘newly introduced target motifs per sequence’ among three perturbation approaches.
Figure 4.
Evaluation of general alteration in the number of motifs. (A) Toy examples of calculating general alteration in the number of motifs. (B–D) The results for motif perturbations: (B) the number of gained motifs, (C) the number of lost motifs and (D) the net change in the number of motifs. Significant P values (P < 0.05) are shown in red. (E–G) The results for random perturbation sequences: (E) number of gained motifs, (F) number of lost motifs and (G) net change in the number of motifs. Significant P values (P < 0.05) are shown in red.
Figure 5.
Assessment of the important features representing perturbation sequences. (A) The number of important features shared by three perturbation approaches. (B) Top 30 important features of each perturbation approach. The names of features that are shared by at least two perturbation approaches are marked in bold. (C) Gene ontology enrichment analysis of the top 2500 genes represented by the TF binding factors.
Figure 6.
Consistency of MPRA outputs among three perturbations. (A) Number of sequences that share the same FRS identities. The bars are colored by activators (red) and repressors (blue). In the ‘intersection type’ matrix, the percentages are row-normalized, indicating the proportion of sequences belonging to different intersection types within each perturbation approach. (B) The correlation of Log2FC between motif_PERT1 and motif_PERT2. Each dot is a perturbation sequence and is colored by the time point. (C) The correlation of Log2FC between motif_PERT1 and motif_PERT3. (D) The correlation of Log2FC between motif_PERT2 and motif_PERT3.
Figure 7.
Comparison of the Log2FC among three perturbation approaches. The Log2FC values are separated by time point before being compared among three perturbation approaches.
Figure 8.
Performance of classification models. (A) The area under the receiver-operating characteristic curve (AUROC) of different classification models. Asterisks/ns indicate levels of statistical significance, calculated by pairwise Wilcoxon rank sum tests (P-value < 0.05*, < 0.01**, < 0.001***, < 0.0001****; ns, non-significant). (B) A summary of the mean ± standard deviation values for AUROCs of classification models.
Figure 9.
Performance of regression models. (A) The Pearson correlation coefficients of different regression models. Asterisks/ns indicate levels of statistical significance, calculated by pairwise Wilcoxon rank sum tests (P-value < 0.05*, < 0.01**, < 0.001***, < 0.0001****; ns, non-significant). (B) A summary of the mean ± standard deviation values for Pearson correlation coefficients of regression models.
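
The gained, lost and net motif counts named in the Figure 4 caption can be illustrated with a small sketch. It assumes each sequence is summarized as a list of motif hits (one entry per hit) and compares the reference and perturbed lists as multisets; the exact definition used in the study may differ. The function name `motif_alteration` and the motif names in the toy input are hypothetical.

```python
from collections import Counter


def motif_alteration(ref_hits, pert_hits):
    """Compare motif hits before and after perturbation.

    gained: hits present in the perturbed sequence but not in the reference
    lost:   hits present in the reference but not in the perturbed sequence
    net:    gained minus lost
    """
    ref, pert = Counter(ref_hits), Counter(pert_hits)
    gained = sum((pert - ref).values())  # Counter subtraction drops non-positive counts
    lost = sum((ref - pert).values())
    return {"gained": gained, "lost": lost, "net": gained - lost}


# Hypothetical toy input: motif names are placeholders, not results from the study.
reference = ["GATA", "SP1", "AP1"]
perturbed = ["SP1", "KLF", "KLF"]
print(motif_alteration(reference, perturbed))
```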

References

    1. Rheinbay E., Nielsen M.M., Abascal F., Wala J.A., Shapira O., Tiao G., Hornshøj H., Hess J.M., Juul R.I., Lin Z. et al. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature. 2020; 578:102–111.
    2. Agarwal V., Inoue F., Schubach M., Martin B.K., Dash P.M., Zhang Z., Sohota A., Noble W.S., Yardimci G.G., Kircher M. et al. Massively parallel characterization of transcriptional regulatory elements in three diverse human cell types. 2023; bioRxiv doi:10.1101/2023.03.05.531189, 06 March 2023, preprint: not peer reviewed.
    3. Koesterich J., An J.-Y., Inoue F., Sohota A., Ahituv N., Sanders S.J., Kreimer A. Characterization of de novo promoter variants in autism spectrum disorder with massively parallel reporter assays. Int. J. Mol. Sci. 2023; 24:3509.
    4. Deng C., Whalen S., Steyert M., Ziffra R., Przytycki P.F., Inoue F., Pereira D.A., Capauto D., Norton S., Vaccarino F.M. et al. Massively parallel characterization of psychiatric disorder-associated and cell-type-specific regulatory elements in the developing human cortex. 2023; bioRxiv doi:10.1101/2023.02.15.528663, 16 February 2023, preprint: not peer reviewed.
    5. Koh K.D., Bonser L.R., Eckalbar W.L., Yizhar-Barnea O., Shen J., Zeng X., Hargett K.L., Sun D.I., Zlock L.T., Finkbeiner W.E. et al. Genomic characterization and therapeutic utilization of IL-13-responsive sequences in asthma. Cell Genom. 2022; 3:100229.