Nat Protoc. 2020 Aug;15(8):2387-2412.

doi: 10.1038/s41596-020-0333-5. Epub 2020 Jul 8.

lentiMPRA and MPRAflow for high-throughput functional characterization of gene regulatory elements

M Grace Gordon^#^{1,2,3}, Fumitaka Inoue^#^{4,5}, Beth Martin^#⁶, Max Schubach^#^{7,8}, Vikram Agarwal^{6,9}, Sean Whalen¹⁰, Shiyun Feng^{1,2}, Jingjing Zhao^{1,2}, Tal Ashuach¹¹, Ryan Ziffra^{1,2}, Anat Kreimer^{1,2,11}, Ilias Georgakopoulos-Soares^{1,2}, Nir Yosef^{11,12}, Chun Jimmie Ye^{1,2,12,13,14}, Katherine S Pollard^{2,10,12,15}, Jay Shendure^{16,17,18}, Martin Kircher^{19,20}, Nadav Ahituv^{21,22}

Affiliations

¹ Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA.
² Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA.
³ Biological and Medical Informatics Graduate Program, University of California, San Francisco, San Francisco, CA, USA.
⁴ Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA. fumitaka.inoue@ucsf.edu.
⁵ Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA. fumitaka.inoue@ucsf.edu.
⁶ Department of Genome Sciences, University of Washington, Seattle, WA, USA.
⁷ Berlin Institute of Health (BIH), Berlin, Germany.
⁸ Charité-Universitätsmedizin Berlin, Berlin, Germany.
⁹ Calico Life Sciences LLC, South San Francisco, CA, USA.
¹⁰ Gladstone Institutes, San Francisco, CA, USA.
¹¹ Department of Electrical Engineering and Computer Sciences and Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA.
¹² Chan-Zuckerberg Biohub, San Francisco, CA, USA.
¹³ Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA.
¹⁴ Institute for Computational Health Sciences, University of California, San Francisco, San Francisco, California, USA.
¹⁵ Department of Epidemiology and Biostatistics and Institute of Computational Health Sciences, University of California, San Francisco, San Francisco, CA, USA.
¹⁶ Department of Genome Sciences, University of Washington, Seattle, WA, USA. shendure@uw.edu.
¹⁷ Howard Hughes Medical Institute, Seattle, WA, USA. shendure@uw.edu.
¹⁸ Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, USA. shendure@uw.edu.
¹⁹ Berlin Institute of Health (BIH), Berlin, Germany. martin.kircher@bihealth.de.
²⁰ Charité-Universitätsmedizin Berlin, Berlin, Germany. martin.kircher@bihealth.de.
²¹ Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA. nadav.ahituv@ucsf.edu.
²² Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA. nadav.ahituv@ucsf.edu.

^# Contributed equally.

PMID: 32641802
PMCID: PMC7550205
DOI: 10.1038/s41596-020-0333-5

M Grace Gordon et al. Nat Protoc. 2020 Aug.

Erratum in

Author Correction: lentiMPRA and MPRAflow for high-throughput functional characterization of gene regulatory elements.
Gordon MG, Inoue F, Martin B, Schubach M, Agarwal V, Whalen S, Feng S, Zhao J, Ashuach T, Ziffra R, Kreimer A, Georgakopoulos-Soares I, Yosef N, Ye CJ, Pollard KS, Shendure J, Kircher M, Ahituv N. Nat Protoc. 2021 Jul;16(7):3736. doi: 10.1038/s41596-020-00422-z. PMID: 33128032. No abstract available.

Abstract

Massively parallel reporter assays (MPRAs) can simultaneously measure the function of thousands of candidate regulatory sequences (CRSs) in a quantitative manner. In this method, CRSs are cloned upstream of a minimal promoter and reporter gene, alongside a unique barcode, and introduced into cells. If the CRS is a functional regulatory element, it will lead to the transcription of the barcode sequence, which is measured via RNA sequencing and normalized for cellular integration via DNA sequencing of the barcode. This technology has been used to test thousands of sequences and their variants for regulatory activity, to decipher the regulatory code and its evolution, and to develop genetic switches. Lentivirus-based MPRA (lentiMPRA) produces 'in-genome' readouts and enables the use of this technique in hard-to-transfect cells. Here, we provide a detailed protocol for lentiMPRA, along with a user-friendly Nextflow-based computational pipeline, MPRAflow, for quantifying CRS activity from different MPRA designs. The lentiMPRA protocol takes ~2 months, which includes sequencing turnaround time and data processing with MPRAflow.

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

**Extended Data Fig. 1 |. Sequence scheme of lentiMPRA.**
a, Synthesized CRS oligo sequence. b, Primers and their binding in first- and second-round PCR for library amplification. c, Recombination and plasmid library sequence. d, Primers and their binding in library amplification and sequencing for CRS-barcode association. e, Primers and their binding in reverse transcription, library amplification and sequencing for barcode counting.

**Extended Data Fig. 2 |. Time complexity study of MPRAflow.**
a, The Association Utility run time scales with number of reads when holding the number of FASTQ chunks at 2M reads. Because this step performs an alignment, its memory requirements are not trivial: approximately 1 GB of memory per 3M reads. b, The Count Utility run time scales with number of reads divided by the number of experiments running in parallel. This step requires little memory: 500M reads can be processed in <0.5 GB.

**Fig. 1 |. Schematics of lentiMPRA.**
a, Summary of lentiMPRA and MPRAflow. The lentiMPRA library is sequenced to associate CRSs with barcodes and is used to infect cells in three replicates. DNA and RNA from the cells are sequenced to determine barcode transcription and CRS activity. b, CRS oligonucleotide. A 200-base CRS (gray) is flanked by PCR adaptor sequences (light green). c, First-round PCR. PCR primers add sequences that are complementary to the vector (black) to the upstream side, as well as minimal promoter (mP, blue) and spacer sequences (yellow) downstream of the CRS oligonucleotide. d, Second-round PCR. Reverse primer adds the barcodes (red-striped section) and GFP complementary sequences (green). e, Plasmid construct. f, Amplification for CRS-barcode association. Primers add P5 (purple) and sample index (gray-striped section) upstream and P7 (pink) downstream. g, Sequencing library structure. h, Sequencing reaction. Paired-end reads specify the CRS sequence, with index read 1 providing the barcode and index read 2 reading the sample index for multiplexing. i, Integrated DNA and expressed RNA in infected cells. j, Amplification for barcode counting. Primers add P5 and sample index upstream and P7 and UMI (brown stripe) downstream. k, Sequencing library structure. l, Sequencing reaction. Paired-end reads give the barcode, index read 1 gives the UMI, and index read 2 provides the sample index for multiplexing. ARE, anti-repressor element; LTR, long terminal repeat; WPRE, Woodchuck hepatitis virus posttranscriptional regulatory element.

**Fig. 2 |. Overview of MPRAflow association utility.**
a, Mandatory inputs (blue), optional flags (orange), output files (green) and utility (red). The program requires .fastq files for the insert, either single-end (SE) or paired-end (PE) reads, and a design file, which is a .fasta file containing the synthesized oligonucleotides. The user can also specify a tab-delimited file that maps CRS names given in the design file to a grouping, such as a control category (e.g., positive or negative control), and a .tsv file of variants in the ordered oligonucleotide pool to be used for a tailored alignment strategy. The program also accepts various parameters for filtering the pairing based on mapping qualities and the number of observed barcodes mapping to the CRS. The program outputs a Python dictionary in pickle format, mapping barcodes to their CRS. b, A violin plot of barcode coverage for each enhancer, grouped by labels provided in the label .tsv file. The violin plot features a kernel density (blue, yellow, and green), showing the underlying distribution of the data, and a boxplot. In the boxplot, the white dot is the median, the box represents the interquartile range (IQR), and the whiskers are 1.5 × IQR. Outliers are represented as points. BC, barcode.
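
As a rough illustration of the pickled barcode-to-CRS dictionary described in this caption, the sketch below round-trips a toy dictionary through `pickle` and tallies barcode coverage per CRS. The flat `{barcode: CRS name}` layout, the barcode strings, the CRS names, and both helper functions are assumptions made for the example; the actual MPRAflow output may be organized differently.

```python
import io
import pickle
from collections import Counter

# Hypothetical stand-in for the association output: barcode -> CRS name.
# The real file's layout is an assumption here.
toy_assoc = {
    "AAACCCGGGTTTAAA": "CRS_001",
    "AAACCCGGGTTTAAC": "CRS_001",
    "CCCGGGTTTAAACCC": "CRS_002",
    "GGGTTTAAACCCGGG": "CRS_001",
}


def barcodes_per_crs(assoc):
    """Count how many distinct barcodes tag each CRS."""
    return Counter(assoc.values())


def filter_by_min_barcodes(assoc, min_barcodes):
    """Keep only barcodes whose CRS is tagged by at least min_barcodes barcodes."""
    counts = barcodes_per_crs(assoc)
    return {bc: crs for bc, crs in assoc.items() if counts[crs] >= min_barcodes}


# Round-trip through pickle in memory instead of touching the file system.
buffer = io.BytesIO()
pickle.dump(toy_assoc, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

coverage = barcodes_per_crs(loaded)
kept = filter_by_min_barcodes(loaded, min_barcodes=2)
print(dict(coverage))
print(sorted(set(kept.values())))
```

With a real run, the in-memory buffer would be replaced by the pickle file written by the association step; only unpickle files from a trusted source.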

**Fig. 3 |. Overview of count utility.**
a, Mandatory inputs (blue), optional flags and outputs (orange), output files (green) and utility (red). The user must specify the directory containing all .fastq files for the RNA and DNA sequencing, the CRS-barcode dictionary from the association utility, a design file (.fasta file containing the synthesized oligonucleotides), and an experimental comma-separated file (CSV) outlining the number of replicates and conditions used. The user can also specify a tab-delimited file with a mapping of CRS names given in the design file and a grouping, such as control category (e.g., positive or negative control), and tune parameters, for example, to specify whether a UMI was used or whether the user would like to generate the input files for MPRAnalyze. b-e, The program will produce normalized activity of each CRS from each replicate, as well as across replicates, along with several visualizations. b, CRS activity normalized by insert and grouped by label determined in the label file. The violin plot features a kernel density, showing the underlying distribution of the data and a boxplot. In the boxplot, the center line is the median, the box represents the interquartile range (IQR), and the whiskers are 1.5 × IQR. Outliers are represented as points. c, Normalized activity of each CRS across replicates, colored by label and represented as a boxplot across replicates, where the box represents the IQR, and the whiskers are 1.5 × IQR. Outliers are represented as points. d, Distribution of observed barcode coverage per CRS in each replicate. The mean number of barcodes tagging each CRS is shown in red. e, Correlation of normalized log₂(RNA/DNA), DNA counts and RNA counts. BC, barcode.
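
The log₂(RNA/DNA) quantity mentioned in this caption can be sketched as follows. This is a minimal illustration, not MPRAflow's actual normalization: the function name `crs_log2_ratio`, the pseudocount of 1, the rule of skipping barcodes with no DNA count, and all toy counts are assumptions chosen to keep the arithmetic easy to follow.

```python
import math
from collections import defaultdict


def crs_log2_ratio(assoc, dna_counts, rna_counts, pseudocount=1.0):
    """Return {CRS: log2((RNA + pc) / (DNA + pc))}, summing counts over barcodes."""
    dna_by_crs = defaultdict(int)
    rna_by_crs = defaultdict(int)
    for barcode, crs in assoc.items():
        dna = dna_counts.get(barcode, 0)
        if dna == 0:
            # Without a DNA count there is nothing to normalize against.
            continue
        dna_by_crs[crs] += dna
        rna_by_crs[crs] += rna_counts.get(barcode, 0)
    return {
        crs: math.log2(
            (rna_by_crs[crs] + pseudocount) / (dna_by_crs[crs] + pseudocount)
        )
        for crs in dna_by_crs
    }


# Hypothetical barcode assignments and counts for one replicate.
assoc = {"BC1": "CRS_A", "BC2": "CRS_A", "BC3": "CRS_B", "BC4": "CRS_B"}
dna_counts = {"BC1": 4, "BC2": 3, "BC3": 3}
rna_counts = {"BC1": 20, "BC2": 11, "BC3": 1, "BC4": 5}

activity = crs_log2_ratio(assoc, dna_counts, rna_counts)
print(activity)  # CRS_A -> 2.0 (32/8), CRS_B -> -1.0 (2/4)
```

A production analysis would typically also account for sequencing depth and replicate structure, which this sketch deliberately leaves out.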

**Fig. 4 |. Overview of saturation mutagenesis utility.**
a, Mandatory inputs (blue), optional flags and outputs (orange), output files (green), and utility (red). The user must specify the directory containing all barcode count files, including DNA and RNA counts, the variant to barcode assignment file, and an experimental comma-separated file outlining the number of replicates and conditions used. The user can also set UMI and P-value thresholds to be used for filtering variants and distinguishing between significant and not-significant variant effects. The program will produce log₂ variant effects, P values and a visual output of correlation, as well as a saturation mutagenesis variant effect plot of the region. b, Correlation between replicates. Here, we show the correlation between three replicates of the *TERT* promoter in a glioblastoma cell line from Kircher et al. 2019 (ref. ). ρ is the Pearson correlation between two samples (model with 1-bp indels). Only variants with ≥10 barcodes are shown. c, Saturation mutagenesis effect plot of the combined model from three replicates of the *TERT* promoter in a glioblastoma cell line from Kircher et al. 2019 (including 1-bp indels). ‘Position’ refers to the variant position of the original target insert. Only variants with ≥10 barcodes are shown. Significance level is P < 1 × 10⁻⁵.
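
The replicate comparison in panel b (Pearson ρ over variants with ≥10 barcodes) can be sketched as below. The variant names, barcode counts, and effect sizes are invented for illustration, and the tuple layout and helper `pearson` are assumptions; the utility's own implementation and file formats are not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical variants: name -> (barcode count, log2 effect rep 1, log2 effect rep 2).
variants = {
    "v1": (12, 0.5, 1.0),
    "v2": (15, -1.0, -2.0),
    "v3": (3, 2.0, -5.0),  # too few barcodes, excluded below
    "v4": (10, 0.0, 0.0),
}

MIN_BARCODES = 10  # threshold stated in the caption

kept = {name: v for name, v in variants.items() if v[0] >= MIN_BARCODES}
rep1 = [v[1] for v in kept.values()]
rep2 = [v[2] for v in kept.values()]

rho = pearson(rep1, rep2)
print(sorted(kept), round(rho, 3))
```

In this toy data the retained replicate-2 effects are exactly twice the replicate-1 effects, so ρ comes out at essentially 1; the poorly covered variant `v3` would otherwise pull the correlation down, which is the motivation for the barcode filter.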
