Nat Protoc. 2020 Aug;15(8):2387-2412.
doi: 10.1038/s41596-020-0333-5. Epub 2020 Jul 8.

lentiMPRA and MPRAflow for high-throughput functional characterization of gene regulatory elements

M Grace Gordon et al. Nat Protoc. 2020 Aug.

Erratum in

Abstract

Massively parallel reporter assays (MPRAs) can simultaneously measure the function of thousands of candidate regulatory sequences (CRSs) in a quantitative manner. In this method, CRSs are cloned upstream of a minimal promoter and reporter gene, alongside a unique barcode, and introduced into cells. If the CRS is a functional regulatory element, it will lead to the transcription of the barcode sequence, which is measured via RNA sequencing and normalized for cellular integration via DNA sequencing of the barcode. This technology has been used to test thousands of sequences and their variants for regulatory activity, to decipher the regulatory code and its evolution, and to develop genetic switches. Lentivirus-based MPRA (lentiMPRA) produces 'in-genome' readouts and enables the use of this technique in hard-to-transfect cells. Here, we provide a detailed protocol for lentiMPRA, along with a user-friendly Nextflow-based computational pipeline, MPRAflow, for quantifying CRS activity from different MPRA designs. The lentiMPRA protocol takes ~2 months, which includes sequencing turnaround time and data processing with MPRAflow.

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

Extended Data Fig. 1 | Sequence scheme of lentiMPRA.
a, Synthesized CRS oligo sequence. b, Primers and their binding in 1st and 2nd round PCR for library amplification. c, Recombination and plasmid library sequence. d, Primers and their binding in library amplification and sequencing for CRS-barcode association. e, Primers and their binding in reverse transcription, library amplification and sequencing for barcode counting.
Extended Data Fig. 2 | Time complexity study of MPRAflow.
a, The Association Utility run time scales with the number of reads when holding the size of FASTQ chunks at 2M reads. Because this step performs an alignment, the memory requirements are not trivial: approximately 1 GB of memory per 3M reads. b, The Count Utility run time scales with the number of reads divided by the number of experiments running in parallel. This step requires little memory: 500M reads can be processed in <0.5 GB.
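The figures quoted in this caption can be turned into a rough resource estimate. The sketch below is only an illustration: it assumes that memory grows linearly with read count at the stated rate of about 1 GB per 3M reads, which the caption does not guarantee, and the function names are hypothetical rather than part of MPRAflow.

```python
import math

# Figures taken from the caption: ~1 GB of memory per 3M reads,
# with FASTQ chunks held at 2M reads each.
READS_PER_GB = 3_000_000
CHUNK_SIZE = 2_000_000


def estimate_association_memory_gb(n_reads: int) -> float:
    """Linear extrapolation (an assumption) of memory needed for n_reads."""
    return n_reads / READS_PER_GB


def n_fastq_chunks(n_reads: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunks needed to hold n_reads at the given chunk size."""
    return math.ceil(n_reads / chunk_size)


if __name__ == "__main__":
    reads = 30_000_000
    print(f"{reads} reads: ~{estimate_association_memory_gb(reads):.1f} GB, "
          f"{n_fastq_chunks(reads)} chunks")
```

An estimate like this is best treated as a starting point for a resource request, to be checked against the memory actually used on a small test run.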
Fig. 1 | Schematics of lentiMPRA.
a, Summary of lentiMPRA and MPRAflow. The lentiMPRA library is sequenced to associate CRSs and barcodes and to infect cells, using three replicates. DNA and RNA from the cells are sequenced to determine barcode transcription and CRS activity. b, CRS oligonucleotide. A 200-base CRS (gray) is flanked by PCR adaptor sequences (light green). c, First-round PCR. PCR primers add sequences that are complementary to the vector (black) to the upstream side, as well as minimal promoter (mP, blue) and spacer sequences (yellow) downstream of the CRS oligonucleotide. d, Second-round PCR. Reverse primer adds the barcodes (red-striped section) and GFP complementary sequences (green). e, Plasmid construct. f, Amplification for CRS-barcode association. Primers add P5 (purple) and sample index (gray-striped section) upstream and P7 (pink) downstream. g, Sequencing library structure. h, Sequencing reaction. Paired-end reads specify the CRS sequence, with index read 1 providing the barcode and index read 2 reading the sample index for multiplexing. i, Integrated DNA and expressed RNA in infected cells. j, Amplification for barcode counting. Primers add P5 and sample index upstream and P7 and UMI (brown stripe) downstream. k, Sequencing library structure. l, Sequencing reaction. Paired-end reads give barcode, index read 1 gives UMI, and index read 2 provides sample index for multiplexing. ARE, anti-repressor element; LTR, long terminal repeat; WPRE, Woodchuck hepatitis virus posttranscriptional regulatory element.
Fig. 2 | Overview of MPRAflow association utility.
a, Mandatory inputs (blue), optional flags (orange), output files (green) and utility (red). The program requires .fastq files for the insert, either single-end (SE) or paired-end (PE) reads, and a design file, which is a .fasta file containing the synthesized oligonucleotides. The user can also specify a tab-delimited file with a mapping of CRS names given in the design file and a grouping, such as a control category (e.g., positive or negative control), and a .tsv file of variants in the ordered oligonucleotide pool to be used for a tailored alignment strategy. The program also accepts various parameters for filtering the pairing based on mapping qualities and the number of observed barcodes mapping to the CRS. The program outputs a Python dictionary in pickle format, mapping barcodes to their CRS. b, A violin plot of barcode coverage for each enhancer, grouped by labels provided in the label .tsv file. The violin plot features a kernel density (blue, yellow, and green), showing the underlying distribution of the data, and a boxplot. In the boxplot, the white dot is the median, the box represents the interquartile range (IQR), and the whiskers are 1.5 × IQR. Outliers are represented as points. BC, barcode.
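Because the association output is described as a pickled Python dictionary mapping barcodes to their CRS, it can be inspected with the standard library. The sketch below assumes a flat layout of the form {barcode: CRS name}; the exact structure written by MPRAflow may differ, and the function names are hypothetical.

```python
import pickle
from collections import Counter


def load_association(path):
    """Load a pickled barcode-to-CRS dictionary from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def barcodes_per_crs(barcode_to_crs):
    """Count how many distinct barcodes tag each CRS.

    Assumes keys are barcode strings and values are CRS names.
    """
    return Counter(barcode_to_crs.values())


if __name__ == "__main__":
    # Toy dictionary standing in for a real association file.
    toy = {"ACGT": "crs_A", "TTGA": "crs_A", "GGCC": "crs_B"}
    for crs, n in sorted(barcodes_per_crs(toy).items()):
        print(crs, n)
```

Per-CRS counts of this kind are the quantity summarized by the barcode coverage violin plot in panel b. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.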
Fig. 3 | Overview of count utility.
a, Mandatory inputs (blue), optional flags and outputs (orange), output files (green) and utility (red). The user must specify the directory containing all .fastq files for the RNA and DNA sequencing, the CRS-barcode dictionary from the association utility, a design file (.fasta file containing the synthesized oligonucleotides), and an experimental comma-separated file (CSV) outlining the number of replicates and conditions used. The user can also specify a tab-delimited file with a mapping of CRS names given in the design file and a grouping, such as control category (e.g., positive or negative control), and tune parameters, for example, to specify whether a UMI was used or whether the user would like to generate the input files for MPRAnalyze. b-e, The program will produce normalized activity of each CRS from each replicate, as well as across replicates, along with several visualizations. b, CRS activity normalized by insert and grouped by label determined in the label file. The violin plot features a kernel density, showing the underlying distribution of the data and a boxplot. In the boxplot, the center line is the median, the box represents the interquartile range (IQR), and the whiskers are 1.5 × IQR. Outliers are represented as points. c, Normalized activity of each CRS across replicates, colored by label and represented as a boxplot across replicates, where the box represents the IQR, and the whiskers are 1.5 × IQR. Outliers are represented as points. d, Distribution of observed barcode coverage per CRS in each replicate. The mean number of barcodes tagging each CRS is shown in red. e, Correlation of normalized log2(RNA/DNA), DNA counts and RNA counts. BC, barcode.
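The log2(RNA/DNA) activity referred to in this caption can be illustrated with a small calculation. The sketch below is a simplified illustration, not MPRAflow's implementation: it assumes that barcode counts are summed per CRS before taking the ratio, adds a pseudocount to avoid division by zero, and omits any sequencing-depth normalization. All names are hypothetical.

```python
import math
from collections import defaultdict


def crs_activity(barcode_to_crs, dna_counts, rna_counts, pseudocount=1.0):
    """Return log2((RNA + pseudocount) / (DNA + pseudocount)) per CRS.

    Barcodes with no DNA reads are skipped, since their integration
    cannot be confirmed. Aggregation by summing is an assumption.
    """
    dna_total = defaultdict(float)
    rna_total = defaultdict(float)
    for barcode, crs in barcode_to_crs.items():
        dna = dna_counts.get(barcode, 0)
        if dna == 0:
            continue
        dna_total[crs] += dna
        rna_total[crs] += rna_counts.get(barcode, 0)
    return {
        crs: math.log2((rna_total[crs] + pseudocount) /
                       (dna_total[crs] + pseudocount))
        for crs in dna_total
    }


if __name__ == "__main__":
    assoc = {"AAA": "crs1", "AAC": "crs1", "GGG": "crs2"}
    dna = {"AAA": 3, "AAC": 4, "GGG": 1}
    rna = {"AAA": 10, "AAC": 5, "GGG": 0}
    print(crs_activity(assoc, dna, rna))
```

In the toy data, crs1 has 15 RNA and 7 DNA reads, giving log2(16/8) = 1, while crs2 has no RNA reads and gives log2(1/2) = -1.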
Fig. 4 | Overview of saturation mutagenesis utility.
a, Mandatory inputs (blue), optional flags and outputs (orange), output files (green), and utility (red). The user must specify the directory containing all barcode count files, including DNA and RNA counts, the variant to barcode assignment file, and an experimental comma-separated file outlining the number of replicates and conditions used. The user can also set UMI and P-value thresholds to be used for filtering variants and distinguishing between significant and not-significant variant effects. The program will produce log2 variant effects, P values and a visual output of correlation, as well as a saturation mutagenesis variant effect plot of the region. b, Correlation between replicates. Here, we show the correlation between three replicates of the TERT promoter in a glioblastoma cell line from Kircher et al. 2019 (ref. ). ρ is the Pearson correlation between two samples (model with 1-bp indels). Only variants with ≥10 barcodes are shown. c, Saturation mutagenesis effect plot of the combined model from three replicates of the TERT promoter in a glioblastoma cell line from Kircher et al. 2019 (including 1-bp indels). ‘Position’ refers to the variant position of the original target insert. Only variants with ≥10 barcodes are shown. Significance level is P < 1 × 10⁻⁵.
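The filtering and correlation steps named in this caption can be sketched in a few lines. This is an illustration under stated assumptions, not the utility's code: the record fields ("barcodes", "p_value") are hypothetical, and the thresholds are the ones quoted in the caption (at least 10 barcodes, P < 1 × 10⁻⁵).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def filter_variants(variants, min_barcodes=10, p_threshold=1e-5):
    """Keep variants with enough barcodes and flag the significant ones.

    Returns a list of (variant, is_significant) pairs.
    """
    return [
        (v, v["p_value"] < p_threshold)
        for v in variants
        if v["barcodes"] >= min_barcodes
    ]


if __name__ == "__main__":
    toy = [
        {"name": "v1", "barcodes": 25, "p_value": 1e-8},
        {"name": "v2", "barcodes": 4, "p_value": 1e-9},
        {"name": "v3", "barcodes": 12, "p_value": 0.2},
    ]
    for variant, significant in filter_variants(toy):
        print(variant["name"], significant)
    print(pearson([0.1, 0.5, 0.9], [0.2, 0.6, 1.1]))
```

Applying the barcode filter before computing the correlation mirrors the caption's note that only variants with ≥10 barcodes are shown.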
