Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Aug 8;187(16):4408-4425.e23.
doi: 10.1016/j.cell.2024.06.005. Epub 2024 Jun 25.

Multiplexed single-cell characterization of alternative polyadenylation regulators

Affiliations

Multiplexed single-cell characterization of alternative polyadenylation regulators

Madeline H Kowalski et al. Cell. .

Abstract

Most mammalian genes have multiple polyA sites, representing a substantial source of transcript diversity regulated by the cleavage and polyadenylation (CPA) machinery. To better understand how these proteins govern polyA site choice, we introduce CPA-Perturb-seq, a multiplexed perturbation screen dataset of 42 CPA regulators with a 3' scRNA-seq readout that enables transcriptome-wide inference of polyA site usage. We develop a framework to detect perturbation-dependent changes in polyadenylation and characterize modules of co-regulated polyA sites. We find groups of intronic polyA sites regulated by distinct components of the nuclear RNA life cycle, including elongation, splicing, termination, and surveillance. We train and validate a deep neural network (APARENT-Perturb) for tandem polyA site usage, delineating a cis-regulatory code that predicts perturbation response and reveals interactions between regulatory complexes. Our work highlights the potential for multiplexed single-cell perturbation screens to further our understanding of post-transcriptional regulation.

Keywords: Perturb-seq; RNA processing; alternative polyadenylation; cleavage and polyadenylation; post-transcriptional regulation.

PubMed Disclaimer

Conflict of interest statement

Declaration of interests In the past 3 years, R.S. has received compensation from Bristol-Myers Squibb, ImmunAI, Resolve Biosciences, Nanostring, 10x Genomics, Neptune Bio, and the NYC Pandemic Response Lab. R.S., H.-H.W., and Y.H. are cofounders and equity holders of Neptune Bio. A.K. is a scientific cofounder of Ravel Biotechnology; is on the scientific advisory board of PatchBio, SerImmune, AINovo, TensorBio, and OpenTargets; is a consultant with Illumina; and owns shares in DeepGenomics, Immunai, and Freenome. J.L. is an employee of Calico Life Sciences, LLC, as of 11/21/2022. H.-H.W. and Y.H. are employees of Neptune Bio as of 8/1/2023.

Figures

Figure 1:
Figure 1:. Overview of CPA-Perturb-Seq.
(A, Top) Schematic of the experimental workflow used to generate the CPA-Perturb-seq dataset. (Bottom) Schematic of perturbation-dependent changes in either tandem or intronic polyadenylation. (B) Diagram depicting core regulatory complexes that make up and interact with the cleavage and polyadenylation machinery. (C) Read coverage plots depicting the differential use of alternative polyA sites at the CBX3 locus. Each track represents a pseudobulk average of cells, grouped by their perturbation. ENSEMBL gene models and peaks (quantification region) that precede detected polyA sites are shown below. (D) UMAP visualization of HEK293FT cells profiled via CPA-Perturb-Seq. Cells are colored based on the target gene identity, using the same colors as in (C). Visualization was computed based on a linear discriminant analysis (LDA) of transcriptome-wide polyA site counts. See also Figure S1–S3.
Figure 2:
Figure 2:. PolyA-residuals quantify alternative polyadenylation at single-cell resolution.
(A) Average usage of 6,019 proximal polyA sites in NT cells (x-axis) and CSTF3-perturbed cells (y-axis). Only genes with at least two tandem polyA sites are considered. Changes across conditions can reflect either changes in relative polyA site usage, total gene expression, or both. (B-D): Read coverage plots at three loci highlighted in (A). Shading marks the proximal polyA site. (E) Schematic depicting the procedure to calculate polyA-residuals (full description in STAR Methods). (F-H) Violin plots depicting single-cell gene expression levels (left) or single-cell polyA-residuals for the proximal polyA site for NT and CSTF3-perturbed cells. NS (not significant) for RNA comparisons indicate absolute log2FC <0.25 or Bonferroni adjusted p-value >0.05 using Wilcoxon rank-sum test. NS for polyA residual comparisons indicates percent change <0.05 or adjusted p-value >0.05 in differential polyadenylation analysis described in STAR Methods.
Figure 3:
Figure 3:. Characterizing tandem and intronic alternative polyadenylation in CPA-Perturb-Seq
(A, left) Number of genes with changes in either polyA site usage, gene expression, or both after perturbation of each regulator in the HEK293 dataset. (middle) Number of genes with perturbation-driven changes in intronic or tandem polyA site usage. (right) Number of genes with changes in tandem polyA site usage, classified by 3’UTR shortening or 3’UTR lengthening. (B-C) Read coverage plots showing differential usage of intronic sites (boxed) at the ZSCAN9 (B) and EXOSC4 locus (C). (D) Heatmap showing polyA residuals for intronic sites that are uniquely differentially utilized after perturbation of PAF complex members (PAF1/CTR9/CDC73), anti-terminators (SCAF4/SCAF8), PABPN1, and CPSF3L/RPAP2. Each heatmap cell shows the pseudobulk average of cells after grouping by sgRNA identity. (E) Heatmap showing the importance of different features in predicting intronic polyA site usage for each perturbation. Color represents the t-statistic for each covariate from a predictive linear model (STAR Methods), which was also used for hierarchical clustering. (F) Meta-gene plot showing normalized position of intronic polyA sites with significant changes for each regulator. (G) GC content for introns containing polyA sites with significant changes in usage for each regulator. (H) Width for introns containing polyA sites with significant changes in usage for each regulator. (I) Fraction of introns containing polyA sites with significant changes in usage for each regulator, grouped by intron excision speed, as classified by Mukherjee et al. See also Figure S4.
Figure 4:
Figure 4:. Modules of co-regulated polyA sites exhibit functional differences
(A) Pearson correlation matrix depicting the relationships between tandem perturbations in HEK293FT cells. Correlations are calculated using the linear model coefficients learned during differential polyadenylation analysis (STAR Methods). Genes are ordered via hierarchical clustering. (B) Same as (A), but the correlation matrix is generated from K562 polyA residuals. (C) Heatmap showing polyA-residuals for distal peak sites in Module A genes (CSTF and CPSF act in the opposite direction as CPSF6/NUDT21), and Module B genes (CSTF and CPSF act in the same direction as CPSF6/NUDT21). Top 100 polyA sites (ranked by CSTF perturbation), are shown for each module. (D) Schematic diagram of genes belonging to Module A and Module B. (E) Read coverage plots showing polyA site usage of representative genes belonging to Module A (left, CCT6A) and Module B (right, TMEM106C). (F) Density plot showing distal site usage in NT control cells for genes belonging to Module A (left) versus Module B (right). See also Figure S4.
Figure 5.
Figure 5.. A multi-task neural network predicts perturbation responses from RNA sequence.
(A) Schematic of (APARENT-Perturb), an ensemble-based neural network architecture for predicting perturbation responses. Green/blue/red output heads correspond to model predictions for the K perturbation conditions. (B) 10-fold cross-validation performance when predicting distal isoform proportions (top row) or differences in distal isoform proportion with respect to the NT condition (bottom row). (C) Sequence-specific attribution scores for two example perturbations in the KMT5A gene. Attribution scores are displayed after calculating residuals with respect to NT cells. (D) Averaged attribution scores as a function of position, for 10 perturbations. The three top MoDISco motifs are shown for each perturbation (STAR Methods). (E) Heatmap showing averaged attribution scores for each perturbation, as a function of position, for the distal-most site in each gene. (F) Mean attribution scores for CSTF1 perturbation in Module A vs Module B for both proximal sites (left) and distal sites (right). Location of the core hexamer and downstream sequence elements (DSE) are marked with solid and dashed vertical lines, respectively. Plots show the mean attribution score at single base-pair resolution (points), as well as the loess-smoothed trend (lines). (G) Epistasis analysis for dual TGTA motifs, in either GC-rich contexts (red) or AT-rich contexts (blue). The y-axis reflects the effect on predicted NUDT21 perturbation after dual insertion of both motifs, compared to the effect of inserting one motif at a time (STAR Methods). (H) Epistasis analysis of canonical hexamers and GT-rich motifs, based on the RBBP6 perturbation. See also Figure S5.
Figure 6:
Figure 6:. Validating APARENT-Perturb by performing an MPRA in multiple genetic contexts.
(A) Schematic of massively parallel reporter assay (MPRA) used to validate APARENT-Perturb. (B) Proximal polyA site usage (log-odds) for 373 WT loci, as predicted by APARENT-Perturb (x-axis), and measured by the MPRA (y-axis). Predictions are accurate in both NT (top) and CSTF3-perturbed (bottom) samples, and for both native proximal and distal sites. (C) Change in proximal polyA site usage (log-odds ratio) comparing CSTF3 and NT samples, for both sequences predicted by APARENT-Perturb to be responsive to CSTF3 perturbation (n=107, right) vs non-responsive (n=109, left) . ** indicates p-value <0.0001 wilcox test comparing log-odds ratio of responsive to neutral sequences. (D) Gene model of GYG2 3’UTR, based on inferred cleavage sites from CPA-Perturb-seq. Region highlighted in red was inserted into the MPRA construct. See also Figure S6. (E, top) Attribution scores from the APARENT-Perturb CSTF model at the distal site of the GYG2 gene (chrX:2882818), both for the wildtype (WT) sequence (top) and upon making sequence alterations. These included shuffling a predicted CSTF response element to minimize the effect of perturbation (middle), and designing synthetic mutations to maximize CSTF3 response. Colored nucleotides indicate nucleotides that were altered. (bottom) Visualization of MPRA data at the GYG2 locus. Lines represent the fraction of reads which include a polyA sequence (polyA reads), which indicates proximal cleavage, for NT and CST3F-perturbed cells (STAR Methods). (F) Effect of performing the sequence alterations described in (E) at 107 CSTF3-responsive sites, both with (left) and without (right) the canonical CSTF binding motif. The log-odds ratio of CSTF3 compared to NT (y-axis) is shown for improved and shuffled sequence alterations (x-axis) (** indicates p-value <0.0001, * indicates p=0.0001–0.05, NS = not significant). (G) Log-odds ratio comparing proximal polyA site usage in NUDT21 to NT samples for both WT sequences and after shuffling CSTF sequence elements, as in (F) (y-axis, ** indicates p-value <0.0001) (H) Fraction of reads including a polyA sequence (polyA reads), indicating proximal cleavage, for the FAM13C locus in both NUDT21-perturbed and NT cells, for both WT sequence and after shuffling CSTF sequence element. (I) Density of distal polyA site usage for WT Module A sequences (n=47, green) and after inserting CSTF-responsive sequence elements into the proximal polyA site (right, orange). (J) Gene model of the 3’UTR of PGRM1 (a Module B gene) based on inferred cleavage sites from CPA-Perturb-seq. Region highlighted in red was inserted into the MPRA construct. (K) Effect of insertion of a CSTF-responsive sequence element into the construct depicted in (J) on the fraction of polyA reads (y-axis). (L) Effect of inserting NUDT21 motif (TGTA) into neutral sequences (y-axis, log-odds ratio of polyA site usage for insertion in NT cells), when the motif was surrounded by AT-rich (left) and GC-rich (right) flanks, by distal site. (M) APARENT-Perturb ISM scores for sequences with individual or dual insertions of the TTTGTAAT motif at the PIP5K1C locus. (N) Epistasis-odds ratio (y-axis) of inserting TGTA motifs with AT-rich flanks at multiple distances (x-axis) for both NT (grey) and NUDT21 samples (green). (** indicates p-value <0.0001 for one-sided t-test of epistasis odds-ratio = 1).
Figure 7:
Figure 7:. Characterizing heterogeneity in relative polyA site usage in genome-scale Perturb-seq datasets
(A) 3’ UTR shortening preference observed after perturbing tandem regulators in the CPA-Perturb-seq dataset (x-axis), and the GWPS dataset (y-axis). (B) Same as Figure 3B, but for the GWPS dataset. (C) Correlation matrix depicting the relationship between perturbations in the GWPS dataset, as in Figure 4A. Representative genes for each of the six correlated modules are shown on the left. All genes are listed in Figure S8A. (D) Representative polyA sites (n=100) whose usage increased upon perturbation of exosome/PAXT complex members. (E) CPSF3L perturbation (from CPA-Perturb-seq) was most correlated (y-axis) with other members of the integrator complex and spliceosome factors in GWPS data. (F) Representative read coverage plot depicting shared changes in polyA site usage after perturbation of CPSF3L, other integrator complex members, and SMN complex members. (G) Enrichment of CPA-Perturb-Seq intronic signatures (y-axis) in GWPS module-responsive sites. Dot size corresponds to -log10(p-value) from hypergeometric enrichment test, and color corresponds to the average polyA residuals for each signature. (H) MaxEnt scores for splice donor sites (5’) of all intronic polyA sites quantified in our dataset versus those with significantly increased usage in CPSF3L/RPAP2 perturbation (** indicates Wilcoxon p-value <0.0001). See also Figure S7.

Update of

References

    1. Di Giammartino DC, Nishida K, and Manley JL (2011). Mechanisms and consequences of alternative polyadenylation. Mol. Cell 43, 853–866. - PMC - PubMed
    1. Tian B, and Manley JL (2017). Alternative polyadenylation of mRNA precursors. Nat. Rev. Mol. Cell Biol 18, 18–30. - PMC - PubMed
    1. Proudfoot NJ (2011). Ending the message: poly(A) signals then and now. Genes Dev 25, 1770–1782. - PMC - PubMed
    1. Gruber AJ, and Zavolan M (2019). Alternative cleavage and polyadenylation in health and disease. Nat. Rev. Genet 20, 599–614. - PubMed
    1. Tian B, Hu J, Zhang H, and Lutz CS (2005). A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res 33, 201–212. - PMC - PubMed