Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 May;56(5):925-937.
doi: 10.1038/s41588-024-01726-6. Epub 2024 Apr 24.

Joint genotypic and phenotypic outcome modeling improves base editing variant effect quantification

Affiliations

Joint genotypic and phenotypic outcome modeling improves base editing variant effect quantification

Jayoung Ryu et al. Nat Genet. 2024 May.

Abstract

CRISPR base editing screens enable analysis of disease-associated variants at scale; however, variable efficiency and precision confounds the assessment of variant-induced phenotypes. Here, we provide an integrated experimental and computational pipeline that improves estimation of variant effects in base editing screens. We use a reporter construct to measure guide RNA (gRNA) editing outcomes alongside their phenotypic consequences and introduce base editor screen analysis with activity normalization (BEAN), a Bayesian network that uses per-guide editing outcomes provided by the reporter and target site chromatin accessibility to estimate variant impacts. BEAN outperforms existing tools in variant effect quantification. We use BEAN to pinpoint common regulatory variants that alter low-density lipoprotein (LDL) uptake, implicating previously unreported genes. Additionally, through saturation base editing of LDLR, we accurately quantify missense variant pathogenicity that is consistent with measurements in UK Biobank patients and identify underlying structural mechanisms. This work provides a widely applicable approach to improve the power of base editing screens for disease-associated variant characterization.

PubMed Disclaimer

Conflict of interest statement

Competing interests

L.P. has financial interests in Edilytics and SeQure Dx. L.P.’s interests were reviewed and are managed by Massachusetts General Hospital and Partners HealthCare in accordance with their conflict-of-interest policies. The remaining authors declare no competing interests.

Figures

Extended Data Fig. 1 |
Extended Data Fig. 1 |. Base editor editing preference profile and context specificity.
Deamination motif and PAM-dependent editing preference of AID-BE5-SpRY from 7294 gRNAs and AID-BE5-Cas9NG from 7299 gRNAs with more than 9 read counts across any replicates of bulk samples. a) Context specificity of AID-BE5-SpRY are represented as sequence logos. The height of each base represents the relative editing efficiency with each base. b) Mean editing efficiency of AID-BE5-SpRY by protospacer position and PAM sequence. c) Context specificity of AID-BE5-CasNG is represented as sequence logos. The height of each base represents the relative editing efficiency with each base. d) Mean editing efficiency of AID-BE5-CasNG by protospacer position and PAM sequence.
Extended Data Fig. 2 |
Extended Data Fig. 2 |. Nucleotide-level editing comparison of reporter and endogenous locus.
a) Scatterplots comparing of per-nucleotide-level editing efficiencies between the reporter and endogenous target sites. All edits introduced by each of 49 gRNAs across four loci across 3 experimental replicates were plotted. Points are colored by the identity of nucleotide edit and gRNA. b) The same plot colored by gRNA strand. R; Pearson correlation coefficient. n; number of plotted editing rates.
Extended Data Fig. 3 |
Extended Data Fig. 3 |. BEAN plate diagrams.
Plate diagrams of a) BEAN b) BEAN-Reporter, c) BEAN-Uniform. Xb and all parameters with superscript b is not used for benchmark analyses.
Extended Data Fig. 4 |
Extended Data Fig. 4 |. LDL-C GWAS library classification task benchmark.
a) AUPRC plot for classifying positive splicing control variants against negative control variants. Metrics for all 5 replicates are shown as markers and metrics of 15 two-replicate subsamples among the 5 replicates are shown as box plots. Boxplot was plotted as described in the statistical note of the Methods section. b) Precision-Recall curve for classifying all positive control splice sites of against negative controls for all replicates with no failing samples. c) Precision-Recall curve for classifying splice sites of LDLR/MYLIP against negative controls for all replicates with no failing samples. d) Precision-Recall curve for classifying all positive control splice sites of against negative controls for 2-replicate subsample of the data. Mean Precision value for a recall across 15 subsample runs are plotted as solid line. e) Precision-Recall curve for classifying splice sites of LDLR/MYLIP against negative controls for 2-replicate subsample of the data. Mean Precision value for a recall across 15 subsample runs are plotted as solid line.
Extended Data Fig. 5 |
Extended Data Fig. 5 |. Comparison of inferred effect sizes of individually transfected LDL-C GWAS library gRNAs.
Scatterplot and Pearson correlation coefficients (R) of effect size estimates and the log fold change (LFC) of fluorescence signal following individual transfection of 22 gRNAs. R; Spearman correlation coefficient.
Extended Data Fig. 6 |
Extended Data Fig. 6 |. BEAN accurately estimates variant effect confidence from per-variant evidence in input data.
a-c) Scatterplot of 2,182 LDLR tiling library variants comparing a) nnorm and effective edit rates, b) effective edit rates and BEAN σμ, and c) nnorm and BEAN σμ. d) Histogram of effective edit rates of 76 LDLR tiling library variants with UKB LDL-C levels. Quartile bin cutoffs used to categorize variants are shown as dotted lines. e) Scatterplots of BEAN z-scores and statin-adjusted UKB LDL-C measurements for variants in each effective edit rate quartile bin. r and rho shows the Pearson and Sparman correlation coefficients, respectively.
Extended Data Fig. 7 |
Extended Data Fig. 7 |. LDLR tiling library classification task benchmark.
a) AUPRC of classifying Pathogenic/Likely Pathogenic from Benign/Likely Benign variants when using 4 replicates without failing samples and 6 2-replicates combinations among the replicates. Bounds and the center of the boxes are the interquartile ranges. Boxplot was plotted as described in the statistical note of the Methods section. b-e) Precision-recall curve of classifying b, d) Pathogenic/Likely Pathogenic c, e) Pathogenic from Benign/Likely Benign variants. Top panels (b, c) show classification when used 4 replicates without failing samples. Bottom panels (d, e) show when used 6 2-replicates combinations among 4 replicates without failing samples.
Extended Data Fig. 8 |
Extended Data Fig. 8 |. Comparison of functional impact and conservation within conserved LDLR domains.
Repeat domain alignments shown with BEAN z-score for a) LDLR class A repeat domain, b) LDLR class B repeat domain, c) EGF-like domains aligned with the Pfam profile HMM logo by Skylign, where the height of each position show its information content and letter heights show the total height scaled by relative frequencies of the letters in the position. For a), conserved cysteine residue position is highlighted and for b-c), consensus positions from Clustal Omega alignment output are highlighted in grey.
Extended Data Fig. 9 |
Extended Data Fig. 9 |. Expanded LDLR missense variant pathogenicity estimates with FUSE.
a) Scatterplot of all considered UKB variant mean statin-adjusted LDL-C level against imputed BEAN-FUSE score. b) Prediction outcome of unobserved variants with XGBoost model trained on observed UKB variants and mean statin-adjusted LDL levels. c-d) Correlation coefficients and root mean squared error (RMSE) for predicted and true UKB mean statin-adjusted LDL-C level for XGBoost model with FUSE score, PhastCons PhyloP conservation score, and both as the input in predicting LDL-C levels. c) Boxplot of metrics for prediction of observed variants with 10-fold cross validation (n = 10) d) Barplot of metrics for prediction of unobserved variants with model trained on observed variants (n = 1). r, ρ; Pearson, Spearman correlation coefficient, RMSE; Root mean squared error.
Extended Data Fig. 10 |
Extended Data Fig. 10 |. Local atomic interaction in wild type and mutated structure for selected variants in LDLR class B repeat domain.
ak, Residues with interaction with the variant position are shown. Variant positions and interacting residues are colored by the reference amino acid and atomic elements (O: red, N: blue, S: yellow). Ref AA; reference amino acid.
Fig. 1 |
Fig. 1 |. Activity-normalized base editing screening pipeline.
a, Schematic of the activity-normalized base editing screening process and analysis by BEAN. A library of gRNA species, each paired with a reporter sequence encompassing its genomic target sequence, is cloned into a lentiviral base editor expression vector. Lentiviral transduction is performed in HepG2 cells, followed by sorting four populations by flow cytometry based on fluorescent LDL-C (BODIPY-LDL) uptake. The gRNA and reporter sequences are read out by paired-end NGS to obtain gRNA counts and reporter editing outcomes in each flow cytometric bin. BEAN models the reporter editing frequency and allelic outcomes and gRNA enrichments among flow cytometric bins using BEAN to estimate variant phenotypic effect sizes. b, Schematic of the LDL-C variant library gRNA design for selected GWAS candidate variants with a Manhattan plot showing variant P values from a recent GWAS study (P values are for the logistic regression coefficient being nonzero; see ref. for details). gRNA species tile the variant at five positions with maximal editing efficiency (protospacer positions 4–8). M, million; chr, chromosome. c, gRNA coverage of the LDLR tiling library across the LDLR coding sequence along with 5′ and 3′ untranslated regions and several regulatory regions. Coordinates shown are of hg19. The left diagram was created with https://www.biorender.com. d, Average editing efficiency of ABE8e–SpRY by protospacer position and PAM sequence. e, Adjacent nucleotide specificity of ABE8e–SpRY editing represented as a sequence logo from 7,320 gRNA species; the height of each base represents the relative frequency of observing each base given an edit at position 0. f, Scatterplots of per-nucleotide editing efficiencies in the reporter and endogenous target sites. All edits introduced by each of 49 gRNA species across four loci and three experimental replicates are plotted separately. The number of nucleotide edits across the three replicates is reported in each panel as n. The accessibility of the four loci as measured by ATAC-seq (assay for transposase-accessible chromatin with sequencing) signal in HepG2 cells is shown at the bottom, and scatterplot markers are colored by the accessibility of each nucleotide. Pearson correlation coefficients are shown as r.
Fig. 2 |
Fig. 2 |. BEAN models variant effects from activity-normalized base editing screens.
Simplified schematic of the BEAN Bayesian network that models input reporter editing outcomes and gRNA counts. The Bayesian network model recapitulates the data generation process starting from a variant-level phenotype Yv and models per-gRNA phenotypes (Yg) as a Gaussian mixture distribution of edited and unedited (wild-type) allele phenotypes. The weights of the mixture components are modeled to generate reporter editing outcomes. gRNA abundance in each sorting bin is then calculated by discretizing the gRNA phenotype based on the experimental design into phenotypic quantiles and is modeled to generate the observed gRNA counts using an overdispersed multivariate count distribution (Methods). BEAN outputs the parameters of the posterior distribution of the mean phenotypic shift as a Gaussian distribution with mean μ^μ (effect size), along with negative control adjusted z scores and CIs, where D is the input data. WT, wild type.
Fig. 3 |
Fig. 3 |. BEAN improves variant impact estimation from the LDL-C GWAS library screen.
a, Ridge plot of BEAN z-score distributions of positive controls, negative controls and test variants. b, AUPRC for classifying LDLR and MYLIP splicing variants versus negative controls. Metrics for all five replicates are shown as crosses, and metrics of 15 two-replicate subsamples among the five replicates are shown as box plots. Metrics are shown for BEAN, MAGeCK-RRA, MAGeCK-MLE, CRISPRBetaBinomial (CB2), efficiency-corrected log fold change (EC-LFC) and log fold change (LFC). c, Scatterplot of variant effect size and significance estimated by BEAN. Labels show rsIDs of selected variants and dbSNP gene annotations and a manual annotation for the APOB enhancer in parentheses. d, Spearman correlation coefficient between BEAN effect size and log fold change of LDL-C uptake following individual testing of 26 gRNA species. Metrics for all five replicates are shown as crosses, and metrics of 15 two-replicate subsamples among the five replicates are shown as box plots. e, Scatterplot of BEAN effect size and log fold change of LDL-C uptake following individual testing of 26 gRNA species. The Spearman correlation coefficient is denoted as ρ. For statistics in box plots, please see Statistical note for plots.
Fig. 4 |
Fig. 4 |. Functional characterization of LDL-C GWAS variants.
a, Schematic of potential variant mechanisms and figure panels showing data from each mechanistic experiment. b, Schematic of allelic representation analysis to identify variants impacting accessibility. Differential representation of alleles in gDNA and ATAC-seq reflects differential accessibility induced by the base edit or heterozygous reference alleles. RT–qPCR, quantitative reverse transcription PCR. c, Change in ATAC-seq enrichment from the pooled ATAC-seq experiment. Enrichment values are plotted as bars, and 95% confidence intervals are shown as error bars. ‘Edited variant’ denotes enrichment by the base edit, and ‘allele’ denotes enrichment by either of the heterozygous alleles, when available, represented as the enrichment of major (Maj) to minor (Min) alleles. Variants for which the base edit is conducted from the minor allele are shown as red in the color bar. FWER with Benjamini–Hochberg multiple correction is displayed as text for each enrichment value where FWER < 0.1. Three experimental replicates across two conditions were used to obtain the enrichment (Methods). d, Genomic tracks for three selected variants. Multiple transcript variants of RGS19 and OPRL1 are shown in the middle. DNaseIHS, DNase 1 hypersensitivity; GTEx, Genotype–Tissue Expression. e, Fraction of ZNF329 minor allele haplotype reads in gDNA and complementary DNA from untreated HepG2 cells and HepG2 cells with rs35081008Min>Maj base editing measured using Sanger sequencing. Rep, replicate. f, Change in gene expression following base editing of three selected variants from minor to major alleles. P values of one-sample Student’s t-test of log fold change versus a mean of 0 that are smaller than 0.05 are shown above each bar. Error bars show the 95% confidence interval of bootstrapping. n = 7 experimental replicates for RGS19; n = 8 for the rest. g, Change in cellular LDL-C uptake following CRISPRa or CRISPRi of proximal genes for three selected variants. P values of one-sample Student’s t-test of log fold change versus a mean of 0 that are smaller than 0.05 are shown above each bar. Error bars show the 95% confidence interval of bootstrapping. n = 12 experimental replicates for CRISPRa, n = 6 for CRISPRi. h, Summary schematic of characterization results. For statistics in box plots, please see Statistical note for plots.
Fig. 5 |
Fig. 5 |. Dissection of LDLR variant effects through BEAN modeling of a saturation tiled base editing screen.
a, BEAN model for coding sequence tiling screens. Coding region editing efficiencies are aggregated into amino acid-level mutations. Phenotypes of gRNA species with multi-allelic outcomes are modeled as the Gaussian mixture of allelic phenotypes (Methods). b, Scatterplot of estimated variant effect sizes and z scores, colored by ClinVar annotation. Point size corresponds to effective editing rate. Splice site variants are outlined in red. Selected deleterious variants without ClinVar P/LP annotations are labeled; labels are colored by protein domain. c, Ridge plot of BEAN z-score distributions of positive splicing control, synonymous and missense variants. d, Ridge plot of BEAN z-score distributions of ClinVar variants by clinical significance annotations. e, AUPRC of classifying ClinVar P/LP versus B/LB variants. Crosses show the metrics using all four replicates with no failing samples. Box plot shows the metrics of six two-replicate combinations of the replicates. f, LDLR protein domain structure adopted from Oommen et al.. EGF, epidermal growth factor. g, BEAN z scores for variants in the seven LDLR class A repeat domains aligned with the Pfam profile HMM logo by Skylign. Highly conserved cysteines are highlighted in gray. Ref AA, reference amino acid. h, Scatterplot of LDLR class A repeat missense variant BEAN z scores and ΔPfam profile HMM scores. Higher ΔPfam scores correspond to substitution from highly conserved to rarely observed amino acids (r = 0.569, ρ = 0.431). Alt, alternate; ref, reference. i, Comparison of mean statin-adjusted LDL-C levels and BEAN z scores for variants observed in the UKB and base editing (r = −0.454, ρ = −0.400). j, Comparison of mean statin-adjusted LDL-C levels and BEAN-FUSE scores for variants observed in the UKB and base editing (r = 0.512, ρ = 0.497). k, Predicted LDL-C levels of observed missense variants using BEAN-FUSE and PhyloP scores with tenfold cross-validation, compared with mean statin-adjusted LDL-C levels in the UKB (r = 0.608, ρ = 0.475). l, Box plots of BEAN-FUSE functional scores for UKB individuals with variants observed in our base editing screen with or without CAD. m, Box plots of statin-adjusted LDL-C levels of UKB individuals with variants observed in our base editing screen with or without CAD. The P value shown was determined by two-sided Wilcoxon rank-sum test. r, ρ, Pearson, Spearman correlation coefficient; RMSE, root-mean-squared error. For statistics in box plots, please see Statistical note for plots.
Fig. 6 |
Fig. 6 |. Deleterious variants in LDLR class B repeats weaken hydrophobic interactions.
ad, Box plots of predicted ΔΔG (a), change in the number of hydrophobic interactions in the mutated structure (b), RSA (c) and residue depth (d) of 26 significant variants (z < −1.96) and the rest of the 259 variants (z > −1.96) observed in LDLR class B repeats. P values of two-sided Wilcoxon rank-sum test are denoted. Mut, mutated. e, Predicted protein structure of the LDLR class B repeat domain. Conserved interactions involving tyrosine of which mutation showed significant BEAN scores. Simplified interaction types and distance flags as annotated by Arpeggio are shown in the legend. f, BEAN z scores of positions with conserved hydrophobic residues are shown along with the LDLR class B repeat Pfam HMM logo. gi, Local atomic interaction in wild-type and mutated structures for ClinVar conflicting variants or VUS L426P, M652T and Y532C. Residues in the variant positions are colored by the reference amino acids. Residues that interact with the variant position are shown. Variant positions and interacting residues are colored by elements (O, red; N, blue; S, yellow). j, Contour plot of BEAN z scores against predicted ΔΔG values for 872 missense variants. Positions with distinct observed missense variants that disrupt and conserve hydrophobic side chains are connected by dashed lines. For statistics in box plots, please see Statistical note for plots.

Update of

References

    1. Tam V et al. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet 20, 467–484 (2019). - PubMed
    1. Bycroft C et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). - PMC - PubMed
    1. Gasperini M, Starita L & Shendure J The power of multiplexed functional analysis of genetic variants. Nat. Protoc 11, 1782–1787 (2016). - PMC - PubMed
    1. Araya CL & Fowler DM Deep mutational scanning: assessing protein function on a massive scale. Trends Biotechnol 29, 435–442 (2011). - PMC - PubMed
    1. Myers RM, Tilly K & Maniatis T Fine structure genetic analysis of a β-globin promoter. Science 232, 613–618 (1986). - PubMed