Identifying genetic variants that influence the abundance of cell states in single-cell data

Laurie Rumker^{1

2

3

4

5}, Saori Sakaue^{1

2

3

5}, Yakir Reshef^{1

2

3

5}, Joyce B Kang^{1

2

3

4

5}, Seyhan Yazar^{6

7}, Jose Alquicira-Hernandez^{1

2

3

5

6}, Cristian Valencia^{1

2

3

5}, Kaitlyn A Lagattuta^{1

2

3

4

5}, Annelise Mah-Som², Aparna Nathan^{1

2

3

4

5}, Joseph E Powell^{6

7}, Po-Ru Loh^{1

2

5}, Soumya Raychaudhuri^{8

9

10

11

12}

Affiliations

¹ Center for Data Sciences, Brigham and Women's Hospital, Boston, MA, USA.
² Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.
³ Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.
⁴ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
⁵ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁶ Translational Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia.
⁷ UNSW Cellular Genomics Futures Institute, University of New South Wales, Sydney, New South Wales, Australia.
⁸ Center for Data Sciences, Brigham and Women's Hospital, Boston, MA, USA. soumya@broadinstitute.org.
⁹ Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA. soumya@broadinstitute.org.
¹⁰ Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA. soumya@broadinstitute.org.
¹¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. soumya@broadinstitute.org.
¹² Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. soumya@broadinstitute.org.

PMID: 39327486
PMCID: PMC12140807
DOI: 10.1038/s41588-024-01909-1

Identifying genetic variants that influence the abundance of cell states in single-cell data

Laurie Rumker et al. Nat Genet. 2024 Oct.

. 2024 Oct;56(10):2068-2077.

doi: 10.1038/s41588-024-01909-1. Epub 2024 Sep 26.

Authors

Affiliations

¹ Center for Data Sciences, Brigham and Women's Hospital, Boston, MA, USA.
² Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.
³ Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.
⁴ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
⁵ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁶ Translational Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia.
⁷ UNSW Cellular Genomics Futures Institute, University of New South Wales, Sydney, New South Wales, Australia.
⁸ Center for Data Sciences, Brigham and Women's Hospital, Boston, MA, USA. soumya@broadinstitute.org.
⁹ Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA. soumya@broadinstitute.org.
¹⁰ Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA. soumya@broadinstitute.org.
¹¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. soumya@broadinstitute.org.
¹² Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. soumya@broadinstitute.org.

PMID: 39327486
PMCID: PMC12140807
DOI: 10.1038/s41588-024-01909-1

Abstract

Disease risk alleles influence the composition of cells present in the body, but modeling genetic effects on the cell states revealed by single-cell profiling is difficult because variant-associated states may reflect diverse combinations of the profiled cell features that are challenging to predefine. We introduce Genotype-Neighborhood Associations (GeNA), a statistical tool to identify cell-state abundance quantitative trait loci (csaQTLs) in high-dimensional single-cell datasets. Instead of testing associations to predefined cell states, GeNA flexibly identifies the cell states whose abundance is most associated with genetic variants. In a genome-wide survey of single-cell RNA sequencing peripheral blood profiling from 969 individuals, GeNA identifies five independent loci associated with shifts in the relative abundance of immune cell states. For example, rs3003-T (P = 1.96 × 10^-11) associates with increased abundance of natural killer cells expressing tumor necrosis factor response programs. This csaQTL colocalizes with increased risk for psoriasis, an autoimmune disease that responds to anti-tumor necrosis factor treatments. Flexibly characterizing csaQTLs for granular cell states may help illuminate how genetic background alters cellular composition to confer disease risk.

PubMed Disclaimer

Figures

**Extended Data Fig. 1 ∣. Schematic representation of our approach to project a neighborhood-based phenotype into an independent dataset for testing of association replication.**
We use a published reference mapping algorithm, Symphony, to project each cell from the replication dataset (blue labels) into the embedding used for construction of the nearest neighbor graph from the discovery dataset (orange labels). For each replication dataset cell, we store its distance to the 15 nearest discovery dataset cells; these represent the seed weights of this replication dataset cell in the discovery dataset neighborhoods, of which there is one per discovery dataset cell. We use diffusion in the nearest neighbor graph to obtain from these seed weights the fractional membership of each replication dataset cell within all discovery dataset neighborhoods. For each replication dataset sample, the combination of neighborhood memberships across all cells in the sample yields the fractional abundance of that sample across discovery dataset neighborhoods. Row-wise stacking these per-sample vectors into a matrix produces an estimated Neighborhood Abundance Matrix (NAM) containing the distribution of each replication dataset sample across discovery dataset neighborhoods. We can then use the stored products of the discovery dataset NAM SVD to obtain loadings for each replication dataset sample on the discovery dataset NAM-PCs, as shown. Combining the replication dataset sample loadings on the discovery dataset NAM-PCs with the fitted coefficients that define the phenotype in the discovery dataset produces an estimated phenotype value per replication dataset sample, which can be used to test for association to the allele of interest (or case-control status), controlling for relevant covariates.

**Extended Data Fig. 2 ∣. Power to detect associations between simulated genotype values and real single-cell traits.**
For 94 real cell-state abundance traits that vary across individuals in the OneK1K dataset, we defined simulated genotype values per individual to create true associations to these traits. By including random noise in the simulated genotypes we can use these data to quantify the fraction of true associations detected by GeNA (power) across a spectrum of noise levels. At each noise level, we show the mean and standard error of statistical power across traits for a given p-value threshold. A dashed line is shown at y = 0.05.

**Extended Data Fig. 3 ∣. Illustration of 14 real cell-state abundance traits used in our non-null simulated GWAS for T cells.**
For each trait in Supplementary Fig. 9, we plot the true cell-level phenotype for which we simulated associated genotypes with varying levels of noise. Each UMAP includes one dot per T cell in the OneK1K dataset. Cells that do not affect the trait are colored grey. For example, for the “CD8 Naive Program 1” trait, we used a gene expression program that varies substantially across naive CD8 + T cells. We quantified the usage of that program across all naive CD8 + T cells and defined the trait value per individual in the dataset as the mean use of that gene expression program across all naive CD8 + T cells in that individual’s sample. Therefore, for the “CD8 Naive Program 1” trait we color all cells that are not naive CD8 + T cells grey because they do not influence the trait and we color each naive CD8 + T cell by its use of the gene expression program. Cells with greater use of the program are colored deeper red and cells with less use of the program are colored deeper blue.

**Extended Data Fig. 4 ∣. Characterization of the csaQTL at 15q25.1.**
(a) Boxplot of sample-level phenotype values for each individual, organized by genotype at the lead SNP. (N: C/C 297, C/T 194, T/T 32) (Bold line: Q2. Box: Q1-Q3. Whisker: furthest observation within 1.5xIQR of the box.) We also show the GeNA p-value. (b) UMAP of myeloid cells colored by neighborhood-level phenotype value (that is, correlation between cell abundance and dose of alternative allele per neighborhood). (c) Violin plot of neighborhood-level phenotype value distribution within CD14+ monocytes, CD16+ monocytes and dendritic cells. (d) Heatmap of expression across neighborhoods for genes with strong correlations in expression to the csaQTL neighborhood-level phenotype. Neighborhoods are arrayed along the x-axis by phenotype value. (e) UMAP of myeloid cells colored by cell type assignment to the CD16+ monocyte cluster. We also show the Pearson’s r value between neighborhood-level phenotype values and a binary encoding of CD16+ monocyte cluster membership per cell. (f) Boxplot of cluster-based CD16+ monocyte % myeloid cells trait value per OneK1K donor by genotype. (Box and whiskers defined as in subplot a). The csaQTL lead SNP explains 6.5% of variance in this phenotype. (g) Locus zoom plot with one marker per tested SNP, genomic position along the x-axis, and GeNA p-value on the y-axis. Each SNP marker is colored by LD value relative to the lead SNP. The lead SNP is labeled with a green diamond. The *BCL2A1* eQTL lead SNP and primary sclerosing cholangitis risk lead SNP are labeled with purple triangles. (h) Diagram of genotypes for the csaQTL lead SNP and colocalizing associations to molecular, tissue and organism-level traits at this locus.

**Extended Data Fig. 5 ∣. Comparison of effects captured by NAM-PCs to published csaQTL associations, using approximation of flow cytometry phenotypes in scRNA-seq data.**
Z-scores for published SNP associations to specific cell-state abundance phenotypes quantified using flow cytometry by Orrù *et al*. are shown on the x-axis. For each SNP-trait pair, a corresponding Z-score is shown on the y-axis reflecting an association test in the OneK1K dataset between genotype and the best approximation that can be captured by NAM-PCs of a cluster-based estimate of that phenotype in the OneK1K dataset (Supplementary Methods).

**Extended Data Fig. 6 ∣. GeNA’s statistical power increases linearly with the number of samples included in the single-cell dataset.**
We downsampled the OneK1K cohort individuals at random to 80%, 60%, 40% or 20% of the total donor count and repeated our power analysis simulation for each downsampled dataset. Here we plot statistical power by dataset size for simulated genotypes that explain 6% or 12% of variance in the associated cell-state abundance shift trait. At each cohort size, we show the mean and standard error of GeNA’s statistical power for simulated SNPs that explain 6% or 12% of phenotypic variance.

**Fig. 1 ∣. Method schematic.**
a, Consider a variant associated with disease risk. Intermediate traits that may also associate with the variant and even mediate the genetic risk include well-studied molecular traits (for example, transcript or protein abundance) as well as changes in the abundance of cells with varying character and function, illustrated here by variation in shape and color. Previous genetic studies of cell-state abundance traits have quantified target cell states (for example, triangle cell type) using flow cytometry. High-dimensional profiling may reveal genetically associated variation in the abundance of cell states researchers may not anticipate or cannot flow-sort (for example, green character within the circle cell type). Detection of such associations requires granular information about cell variation and a flexible method for detecting variant-associated cell states. b, For a given single-cell dataset, we use the landscape of cells observed for each sample to compute a granular distribution of fractional cellular abundance across the total cell-state space, conceptually illustrated here in two dimensions. We illustrate a single axis of compositional variation across the four samples shown. c, The NAM stores the fractional abundance of cells from each sample in each neighborhood. Sample H is highlighted as an example. PCA of this object yields sample and neighborhood loading information on NAM-PCs. We illustrate with NAM-PC1 how these loadings would reflect the compositional axis illustrated in b. We show another possible component (NAM-PC $k$ ) to illustrate that NAM-PCs can capture covariation across transcriptionally distant regions. d, GeNA uses a test statistic $Y$ , which follows a chi-squared distribution with $k$ degrees of freedom, to detect an association between allele dose for a given SNP and any systematic change in tissue cellular composition captured by NAM-PC1- $k$ (Methods). e, We illustrate how the cell-state abundance shift shown in a might be revealed using GeNA.

**Fig. 2 ∣. csaQTLs detected in the OneK1K dataset.**
a, Superimposed Manhattan plots for csaQTL GWASs among NK cell states and among myeloid cell states. A genome-wide significance threshold of $P < 5 \times 10^{- 8}$ is indicated by a dashed line. SNPs with $P$ values less than this threshold are colored according to their source GWAS: NK cells (green) or myeloid (orange). Another dashed line indicates a $P < 1 \times 10^{- 6}$ threshold for suggestive associations. b–f, Cell abundance correlation per neighborhood to dose of alternative allele is shown in uniform manifold approximation and projections (UMAPs) for each of the five genome-wide significant loci: NK csaQTL 2q13 (b), NK csaQTL 11q24.3 (c), NK csaQTL 12p13.2 (d), myeloid csaQTL 15q25.1 (e) and NK csaQTL 19p13.11 (f). g,h, Cell-type cluster labels for NK (g) and myeloid (h) cells are shown for reference. ASDC, AXL⁺ dendritic cell; DC1, conventional dendritic cell 1; cDC2, conventional dendritic cell 2; ILC, innate lymphoid cell; neg, negative; pDC, plasmacytoid dendritic cell; pos, positive.

**Fig. 3 ∣. Characterization of the csaQTL at 12p13.2.**
a,b, GeNA output. a, Boxplot of sample-level phenotype values for each individual organized by genotype at the lead SNP. (N: C/C 12, C/T 268, T/T 655; bold line: quartile 2; box: quartile 1 to quartile 3; whisker: furthest observation within 1.5× interquartile range (IQR) of the box.) We also show the GeNA $P$ value. b, UMAP of NK cells colored by neighborhood-level phenotype value (that is, correlation between cell abundance and dose of alternative allele per neighborhood). c, Heatmap of expression across neighborhoods for genes with strong correlations in expression to csaQTL neighborhood-level phenotype. Neighborhoods are arrayed along the $x$ axis by phenotype value. The phenotype-correlated genes include general markers of NK activation (*CD69*, NFKBIA) as well as TNF (*DUSP2*, *ZFP36*, *JUNB*, *IER2*) and IFNγ (*CD74*, *XCL1*) response genes. d,e, GSEA identified significant activation of TNF and IFNγ response pathways in association with the csaQTL phenotype. We show UMAPs of NK cells colored by summed expression of TNF response genes (d) and IFNγ response genes (e). We report the Pearson’s $r$ across neighborhoods between phenotype values and summed expression within the gene set. We also show the false discovery rate (FDR)-adjusted gene set enrichment $P$ value (Supplementary Methods). f, Locus zoom plot with one marker per tested SNP, genomic position along the $x$ axis and GeNA $P$ value on the $y$ axis. Each SNP marker is colored by LD value relative to the lead SNP. The csaQTL lead SNP is labeled with a green diamond. The psoriasis risk and *KLRC1* eQTL lead SNPs are labeled with purple triangles. g, Diagram of genotypes for the csaQTL lead SNP and colocalizing associations to molecular, tissue and organism-level traits at this locus.

**Fig. 4 ∣. PRSs aggregate the effects of individual loci to highlight disease-relevant cell states, a valuable point of comparison with single-cell case-control analyses.**
a, Histogram of SNPs in the SLE PRS. For each SNP, we plot the Pearson’s $r$ correlation across OneK1K myeloid neighborhoods between the SLE risk-allele effect and an IFNα response gene signature (shown in c). The marker for each SNP is colored according to its effect weight in the PRS. Six SNPs plotted in b are highlighted in orange. b, Six SNPs in the SLE PRS for which the myeloid cell-state abundance correlations to the SLE risk allele correspond closely to the IFNα response signature (shown in c). For each selected SNP, we plot a UMAP of OneK1K myeloid cells colored by the abundance correlation per neighborhood to dose of the risk allele. We also report the gene(s) to which the SNP has been mapped, Pearson’s $r$ correlation between the neighborhood-level phenotype and IFNα response signature, and the FDR-adjusted GeNA $P$ value (Supplementary Methods). c, IFNα response gene expression per neighborhood among myeloid cells from the OneK1K cohort. d, Myeloid cell-state abundance shift associated with increasing SLE PRS value in the OneK1K cohort. CNA global $P$ values are shown with ( $P_{IFN}$ ) and without ( $P$ ) controlling for mean IFNα response gene expression per individual (Supplementary Methods). Pearson’s $r$ between IFNα response per neighborhood (c) and the PRS phenotype is also shown ( $r_{IFN}$ ). e, IFN response gene expression per neighborhood among myeloid cells in the Perez et al. European ancestry cohort (Supplementary Methods). f, Myeloid cell-state abundance shift associated with SLE disease status in the Perez et al. European ancestry cohort. CNA global $P$ values are shown with ( $P_{IFN}$ ) and without ( $P$ ) controlling for mean IFN response gene expression per individual. Pearson’s $r$ between IFN response per neighborhood and SLE phenotype is also shown ( $r_{IFN}$ ).

See this image and copyright information in PMC

Update of

Identifying genetic variants that influence the abundance of cell states in single-cell data.
Rumker L, Sakaue S, Reshef Y, Kang JB, Yazar S, Alquicira-Hernandez J, Valencia C, Lagattuta KA, Mah-Som A, Nathan A, Powell JE, Loh PR, Raychaudhuri S. Rumker L, et al. bioRxiv [Preprint]. 2023 Nov 15:2023.11.13.566919. doi: 10.1101/2023.11.13.566919. bioRxiv. 2023. Update in: Nat Genet. 2024 Oct;56(10):2068-2077. doi: 10.1038/s41588-024-01909-1. PMID: 38014313 Free PMC article. Updated. Preprint. No abstract available.

References

1. Visscher PM et al. 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet 101, 5–22 (2017). - PMC - PubMed
1. Welter D. et al. The NHGRI GWAS Catalog, a curated resource of SNP–trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014). - PMC - PubMed
1. Shendure J, Findlay GM & Snyder MW Genomic medicine–progress, pitfalls, and promise. Cell 177, 45–57 (2019). - PMC - PubMed
1. Yazar S. et al. Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease. Science 376, eabf3041 (2022). - PubMed
1. Wang QS et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat. Commun 12, 3394 (2021). - PMC - PubMed

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
- Nature Publishing Group
- PubMed Central

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Identifying genetic variants that influence the abundance of cell states in single-cell data

Affiliations

Identifying genetic variants that influence the abundance of cell states in single-cell data

Authors

Affiliations

Abstract

Figures

Update of

References

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources