. 2024 Sep 18;25(1):305.

doi: 10.1186/s12859-024-05926-z.

Leveraging gene correlations in single cell transcriptomic data

Kai Silkwood^{1

2}, Emmanuel Dollinger^{1

2

3}, Joshua Gervin^{1

2}, Scott Atwood^{1

2}, Qing Nie^{1

2

3}, Arthur D Lander^{4

5}

Affiliations

¹ Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA.
² Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA.
³ Department of Mathematics, University of California, Irvine, Irvine, CA, USA.
⁴ Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA. adlander@uci.edu.
⁵ Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA. adlander@uci.edu.

PMID: 39294560
PMCID: PMC11411778
DOI: 10.1186/s12859-024-05926-z

Leveraging gene correlations in single cell transcriptomic data

Kai Silkwood et al. BMC Bioinformatics. 2024.

. 2024 Sep 18;25(1):305.

doi: 10.1186/s12859-024-05926-z.

Authors

Kai Silkwood^{1

2}, Emmanuel Dollinger^{1

2

3}, Joshua Gervin^{1

2}, Scott Atwood^{1

2}, Qing Nie^{1

2

3}, Arthur D Lander^{4

5}

Affiliations

¹ Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA.
² Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA.
³ Department of Mathematics, University of California, Irvine, Irvine, CA, USA.
⁴ Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA. adlander@uci.edu.
⁵ Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA. adlander@uci.edu.

PMID: 39294560
PMCID: PMC11411778
DOI: 10.1186/s12859-024-05926-z

Abstract

Background: Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data-looking for rare cell types, subtleties of cell states, and details of gene regulatory networks-there is a growing need for algorithms with controllable accuracy and fewer ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data in which ground truth about biological variation is unknown (i.e., usually).

Results: We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization-a step that skews distributions, particularly for sparse data-and calculate p values associated with key statistics. We develop an improved method for selecting features for cell clustering and identifying gene-gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships.

Conclusions: New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene-gene correlations.

Keywords: Gene co-expression network; Gene regulatory network; Gene–gene correlation; Melanoma; Single cell RNA sequencing.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests

Figures

**Fig. 1**
Relationship between Pearson correlation coefficient, vector sparsity and p value. The panels compare p values determined empirically by correlating 50,000 pairs of random, independent vectors of length 500 with p values predicted by the Fisher formula. A Data were independent random variates from Poisson distributions with means as indicated. The dashed line shows the output of the Fisher formula. B Even for vectors drawn from a distribution with mean = 1, the Fisher formula significantly mis-estimated p values smaller than 10⁻⁴. C Poor performance of the Fisher formula is worsened when data are drawn from a Poisson-log-normal distribution, rather than a Poisson distribution (in this case the underlying log-normal distribution had a coefficient of variation of 0.5). D–E Data were simulated under the scenario that gene expression is the same in every cell, but due to differences in sequencing depth, observed gene expression varies according to the depths shown in the inset to panel (D). In panel (D), true gene expression was adjusted so that observed gene expression after normalization would have a mean of 1, and the Pearson correlation coefficients (PCCs) obtained by correlating randomly chosen vectors are shown. Panel D plots empirically derived p values as a function of PCC, whereas panel E displays histograms of PCCs. Compared with data that do not require normalization, associated p values from normalized data are even more removed from the predictions of the Fisher formula

**Fig. 2**
Comparing uncorrected and modified corrected Fano factors and correlation coefficients. Random, independent, uncorrelated gene expression data was generated for 1000 genes in 999 cells, under the assumption that observations are random Poisson variates from a per-cell expression level that is itself a random variate of a log-normal distribution, scaled by a sequencing depth factor that is different for each cell (see methods). A Uncorrected ( $ϕ$ ) or modified corrected ( $ϕ^{'}$ ) Fano factors are plotted as a function of mean expression level for each gene. Uncorrected factors were calculated either without normalization, or with default normalization (scaling observations by sequencing depth factors, learned by summing the gene expression in each cell). Uncorrected Fano factors were also calculated using SCTransform [48] as an alternative to default normalization. Modified corrected Fano factors were obtained by applying BigSur to unnormalized data, using a coefficient of variation parameter of c = 0.5. B Modified corrected Fano factors ( $ϕ^{'}$ ) were calculated as in A, but using different values of c. The data suggest that an optimal choice of c can usually be found by examining a plot of $ϕ^{'}$ versus mean expression. C Empirical p values associated with uncorrected (PCC) or modified corrected (PCC′) Pearson correlation coefficients were calculated for pairwise combinations of genes in bins of different mean gene expression level (µ); examples are shown for four representative bins (both genes derived from the same bin). With increasing gene expression levels, the p value versus PCC relationship begins to approach the Fisher formula (dashed curve), but it does so much sooner for PCC′ than PCC

**Fig. 3**
Statistical significance of pairwise gene correlations in data from a clonal cell line. A, B Using scRNAseq data from a human melanoma cell line [50] (8640 cells × 14,933 genes), pairwise values of PCC were calculated from normalized data, and PCC′ from raw data. Histograms display the frequency of observed values (the logarithmic axis in B emphasizes low-frequency events). Notice in B how positive skewing, also seen in simulated data (Fig. 2), is less for PCC′ than PCC. Dashed lines in B show thresholds at which Fisher formula-derived p values would fall below 1.1 × 10.⁻⁴. C, D Scatterplots showing p values assigned by BigSur to pairs of genes within two representative sets of bins of gene expression (for all pairwise combinations see Figs. S1, S2/Additional files 3, 4). The abscissa shows PCC (panel C) and PCC′ (panel D). The ordinate gives the negative *log*₁₀ of p values determined by BigSur, i.e., larger values mean greater statistical significance. Orange and gray shading indicate gene pairs judged significant by BigSur (FDR < 0.02). Blue and orange show gene pairs that would have been judged statistically significant by applying the Fisher formula to the PCC or PCC′, using the same p value threshold as used by BigSur. The blue region contains gene pairs judged significant by the Fisher formula only, while the unshaded region shows gene pairs not significant by either method. Numbers in the lower right corner are the total numbers of possible correlations (blue), statistically significant correlations according to the Fisher formula (green), and statistically significant correlations according to BigSur (red). E, F ROC curves assessing whether the overall performance of the Fisher formula—applied either to PCC or PCC′—can be adequately improved either by using a more stringent p value cutoff (E) or limiting pairwise gene-correlations to those involving only genes with mean expression above a threshold level, µ (F)

**Fig. 4**
Clustering cells based on mitochondrial and ribosomal communities. A Correlations among the two largest gene communities detected by BigSur are shown. Vertices are genes, and edges—green and red—represent significant positive and negative correlations, respectively. Blue vertices represent members of the mitochondrially encoded gene community and brown vertices the ribosomal protein gene community (Labels have been omitted due to the large numbers of gene involved). B Using the top 50 most highly positively connected vertices in the two communities as features, cells were subjected to PCA and Leiden clustering; the three clusters recovered are displayed by UMAP. C, D After a second round of clustering of Cluster 1, the resulting four cell groups were analyzed for the distribution of expression of ribosomal protein-coding and mitochondrially-encoded genes, as a function of total UMI in each cell. The results show that the four clusters form distinct groups based on their relative abundances of ribosomal and mitochondrial genes. E Results of applying BigSur to each cluster. Note the large decrease in statistically significant correlations—especially negative correlations—in any of the clusters when compared with results obtained using all of the cells together (> 600,000 total correlations). This is consistent with heterogeneity in the original sample, causing large blocks of genes to correlate and anti-correlate

**Fig. 5**
Mitochondrial communities are a source of strong anti-correlations. Mitochondrially-encoded and ribosomal protein gene communities were identified in all clusters. Anti-correlations identified specifically between mitochondrially-encoded genes and ribosomal protein genes within the different clusters are shown in A (cluster 1.1), B (cluster 1.2), C (cluster 2) and D (cluster 3). Panel E shows additional anticorrelations in cluster 1.2 involving mitochondrially-encoded genes and glycolysis genes For ease of readability, mitochondrial genes have been highlighted in blue. Red links refer to negative correlations; green to positive

**Fig. 6**
Comparing the ability of BigSur and uncorrected PCCs to identify biologically significant correlations. Using the data from cell cluster 1.2, BigSur identified positive correlations and their p values for all genes, converting the p values to equivalent PCCs. In addition, the same starting data were also normalized and uncorrected PCCs obtained. A–D compare the correlations identified by the two methods. A Genes identified by BigSur as correlating with the Reactome Cholesterol Metabolism gene set. All significant gene–gene correlations detected by BigSur involving genes in this 27-gene set (the names of which are shown at right) and all other genes in the genome are shown graphically. Genes belonging to the set are highlighted in blue; additional known cholesterol metabolism-related genes are highlighted with rectangles. Green lines show significant positive correlations. Dashing surrounds a group of cell cycle genes that correlate with the dual-function gene *LBR*. Expression levels (mean UMI per cell after normalization) for each of the genes in the set are shown at right; genes highlighted in yellow are those that displayed any significant correlations. B Correlation communities for the same gene set, panel identified using uncorrected PCCs, visualized as in panel (A). Note that the target genes are scattered among 5 disconnected communities and are connected with genes with no obvious relationship to cholesterol metabolism. C Positive correlations involving any of the genes belonging to the Reactome Cholesterol Metabolism gene set were divided into “within-community” links and “out-of-community” links. With BigSur (filled symbols), the ratio of within-community to out-of-community links is much higher than with uncorrected PCCs (open symbols), suggesting the latter produce much less enrichment for functionally relevant connections. D Analyses similar to that in panel C were performed for eight additional MSigDB “Hallmark” gene sets. Plotted are the proportions of correlations that are within-group over a range of PCC thresholds. Numbers of genes in each panel are shown in parentheses. E Analyses similar to that in panel (D) were performed with 150-gene sets chosen at random after first removing genes with mean expression below 0.052 (so that median expression for random genes was similar to that of the functional panel genes

**Fig. 7**
Grouping into “metacells” creates false positive correlations. A–C Analysis of synthetic, uncorrelated data. A Pipelines for data processing. Each box denotes a step at which a Pearson correlation coefficient (PCC or PCC′) was calculated, with the color of the box corresponding to the colors used in the following plots. B Histograms of correlation coefficients obtained from the data sets in panel (A). Arrows show the thresholds above which observed correlations were judged statistically significant. C Numbers of correlations in panel (B) that were judged to be significant (p < 0.02). D–F Analysis of melanoma cell line data. D Pipeline for data processing. The colors of each box correspond to the colors in panels (E, F). E Number of correlations judged significant in data sets in D. Darker shading denotes the negative correlations; lighter are positive correlations. F Enrichment for known protein–protein interactions among the correlations shown in (E). The value of n in each case gives the absolute number of protein–protein interactions. Of the 12,129 interactions detected by BigSur, 11,373 were also detected using grouping of log-normalized data and 10,773 were detected using grouping of modified corrected Pearson residuals (10,324 were shared among all three)

See this image and copyright information in PMC

Update of

Leveraging gene correlations in single cell transcriptomic data.
Silkwood K, Dollinger E, Gervin J, Atwood S, Nie Q, Lander AD. Silkwood K, et al. bioRxiv [Preprint]. 2023 Nov 1:2023.03.14.532643. doi: 10.1101/2023.03.14.532643. bioRxiv. 2023. Update in: BMC Bioinformatics. 2024 Sep 18;25(1):305. doi: 10.1186/s12859-024-05926-z. PMID: 36993765 Free PMC article. Updated. Preprint.

References

1. Tritschler S, Buttner M, Fischer DS, Lange M, Bergen V, Lickert H, Theis FJ. Concepts and limitations for learning developmental trajectories from single cell genomics. Development. 2019;146:dev170506. - DOI - PubMed
1. Tam PPL, Ho JWK. Cellular diversity and lineage trajectory: insights from mouse single cell transcriptomes. Development. 2020;147:dev179788. - DOI - PubMed
1. Nguyen H, Tran D, Tran B, Pehlivan B, Nguyen T. A comprehensive survey of regulatory network inference methods using single cell RNA sequencing data. Brief Bioinform. 2021;22:bbaa190. - DOI - PMC - PubMed
1. Xie B, Jiang Q, Mora A, Li X. Automatic cell type identification methods for single-cell RNA sequencing. Comput Struct Biotechnol J. 2021;19:5874–87. - DOI - PMC - PubMed
1. Junttila S, Smolander J, Elo LL. Benchmarking methods for detecting differential states between conditions from multi-subject single-cell RNA-seq data. Brief Bioinform. 2022;23:bbac286. - DOI - PMC - PubMed

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
- BioMed Central
- PubMed Central
Research Materials
- NCI CPTC Antibody Characterization Program
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Leveraging gene correlations in single cell transcriptomic data

Affiliations

Leveraging gene correlations in single cell transcriptomic data

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

Update of

References

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Research Materials

Miscellaneous