[Preprint]. 2023 Nov 11:2023.10.10.561733.
doi: 10.1101/2023.10.10.561733.

DeMixSC: a deconvolution framework that uses single-cell sequencing plus a small benchmark dataset for improved analysis of cell-type ratios in complex tissue samples

Shuai Guo et al. bioRxiv.

Abstract

Bulk deconvolution with single-cell/nucleus RNA-seq data is critical for understanding heterogeneity in complex biological samples, yet the technological discrepancy across sequencing platforms limits deconvolution accuracy. To address this, we introduce an experimental design to match inter-platform biological signals, hence revealing the technological discrepancy, and then develop a deconvolution framework called DeMixSC using the better-matched, i.e., benchmark, data. Built upon a novel weighted nonnegative least-squares framework, DeMixSC identifies and adjusts genes with high technological discrepancy and aligns the benchmark data with large patient cohorts of the matched tissue type for large-scale deconvolution. Our results using a benchmark dataset of healthy retinas suggest much-improved deconvolution accuracy. Further analysis of a cohort of 453 patients with age-related macular degeneration supports the broad applicability of DeMixSC. Our findings reveal the impact of technological discrepancy on deconvolution performance and underscore the importance of a well-matched dataset to resolve this challenge. The developed DeMixSC framework is generally applicable for deconvolving large cohorts of disease tissues, and potentially cancer.

Keywords: Transcriptomic deconvolution; age-related macular degeneration; bulk RNA sequencing; retina; single-cell RNA sequencing; single-nucleus RNA sequencing; technological discrepancy.

Conflict of interest statement

Competing interests The authors declare that they have no competing interests.

Figures

Extended Data Figure 1 | Overview of the matched bulk and snRNA-seq data.
A and B, UMAP projection of snRNA-seq data from 4 healthy retinal samples in batch-1 and 20 healthy retinal samples in batch-2, annotated by cell types. C and D, UMAP projection of snRNA-seq data from 4 healthy retinal samples in batch-1 and 20 healthy retinal samples in batch-2, annotated by sample IDs. Cells were clustered by their biological annotations instead of sample origins, suggesting negligible batch effects. E, Distribution of the first two principal components for the matched real-bulk and pseudo-bulk RNA-seq data in the benchmark dataset. F, Boxplot showing the raw read depth between bulk and pseudo-bulk RNA-seq data from batch-1 and batch-2. The P-values for Wilcoxon rank-sum tests comparing sequencing read depth between bulk and pseudo-bulk data are denoted as follows: *P-value ≤0.05; **P-value ≤0.01; and ***P-value ≤0.001.
Extended Data Figure 2 | DeMixSC maintains low collinearity among top-weighted genes in the benchmark dataset.
A and B, Boxplots showing the numbers of low and high discrepancy genes within the top 1,000 weighted genes across samples: DeMixSC (A) and MuSiC (B). C, Boxplot showing the between-cell-type collinearity among 21 cell-type pairs in the benchmark dataset across 24 retinal samples, as measured by Pearson correlation coefficient. The cell-type pairs are ordered according to their cell-type proportions within the tissue sample. MuSiC (red boxes) exhibits overall higher collinearity, especially for (Rods, Cones), (ACs, BCs), and (ACs, RGCs). In contrast, DeMixSC (blue boxes) presents only a slightly elevated correlation for (ACs, RGCs). The degree of collinearity is categorized as high (≥0.7), medium (0.4–0.7), and low (<0.4). D, Heatmaps showing cell-type-specific expression patterns of the top 1,000 weighted genes, using Sample 8 as an example for a mid-level correlation.
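The collinearity measure described in this caption can be sketched as follows. This is a minimal illustration, assuming a genes-by-cell-types reference matrix; the function names and the toy data are hypothetical and are not taken from the DeMixSC software. Only the thresholds (high ≥0.7, medium 0.4–0.7, low <0.4) come from the caption.

```python
import numpy as np

def pairwise_collinearity(reference):
    """Pearson correlation between every pair of cell-type columns.

    reference: (genes x cell_types) array of reference expression values.
    Returns a dict mapping (i, j) column-index pairs to their correlation.
    """
    corr = np.corrcoef(reference, rowvar=False)
    k = reference.shape[1]
    return {(i, j): corr[i, j] for i in range(k) for j in range(i + 1, k)}

def collinearity_level(r):
    """Bin a correlation using the thresholds quoted in the caption."""
    if r >= 0.7:
        return "high"
    if r >= 0.4:
        return "medium"
    return "low"

# Toy reference: 1,000 genes and 7 hypothetical cell types (assumed data).
rng = np.random.default_rng(0)
toy_reference = rng.gamma(shape=2.0, scale=1.0, size=(1000, 7))

pairs = pairwise_collinearity(toy_reference)
levels = {pair: collinearity_level(r) for pair, r in pairs.items()}
```

Seven cell types give 7 × 6 / 2 = 21 pairs, which matches the number of pairs reported for the benchmark dataset.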
Extended Data Figure 3 | Deconvolution performance of the tree-guided MuSiC and Ensemble-SCDC.
A, Hierarchical clustering of the cell-type-specific reference matrix. B, Boxplot showing the distributions of estimated cell-type proportions from the benchmark data using the tree-guided MuSiC. C, Boxplot showing the distributions of estimated cell-type proportions from the benchmark data using the SCDC ensemble mode. Black denotes the ground truth estimated using the snRNA-seq data. Gray denotes estimates from the pseudo-bulk RNA-seq data, and red denotes estimates from the matched bulk RNA-seq data.
Extended Data Figure 4 | Impact of data normalization on the deconvolution performance.
A, C, and E, Boxplots showing the deconvolution performance across DeMixSC and seven current single-cell-based deconvolution methods for bulk and pseudo-bulk mixtures. RMSE values are calculated across seven major cell types for each sample, with gray denoting pseudo-bulk and red denoting real-bulk. Smaller values indicate higher accuracy in proportion estimation. B, D, and F, Heatmaps showing the deconvolution performance at the cell-type level across the eight methods using RMSE. Lighter colors correspond to lower RMSE values. Each panel corresponds to a normalization strategy: RPM (B), RPKM (D), and TPM (F).
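For readers unfamiliar with the three normalization strategies named in this caption, the standard definitions can be sketched as below. This is an illustration using the usual textbook formulas, not the authors' preprocessing code; the toy counts and gene lengths are assumptions.

```python
import numpy as np

def rpm(counts):
    """Reads per million: scale each sample (column) to a total of 1e6."""
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def rpkm(counts, gene_length_bp):
    """RPM divided by gene length in kilobases."""
    return rpm(counts) / (gene_length_bp[:, None] / 1e3)

def tpm(counts, gene_length_bp):
    """Length-normalize first, then scale each sample to a total of 1e6."""
    rate = counts / (gene_length_bp[:, None] / 1e3)
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

# Toy genes-by-samples count matrix and gene lengths (assumed values).
counts = np.array([[10.0, 0.0], [30.0, 20.0], [60.0, 80.0]])
lengths = np.array([1000.0, 2000.0, 4000.0])

rpm_values = rpm(counts)
rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

RPM and TPM columns each sum to one million, whereas RPKM columns generally do not, which is one reason the three can lead to different deconvolution results.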
Extended Data Figure 5 | DeMixSC maintains low collinearity among top-weighted genes in the AMD cohort.
A, Boxplot showing numbers of low and high discrepancy genes within the top 1,000 weighted genes for each sample. B, Boxplot showing the between-cell-type collinearity among 45 cell-type pairs (for 10 cell types) in the AMD cohort across 453 samples, as measured by Pearson correlation coefficient. The cell-type pairs are ordered according to their cell-type proportion within the tissue sample. Most of the cell-type pairs exhibit low collinearity, especially for pairs of major cell types on the left of the plot. Only the correlations between ACs and RGCs are slightly higher. A similar observation is made when deconvolving the benchmark data. The degree of collinearity is categorized as high (≥0.7), medium (0.4–0.7), and low (<0.4).
Extended Data Figure 6 | Cell-type proportion estimates for the AMD cohort with existing methods.
A, B, and C, Boxplots showing the distributions of cell-type proportion estimates for non-AMD retina vs. AMD retina from MuSiC2 (A), CIBERSORTx (B), SQUID (C). The P-values for Student’s t-tests comparing the estimated cell-type proportions between non-AMD (healthy) and AMD groups are denoted as follows: not significant (ns), P-value >0.05; *P-value ≤0.05; **P-value ≤0.01; and ***P-value ≤0.001.
Extended Data Figure 7 | DeMixSC recovers a dynamic shift in cell-type proportions during AMD progression.
A, B, and C, Boxplots showing the distributions of cell-type proportion estimates across different MGS stages from MGS1 to MGS4. Each panel corresponds to a given cell type: Rods (A), BCs (B), and MGs (C).
Extended Data Figure 8 | Convergence of DeMixSC with different starting values.
A, A list of different starting values across ten cell types. B, Trace plots of estimated proportions over iterations.
Figure 1 | Assessing technological discrepancy between bulk and single-cell sequencing platforms using matched single-nuclei aliquots.
A, Workflow for generating a benchmark dataset. We collect 24 healthy human retinal samples within six hours postmortem. An illustration shows the layer and cell compositions of the human retina. Seven major cell types include photoreceptors (Rod and Cone cells), bipolar cells (BCs), retinal ganglion cells (RGCs), horizontal cells (HCs), amacrine cells (ACs), and Müller glia cells (MGs). Three minor cell types are not depicted in the illustration: astrocytes, microglia cells, and retinal pigment epithelial cells (RPEs). Samples are isolated into single-nucleus suspensions. The same aliquot of single nuclei is used for both bulk and snRNA-seq profiling. The matched pseudo-bulk mixtures are generated as conventionally done by summing UMI counts across cells from all cell types in each sample. This data generation pipeline guarantees that the matched bulk and snRNA-seq data share the same cell-type proportions, which enables us to evaluate the impact of technological discrepancy (i.e., the shot-gun sequencing procedure) on the bulk and snRNA-seq expression profiles. B and C show the influence of technological discrepancy at the sample and gene levels, respectively. B, Pearson correlation coefficient across genes between the matched real-bulk and pseudo-bulk RNA-seq data for one sample at a time for both batches. C, MA-plots displaying the mean expression levels of all genes between matched real-bulk and pseudo-bulk data. Differentially expressed (DE) genes are identified using the paired t-test with Benjamini-Hochberg (BH) adjustment. Red represents genes expressed higher in the real-bulk, and blue represents genes expressed higher in the pseudo-bulk. The horizontal dotted lines denote a 2-fold change between matched real-bulk and pseudo-bulk data. adj.p: adjusted P-values. D, Venn diagrams showing genes consistently expressed higher in the bulk (upper) or the pseudo-bulk (bottom) between the two batches, which were generated using different tissue samples and at different times.
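The two computational steps named in this caption, pseudo-bulk generation by summing UMI counts and gene-wise paired t-tests with BH adjustment, can be sketched as follows. This is an illustrative sketch rather than the authors' pipeline: the function names are hypothetical, and applying the test to log-scale expression is an assumption that the caption does not state.

```python
import numpy as np
from scipy import stats

def pseudo_bulk(umi_counts):
    """Sum UMI counts over all cells of one sample (genes x cells -> genes)."""
    return umi_counts.sum(axis=1)

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest P-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

def paired_de(expr_bulk, expr_pseudo):
    """Gene-wise paired t-test across matched samples.

    Both inputs are (genes x samples) arrays with matched columns;
    log-scale values are assumed here, not confirmed by the source.
    """
    t_stat, p = stats.ttest_rel(expr_bulk, expr_pseudo, axis=1)
    return t_stat, bh_adjust(p)
```

Genes with a small adjusted P-value and a positive t statistic would correspond to the "higher in real-bulk" group in the MA-plots, and those with a negative statistic to the "higher in pseudo-bulk" group.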
Figure 2 | Overview of DeMixSC.
The DeMixSC framework for deconvolution analysis of bulk RNA-seq data using sc/snRNA-seq data as a reference. A, The framework starts with a benchmark dataset of matched bulk and sc/snRNA-seq data with the same cell-type proportions. Pseudo-bulk mixtures are generated from the sc/sn data. DeMixSC identifies DE and non-DE genes between the matched real-bulk and pseudo-bulk data. The non-DE genes are considered stably captured by both sequencing platforms (blue), while the DE genes are highly affected by technological discrepancy (orange). B, DeMixSC then employs a normalization procedure to perform the alignment between two bulk RNA-seq datasets (e.g., with ComBat). C, DeMixSC estimates cell-type proportions by regression under a weighted nonnegative least-squares (wNNLS) framework with two improvements: 1) partitioning and adjusting genes with high technological discrepancy, and 2) a new weight function. Here, $g$ is the gene index, $j$ is the subject index, $k$ is the cell-type index, $\hat{p}_j$ is the estimated cell-type proportions, $w_{jg}$ is the weight, $n_j$ is the normalization constant, $\hat{R}_{gk}$ is the reference expression value derived from the sc/snRNA-seq data, $a_g$ is the log2-transformed mean expression of the matched bulk and pseudo-bulk RNA-seq data, $y_{jg}$ is the observed expression value in bulk RNA-seq data, and $\hat{y}_{jg}$ is the corresponding fitted value.
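The general idea of a weighted nonnegative least-squares fit can be sketched as below. This is a generic wNNLS illustration, not the DeMixSC algorithm: the caption does not give the DeMixSC weight function or its gene-adjustment step, so the weights here (zero for genes flagged as high-discrepancy, one otherwise) are a hypothetical stand-in, and the data are simulated.

```python
import numpy as np
from scipy.optimize import nnls

def weighted_nnls(y, R, w):
    """Solve min_p sum_g w_g * (y_g - (R p)_g)^2 subject to p >= 0,
    then rescale p so the proportions sum to one."""
    sw = np.sqrt(w)
    p, _ = nnls(R * sw[:, None], y * sw)
    total = p.sum()
    return p / total if total > 0 else p

rng = np.random.default_rng(1)
n_genes, n_types = 200, 4
R = rng.gamma(2.0, 1.0, size=(n_genes, n_types))  # toy reference matrix
true_p = np.array([0.5, 0.3, 0.15, 0.05])
y = R @ true_p                                    # noise-free toy bulk profile

# Hypothetical weights: genes flagged as high-discrepancy get zero weight.
high_discrepancy = np.zeros(n_genes, dtype=bool)
high_discrepancy[:20] = True
y[high_discrepancy] *= 3.0                        # simulate platform distortion
w = np.where(high_discrepancy, 0.0, 1.0)

p_hat = weighted_nnls(y, R, w)
```

In this toy setting the distorted genes are ignored entirely, so the fit recovers the simulated proportions. A real weight function would presumably vary smoothly across genes rather than take only the values zero and one.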
Figure 3 | Comparing the estimation accuracy of DeMixSC with existing deconvolution methods.
A, Workflow for the deconvolution benchmarking design. We use benchmark data from retinal samples. The cell count proportions for each cell type are used as ground truth for the corresponding tissue samples. We assess the deconvolution performance of DeMixSC and seven existing methods for both bulk and pseudo-bulk mixtures. In addition to the raw counts, we also test RPM, RPKM, and TPM. The deconvolution performance is assessed by RMSE and MAE. B and C, Boxplots showing the deconvolution performance of eight deconvolution methods for the bulk and pseudo-bulk data. RMSE and MAE values are calculated across seven major cell types for each sample, with gray denoting pseudo-bulk and red denoting real-bulk. Smaller values indicate higher accuracy in proportion estimation. D, Boxplots showing the distributions of deconvolution estimates at the cell-type level for all 24 retinal samples. Each color corresponds to a given deconvolution method, with black denoting the ground truth, and each panel corresponds to a given cell type. E and F, An overview of deconvolution performance at the cell-type level across the eight methods using RMSE and MAE, respectively. Lighter colors correspond to lower RMSE or MAE values.
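The two accuracy metrics used in this figure can be computed per sample as sketched below, using their standard definitions. The proportions shown are made-up toy values for seven cell types, not data from the study.

```python
import numpy as np

def rmse(estimated, truth):
    """Root mean squared error across cell types for one sample."""
    diff = np.asarray(estimated) - np.asarray(truth)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(estimated, truth):
    """Mean absolute error across cell types for one sample."""
    diff = np.asarray(estimated) - np.asarray(truth)
    return float(np.mean(np.abs(diff)))

# Toy proportions for seven cell types (assumed values).
truth = np.array([0.60, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05])
estimate = np.array([0.50, 0.20, 0.10, 0.05, 0.05, 0.05, 0.05])

sample_rmse = rmse(estimate, truth)
sample_mae = mae(estimate, truth)
```

RMSE penalizes large errors in a single cell type more heavily than MAE, so reporting both gives a fuller picture of where a method goes wrong.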
Figure 4 | Using DeMixSC to deconvolve a large cohort of human peripheral retinal samples.
A, PCA plots of both the retina cohort data and the benchmark data. Red denotes the bulk data to be deconvolved, blue denotes the benchmark bulk data, and green denotes the benchmark pseudo-bulk data. B and C demonstrate the robustness of DeMixSC to different reference matrices at both the cell-type and sample levels. Higher correlation coefficients indicate better performance. D, Distributions of DeMixSC-estimated cell-type proportions for the Ratnapriya et al. data using consensus references. Each panel corresponds to a given cell type. The P-values for Student’s t-tests comparing the estimated cell-type proportions between non-AMD (healthy) and AMD groups are denoted as follows: not significant (ns), P-value >0.05; *P-value ≤0.05; **P-value ≤0.01; and ***P-value ≤0.001.
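The group comparison and star notation described in this caption can be sketched as follows. The thresholds come from the caption; the function name and the simulated proportions are hypothetical, and the example uses SciPy's default equal-variance (Student's) two-sample t-test.

```python
import numpy as np
from scipy import stats

def significance_label(p):
    """Map a P-value to the notation used in the caption."""
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return "ns"

# Simulated proportions of one cell type in two groups (assumed data).
rng = np.random.default_rng(2)
non_amd = rng.normal(0.30, 0.03, size=50)
amd = rng.normal(0.25, 0.03, size=50)

result = stats.ttest_ind(non_amd, amd)  # equal_var=True by default
label = significance_label(result.pvalue)
```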
