A deconvolution framework that uses single-cell sequencing plus a small benchmark data set for accurate analysis of cell type ratios in complex tissue samples

Shuai Guo^#¹, Xiaoqian Liu^#¹, Xuesen Cheng^#², Yujie Jiang^{1,3}, Shuangxi Ji¹, Qingnan Liang², Andrew Koval^{1,3}, Yumei Li², Leah A Owen^{4,5,6}, Ivana K Kim⁷, Ana Aparicio⁸, Sanghoon Lee⁹, Anil K Sood⁹, Scott Kopetz¹⁰, John Paul Shen¹⁰, John N Weinstein^{1,11}, Margaret M DeAngelis^{4,5,6,12}, Rui Chen^#², Wenyi Wang^#¹³

Affiliations

¹ Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA.
² Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA.
³ Department of Statistics, Rice University, Houston, Texas 77005, USA.
⁴ Department of Ophthalmology, Jacobs School of Medicine and Biomedical Engineering, SUNY University at Buffalo, Buffalo, New York 14209, USA.
⁵ Department of Population Health Sciences, University of Utah School of Medicine, Salt Lake City, Utah 84108, USA.
⁶ Department of Ophthalmology and Visual Sciences, University of Utah School of Medicine, Salt Lake City, Utah 84132, USA.
⁷ USA Retina Service, Harvard Medical School, Massachusetts Eye and Ear, Boston, Massachusetts 02114, USA.
⁸ Department of Genitourinary Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77230, USA.
⁹ Department of Gynecologic Oncology and Reproductive Medicine, The University of Texas MD Anderson Cancer Center, Houston, Texas 77230, USA.
¹⁰ Department of Gastrointestinal Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA.
¹¹ Department of Systems Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA.
¹² VA Western New York Healthcare System, Buffalo, New York 14215, USA.
¹³ Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA; wwang7@mdanderson.org.

^# Contributed equally.

PMID: 39586714
PMCID: PMC11789644
DOI: 10.1101/gr.278822.123

Shuai Guo et al. Genome Res. 2025 Jan 22;35(1):147-161. doi: 10.1101/gr.278822.123.

Abstract

Bulk deconvolution with single-cell/nucleus RNA-seq data is critical for understanding heterogeneity in complex biological samples, yet the technological discrepancy across sequencing platforms limits deconvolution accuracy. To address this, we utilize an experimental design to match inter-platform biological signals, hence revealing the technological discrepancy, and then develop a deconvolution framework called DeMixSC using this well-matched, that is, benchmark, data. Built upon a novel weighted nonnegative least-squares framework, DeMixSC identifies and adjusts genes with high technological discrepancy and aligns the benchmark data with large patient cohorts of matched-tissue-type for large-scale deconvolution. Our results using two benchmark data sets of healthy retinas and ovarian cancer tissues suggest much-improved deconvolution accuracy. Leveraging tissue-specific benchmark data sets, we applied DeMixSC to a large cohort of 453 age-related macular degeneration patients and a cohort of 30 ovarian cancer patients with various responses to neoadjuvant chemotherapy. Only DeMixSC successfully unveiled biologically meaningful differences across patient groups, demonstrating its broad applicability in diverse real-world clinical scenarios. Our findings reveal the impact of technological discrepancy on deconvolution performance and underscore the importance of a well-matched data set to resolve this challenge. The developed DeMixSC framework is generally applicable for accurately deconvolving large cohorts of disease tissues, including cancers, when a well-matched benchmark data set is available.

Figures

**Figure 1.**
Assessing technological discrepancy between bulk and single-cell sequencing platforms using matched single-nucleus aliquots. (A) Workflow for generating a benchmark data set. We collect 24 healthy human retinal samples within 6 h postmortem. An illustration shows the layer and cell compositions of the human retina. Seven major cell types include photoreceptors (rod and cone cells), bipolar cells (BCs), retinal ganglion cells (RGCs), horizontal cells (HCs), amacrine cells (ACs), and Müller glia cells (MGs). Three minor cell types are not depicted in the illustration: astrocytes, microglia cells, and retinal pigment epithelial cells (RPEs). Samples are isolated into single-nucleus suspensions. The same aliquot of single nucleus is used for both bulk and snRNA-seq profiling. The matched pseudobulk mixtures are generated as conventionally done by summing UMI counts across cells from all cell types in each sample. This data generation pipeline guarantees the matched bulk and snRNA-seq data share the same cell type proportions, which enables us to evaluate the impact of technological discrepancy (i.e., the shot-gun sequencing procedure) on the bulk and snRNA-seq expression profiles. (B,C) The influence of technological discrepancy at the sample and gene level, respectively. (B) Spearman's correlation coefficient across genes between the matched real-bulk and pseudobulk RNA-seq data for one sample at a time for both batches. The correlations were calculated using quantile-normalized expression data (relative abundances). (C) MA-plots displaying the mean expression levels of all genes between matched real-bulk and pseudobulk data. Differentially expressed (DE) genes are identified using the paired t-test with Benjamini–Hochberg (BH) adjustment. Red represents genes expressed higher in the real bulk, and blue represents genes expressed higher in the pseudobulk. The horizontal dotted lines denote a twofold change between matched real-bulk and pseudobulk data. 
(adj.p) Adjusted P-values. (D) Venn diagrams showing genes consistently expressed higher in the bulk (*top*, overlap of red dots in panel C) or in the snRNA-seq-generated pseudobulk (*bottom*, overlap of blue dots in panel C) between the two batches, which were generated using different tissue samples and at a different time.
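
Two steps named in the Figure 1 caption can be sketched in a few lines: building a pseudobulk profile by summing UMI counts across the cells of one sample, and Benjamini–Hochberg adjustment of per-gene P-values. This is a minimal illustration, not code from the paper or the DeMixSC package. The function names `pseudobulk` and `bh_adjust`, the dictionary-per-cell input layout, and the toy counts are assumptions made for the example.

```python
from typing import Dict, List, Sequence


def pseudobulk(cell_counts: Sequence[Dict[str, int]]) -> Dict[str, int]:
    """Sum UMI counts per gene across all cells of one sample."""
    totals: Dict[str, int] = {}
    for cell in cell_counts:
        for gene, umi in cell.items():
            totals[gene] = totals.get(gene, 0) + umi
    return totals


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy input (made-up counts): two cells from one sample.
cells = [{"RHO": 3, "GNAT1": 1}, {"RHO": 2}]
profile = pseudobulk(cells)

# Toy P-values (made up), e.g. from per-gene paired tests.
adj_p = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

The paired t-test itself is omitted because it needs a t-distribution CDF, which the Python standard library does not provide.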

**Figure 2.**
Overview of DeMixSC. The DeMixSC framework for deconvolution analysis of bulk RNA-seq data using sc/sn RNA-seq data as a reference. (A) The framework starts with a benchmark data set of matched bulk and sc/snRNA-seq data with the same cell type proportions. Pseudobulk mixtures are generated from the sc/sn data. DeMixSC identifies genes in G₁ and G₂ with the matched real-bulk and pseudobulk data. The non-DE genes are considered stably captured by both sequencing platforms (blue), whereas the DE genes are more impacted by the technological discrepancy (orange). (B) DeMixSC then employs a normalization procedure to perform the alignment between two bulk RNA-seq data sets (e.g., with ComBat). (C) DeMixSC estimates cell type proportions under a weighted nonnegative least square (wNNLS) framework with two improvements: (1) partitioning and adjusting genes with high technological discrepancy and (2) a new weight function. The final estimates are obtained when the algorithm either converges or reaches the prespecified maximum number of iterations. Here, G₁ is genes with low technological discrepancy, G₂ is genes with high technological discrepancy, a is a user-defined positive constant that serves as an adjustment factor, $\hat{r}$ is the reference matrix derived from the sc/snRNA-seq data, y is the observed expression in bulk RNA-seq data, $\hat{p}$ is the vector of estimated cell type proportions, and $\hat{w}$ is the estimated gene weights.
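
The idea of down-weighting high-discrepancy genes inside a weighted nonnegative least-squares fit can be illustrated with a simplified sketch. It is not the DeMixSC algorithm: the caption does not give the weight function, so the example uses fixed weights (1 for G₁ genes, 1/a for G₂ genes) and skips the iterative re-estimation of weights. The projected-gradient solver, the final normalization to sum to one, the names `weighted_nnls` and `estimate_proportions`, and all numbers are assumptions chosen for illustration.

```python
from typing import List, Sequence, Set


def weighted_nnls(R: Sequence[Sequence[float]], y: Sequence[float],
                  w: Sequence[float], n_iter: int = 2000) -> List[float]:
    """Minimize sum_g w[g] * (y[g] - sum_k R[g][k] * p[k])**2 subject to p >= 0,
    using projected gradient descent."""
    n_genes, n_types = len(R), len(R[0])
    # 2 * trace(R^T W R) bounds the gradient's Lipschitz constant from above,
    # so 1 / that value is a safe (if conservative) step size.
    lipschitz = 2.0 * sum(w[g] * sum(v * v for v in R[g]) for g in range(n_genes))
    step = 1.0 / lipschitz
    p = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        resid = [sum(R[g][k] * p[k] for k in range(n_types)) - y[g]
                 for g in range(n_genes)]
        grad = [2.0 * sum(w[g] * R[g][k] * resid[g] for g in range(n_genes))
                for k in range(n_types)]
        # Gradient step followed by projection onto the nonnegative orthant.
        p = [max(0.0, p[k] - step * grad[k]) for k in range(n_types)]
    return p


def estimate_proportions(R: Sequence[Sequence[float]], y: Sequence[float],
                         g2: Set[int], a: float) -> List[float]:
    """Fit with weight 1/a on genes in g2 (assumed weighting), then normalize."""
    w = [1.0 / a if g in g2 else 1.0 for g in range(len(y))]
    p = weighted_nnls(R, y, w)
    total = sum(p)
    return [v / total for v in p]


# Toy reference (4 genes x 2 cell types) and true proportions (0.7, 0.3).
R = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0], [8.0, 2.0]]
# Bulk profile R @ (0.7, 0.3), with the last gene inflated twofold to mimic
# a gene with high technological discrepancy.
y = [7.3, 3.7, 6.0, 12.4]

unadjusted = estimate_proportions(R, y, g2={3}, a=1.0)
adjusted = estimate_proportions(R, y, g2={3}, a=100.0)
```

In this toy case the estimate with a = 100 lands much closer to the true proportions than the estimate with a = 1, because the inflated gene contributes far less to the fit.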

**Figure 3.**
Comparing the estimation accuracy of DeMixSC to existing deconvolution methods. (A) Workflow for the deconvolution benchmarking design. We use benchmark data from retinal samples. The cell count proportions for each cell type are used as ground truth for the corresponding tissue samples. We assess the deconvolution performance of DeMixSC and seven existing methods for both bulk and pseudobulk mixtures. In addition to the raw counts, we also test RPM, RPKM, and TPM. The deconvolution performance is assessed by RMSE and Spearman's correlation coefficient. Note the results by SQUID are discussed in the text only. (B,C) Boxplots showing the deconvolution performance of eight deconvolution methods for the bulk and pseudobulk data. RMSE and Spearman's correlation coefficient values are calculated across seven major cell types for each sample, with gray denoting pseudobulk and red denoting real bulk. Smaller RMSEs or larger Spearman's correlations indicate a higher accuracy in proportion estimation. (D) Boxplots showing the distributions of deconvolution estimates at the cell type level for all 24 retinal samples. Each color corresponds to a given deconvolution method, with black denoting the ground truth, and each panel corresponds to a given cell type. (E,F), An overview of deconvolution performance at the cell type level across the eight methods using RMSE and Spearman's correlation coefficient, respectively. Lighter colors correspond to lower RMSE or Spearman's correlation coefficient values. Gray indicates NA.
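
The two per-sample accuracy measures used in Figure 3, RMSE and Spearman's correlation coefficient across cell types, can be computed as in the following sketch. The function names and the example proportions are illustrative assumptions, not values from the benchmark data. Ties are handled with average ranks, which is the usual convention for Spearman's coefficient.

```python
import math
from typing import List, Sequence


def rmse(truth: Sequence[float], estimate: Sequence[float]) -> float:
    """Root-mean-square error across cell types for one sample."""
    squared = [(t - e) ** 2 for t, e in zip(truth, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Made-up proportions for three cell types in one sample.
truth = [0.6, 0.3, 0.1]
estimate = [0.5, 0.3, 0.2]
sample_rmse = rmse(truth, estimate)
sample_rho = spearman(truth, estimate)
```

Here the estimate preserves the ordering of the cell types, so Spearman's coefficient is 1 even though the RMSE is nonzero. This is why the two measures are reported together.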

**Figure 4.**
Using DeMixSC to deconvolve a large cohort of human peripheral retinal samples. (A) PCA plots of both the retina cohort data and the benchmark data. Red denotes the bulk data to be deconvolved; blue denotes the benchmark bulk data; and green denotes the benchmark pseudobulk data. (B,C) Panels demonstrating the robustness of DeMixSC to different reference matrices at both the cell type and sample levels. Higher correlation coefficients indicate better performance. (D) Distributions of DeMixSC estimated cell type proportions of Ratnapriya et al. (2019) data using consensus references. Each panel corresponds to a given cell type. The P-values for Student's t-tests comparing the estimated cell type proportions between non-AMD (healthy) and AMD groups are denoted as follows: (ns) not significant, P-value > 0.05; (*) P-value ≤ 0.05; (**) P-value ≤ 0.01; and (***) P-value ≤ 0.001.

**Figure 5.**
Using DeMixSC to deconvolve HGSC samples. (A) Boxplots showing the deconvolution performance of eight deconvolution methods for the pseudobulk and three types of bulk data in the HGSC benchmark data set. RMSE values and Spearman's correlation coefficients are calculated across 13 cell types for each sample. Smaller RMSEs or larger Spearman's correlations indicate higher accuracy in proportion estimation. (B) Distributions of DeMixSC estimated cell type proportions of Lee et al. (2020) data using consensus references. Each panel corresponds to a given cell type. (NK cells) natural killer cells, (ILC) innate lymphoid cells, (DC) dendritic cells macrophages, and (pDC) plasmacytoid dendritic cells. The P-values for Student's t-tests comparing the estimated cell type proportions across R0, ER, and PR groups are denoted as follows: (ns) not significant, P-value > 0.05; (*) P-value ≤ 0.05; (**) P-value ≤ 0.01; and (***) P-value ≤ 0.001. (C) Scatter plot comparing DeMixSC estimates of macrophages with immunofluorescent measures (CD68/CD163) in 21 HGSC samples. The black dashed line represents the diagonal, and the gray solid line indicates the linear fit across the data points.

Update of

DeMixSC: a deconvolution framework that uses single-cell sequencing plus a small benchmark dataset for improved analysis of cell-type ratios in complex tissue samples.
Guo S, Liu X, Cheng X, Jiang Y, Ji S, Liang Q, Koval A, Li Y, Owen LA, Kim IK, Aparicio A, Shen JP, Kopetz S, Weinstein JN, DeAngelis MM, Chen R, Wang W. bioRxiv [Preprint]. 2023 Nov 11:2023.10.10.561733. doi: 10.1101/2023.10.10.561733. Update in: Genome Res. 2025 Jan 22;35(1):147-161. doi: 10.1101/gr.278822.123. PMID: 37873318.

References

Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A. 2011. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 12: R18. doi: 10.1186/GB-2011-12-2-R18
Aliee H, Theis FJ. 2021. AutoGeneS: automatic gene selection using multi-objective optimization for RNA-seq deconvolution. Cell Syst 12: 706–715.e4. doi: 10.1016/J.CELS.2021.05.006
Ambati J, Atkinson JP, Gelfand BD. 2013. Immunology of age-related macular degeneration. Nat Rev Immunol 13: 438–451. doi: 10.1038/nri3459
Anghel CV, Quon G, Haider S, Nguyen F, Deshwar AG, Morris QD, Boutros PC. 2015. ISOpureR: an R implementation of a computational purification algorithm of mixed tumour profiles. BMC Bioinformatics 16: 156. doi: 10.1186/S12859-015-0597-X
Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57: 289–300. doi: 10.1111/J.2517-6161.1995.TB02031.X
