AdRoit is an accurate and robust method to infer complex transcriptome composition

Affiliations

¹ Regeneron Pharmaceuticals, Inc., Tarrytown, NY, 10591, USA.
² Cellular Longevity, Inc., San Francisco, CA, 94103, USA.
³ Regeneron Pharmaceuticals, Inc., Tarrytown, NY, 10591, USA. yu.bai@regeneron.com.

^# Contributed equally.

PMID: 34686758
PMCID: PMC8536787
DOI: 10.1038/s42003-021-02739-1

Tao Yang et al. Commun Biol. 2021 Oct 22;4(1):1218. doi: 10.1038/s42003-021-02739-1.

Abstract

Bulk RNA sequencing provides the opportunity to understand biology at the whole-transcriptome level without the prohibitive cost of single-cell profiling. Advances in spatial transcriptomics make it possible to dissect tissue organization and function through genome-wide gene expression. However, the readout of both technologies is the overall gene expression across potentially many cell types, and it does not directly reveal the cell type composition. Although several in silico approaches have been proposed to deconvolute RNA-Seq data composed of multiple cell types, many suffer a deterioration of performance in complex tissues. Here we present AdRoit, an accurate and robust method to infer the cell composition from transcriptome data of mixed cell types. AdRoit uses gene expression profiles obtained from single-cell RNA sequencing as a reference. It employs an adaptive learning approach to alleviate the differences in sequencing technique between the single-cell and the bulk (or spatial) transcriptome data, enhancing cross-platform readout comparability. Our systematic benchmarking and applications, which include deconvoluting complex mixtures that encompass 30 cell types, demonstrate its utility and its favorable sensitivity and specificity compared to many existing methods. In addition, AdRoit is computationally efficient and runs orders of magnitude faster than most methods.

Conflict of interest statement

T.Y., Y.B., W.F., and G.S.A. have filed a patent application relating to the AdRoit computational framework. M.L.-F. is an employee of Cellular Longevity. The remaining authors are employees and shareholders of Regeneron Pharmaceuticals, although the manuscript’s subject matter does not have any relationship to any products or services of this corporation.

Figures

**Fig. 1. Schematic representation of AdRoit computational framework.**
a AdRoit takes as input compound (bulk or spatial) RNA-Seq data, single-cell RNA-Seq data, and cell type annotations. It first selects informative genes and estimates their means and dispersions, then computes the cell type specificity of genes. Depending on the availability of multiple samples, cross-sample gene variability is derived from either the compound RNA-Seq data or the single-cell data (see also “Methods”). Lastly, the gene-wise correction factors are computed to reduce the platform bias between the compound and the single-cell RNA-Seq data. These quantities are used in a weighted regularized model to infer the cell type composition. b A mock example to illustrate the role of the gene-wise correction factor. Conceptually, an accurate estimation of the cell proportions should be represented by the slope of the green line; however, fitting in the presence of outlier genes would result in the red line. Outlier genes exist because the platform bias affects genes differently. AdRoit adopts an adaptive learning approach that first learns a coarse estimation of the slope (red line), from which the gene-wise corrections are derived and applied to the outlier genes, moving them toward the green line. The more a gene deviates, the larger the correction (i.e., longer arrows). After the adjustment, the new estimated slope (blue line) is closer to the truth (green line) and thus is a more accurate estimation.
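
To make the phrase "weighted regularized model" in panel a more concrete, the following is a minimal sketch of one such model: a gene-weighted, ridge-penalized non-negative least squares fit whose solution is rescaled to proportions. It is not the AdRoit implementation. The function name `weighted_ridge_nnls`, the ridge penalty `lam`, the projected-gradient solver, the uniform gene weights, and the toy reference matrix are all assumptions made for illustration; the caption does not specify how AdRoit builds its weights or its regularizer.

```python
def weighted_ridge_nnls(X, y, w, lam=1e-6, n_iter=2000):
    """Minimize 0.5*sum_g w[g]*(y[g] - X[g].tau)^2 + 0.5*lam*|tau|^2 with tau >= 0.

    X: list of rows, one per gene, holding reference expression per cell type.
    y: expression of each gene in the mixed (bulk or spatial) sample.
    w: non-negative weight per gene (hypothetical; uniform in the toy below).
    Returns tau rescaled to sum to 1, read as cell type proportions.
    """
    n_genes, n_types = len(X), len(X[0])
    # Normal-equation pieces: A = X^T W X + lam*I and b = X^T W y.
    A = [[sum(w[g] * X[g][i] * X[g][j] for g in range(n_genes))
          + (lam if i == j else 0.0)
          for j in range(n_types)] for i in range(n_types)]
    b = [sum(w[g] * X[g][i] * y[g] for g in range(n_genes))
         for i in range(n_types)]
    # A is positive semidefinite, so its trace bounds its largest eigenvalue;
    # 1/trace is therefore a safe step size for projected gradient descent.
    step = 1.0 / sum(A[k][k] for k in range(n_types))
    tau = [0.0] * n_types
    for _ in range(n_iter):
        grad = [sum(A[i][j] * tau[j] for j in range(n_types)) - b[i]
                for i in range(n_types)]
        # Gradient step followed by projection onto the non-negative orthant.
        tau = [max(0.0, t - step * g) for t, g in zip(tau, grad)]
    total = sum(tau)
    return [t / total for t in tau] if total > 0 else tau


# Toy reference (6 genes x 3 cell types) and a noiseless mixture; made-up numbers.
reference = [[10, 1, 0], [0, 8, 1], [1, 0, 12], [5, 5, 5], [2, 9, 0], [7, 0, 3]]
true_props = [0.5, 0.3, 0.2]
mixed = [sum(r * p for r, p in zip(row, true_props)) for row in reference]
weights = [1.0] * len(reference)
estimated = weighted_ridge_nnls(reference, mixed, weights)
```

In this sketch, raising the weight of a gene makes its residual count more in the fit, which is one plausible way for cell type specificity and cross-sample variability to enter such a model.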

**Fig. 2. Benchmark on simulated bulk data generated from the trabecular meshwork (TM) single cells.**
a AdRoit's estimates are closer to the true cell proportions than those of Bisque, MuSiC, and SPOTlight. Each dot is a cell type from a donor. The performance metrics were derived from eight distinct donors. b For each cell type in TM, AdRoit has the smallest differences from the true cell type proportion and the smallest variance of estimates across eight distinct donors. For each cell type, a dot on the graph denotes a donor, and the bars represent the 1.5× interquartile ranges. The reference and gene weight estimations used for deconvoluting each synthetic bulk sample exclude the data from that sample (leave-one-out).

**Fig. 3. AdRoit can achieve a high granularity and exhibits good sensitivity and specificity in complex tissues.**
a AdRoit is accurate in deconvoluting the simulated bulk samples that contain a mixture of similar cell types from the myeloid or lymphoid lineage. The vertical dashed lines indicate the true mixing proportions. *CD14*⁺ monocytes, *FCGR3A*⁺ monocytes and dendritic cells (DC) were mixed under three schemes of proportions: 0.33:0.33:0.33 (mix0), 0.1:0.45:0.45 (mix1) and 0.1:0.3:0.6 (mix2). The same ratios were applied to the mixtures of naïve *CD4*⁺ T, memory *CD4*⁺ T, and *CD8*⁺ cells. Each boxplot was derived based on n = 100 independent simulations, with bars denoting the 1.5× interquartile ranges. b AdRoit’s estimates are more accurate and specific than those from Bisque, MuSiC, and SPOTlight on synthetic samples that contain only 6 out of the 12 cell types. The deconvolution was done using all 12 cell types as the reference. A pair of size-matched blue (true value) and red (estimated value) bubbles indicates an accurate prediction. Red-only and blue-only bubbles mark false positives and false negatives, respectively. c The comparison of receiver operating characteristic (ROC) curves (n = 8 independent donors) shows that AdRoit has a notably higher area under the curve (AUC) than other methods, meaning better sensitivity and specificity. d Scatterplots between the ground truth and the deconvoluted cell proportions in the simulated bulk samples of high complexity (mixtures of 30 cell types). e ROC curves (n = 100 independent simulations) show AdRoit has the best AUC among all methods on highly complex cell constitutions.
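
The AUC comparison in panels c and e can be illustrated with a small sketch. It assumes that each cell type in a synthetic sample is labeled as present or absent and that the estimated proportion serves as the score; the caption does not state how the curves were constructed, so this labeling is an assumption. The function name `roc_auc` and the numbers are hypothetical. The AUC is computed through its pairwise-ranking interpretation: the fraction of (present, absent) pairs in which the present cell type receives the higher estimate, with ties counted as half.

```python
def roc_auc(is_present, scores):
    """Area under the ROC curve from presence labels and estimated proportions."""
    pos = [s for p, s in zip(is_present, scores) if p]
    neg = [s for p, s in zip(is_present, scores) if not p]
    if not pos or not neg:
        raise ValueError("need at least one present and one absent cell type")
    # Count pairs ranked correctly; a tie contributes half a win.
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical sample: two cell types truly present, two absent.
present = [True, True, False, False]
estimated_props = [0.30, 0.10, 0.20, 0.00]
auc = roc_auc(present, estimated_props)
```

A method that assigns non-zero proportions to absent cell types (the red-only bubbles in panel b) loses pairs in this count, which lowers its AUC.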

**Fig. 4. Benchmark on simulated bulk data generated using mouse dorsal root ganglion (DRG) cells containing closely related subtypes of neurons.**
a 14 cell types are identified from scRNA-Seq samples of 5 mice, including multiple subtypes of neurofilament (NF), peptidergic (PEP), and non-peptidergic (NP) neurons. b Benchmarking with the synthetic data shows the cell type proportions inferred by AdRoit are more accurate. In particular, AdRoit maintains better accuracy when the cells are rare (e.g., <5%; see also the zoomed-in insets). Each dot represents a cell type from one sample. c For each sample, mAD, RMSD, Pearson, and Spearman correlations are compared across four methods. AdRoit has the lowest mAD and RMSD, and the highest Pearson and Spearman correlations. In addition, AdRoit’s estimation is the most stable across samples. Each boxplot was generated based on n = 5 distinct mice (one dot represents one animal). The bar of each boxplot indicates the 1.5× interquartile range. The same animals are connected by dotted lines across the methods. The deconvolution was done by using the leave-one-out strategy.
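
The four per-sample metrics named in panel c can be computed as in the sketch below. It assumes that mAD denotes the mean absolute difference and RMSD the root mean square deviation between true and estimated proportions; the caption does not expand these abbreviations. The function names and example numbers are hypothetical, and the correlations are undefined when either input is constant.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if an input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def deconvolution_metrics(true_props, est_props):
    """Compare true and estimated cell type proportions for one sample."""
    diffs = [e - t for t, e in zip(true_props, est_props)]
    return {
        "mAD": sum(abs(d) for d in diffs) / len(diffs),
        "RMSD": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "Pearson": pearson(true_props, est_props),
        # Spearman correlation is the Pearson correlation of the ranks.
        "Spearman": pearson(average_ranks(true_props), average_ranks(est_props)),
    }


# Hypothetical sample with three cell types.
metrics = deconvolution_metrics([0.5, 0.3, 0.2], [0.45, 0.35, 0.2])
```

Lower mAD and RMSD and higher correlations indicate closer agreement, which is the direction of comparison used in the panel.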

**Fig. 5. AdRoit shows a good accuracy and sensitivity in deconvoluting spatial spots simulated from dorsal root ganglion cells.**
a Estimations from AdRoit, Cell2location, Stereoscope, and SPOTlight on simulated spatial spots that contain 5 PEP neuron subtypes. True mixing proportions are denoted by the red dashed lines. Three schemes are presented: (1) the proportions of the 5 PEP cell types are the same and equal to 0.2; (2) PEP1_Dcn is 0.1 and the other 4 are 0.225; (3) PEP1_Dcn and PEP1_S100a11.Tagln2 are 0.1, PEP1_Slc7a3.Sstr2 and PEP2_Htr3a.Sema5a are 0.2, and PEP3_Trpm8 is 0.4. The boxplots were derived from n = 100 independent simulations. b The performance of AdRoit, Cell2location, Stereoscope, and SPOTlight in estimating rare cell populations in the spatial spots. The spots contain a mixture of three PEP cell subtypes (i.e., PEP1_Slc7a3.Sstr2, PEP2_Htr3a.Sema5a, and PEP3_Trpm8), with the percentage of PEP3_Trpm8 ranging from 1 to 10% and the other two cell types sharing the remaining proportion equally. The boxplots were drawn from n = 100 independent simulations. c Comparison of the rates of detecting rare cells in simulated spots. An inferred percentage greater than 0.5% is deemed a positive detection. Six sets of cell mixtures are employed: NF_Calb1 with NF_Pvalb and NF2_Ntrk2.Necab2 (NF subtypes), NP_Mrgpra3 with NP_Mrgprd and NP_Nts (NP subtypes), PEP3_Trpm8 with PEP1_Slc7a3.Sstr2 and PEP2_Htr3a.Sema5a (PEP subtypes), NF_Calb1 with Th, satellite glia and endothelial (NF_Calb1 + others), NP_Mrgpra3 with Th, satellite glia and endothelial (NP_Mrgpra3 + others), and PEP_Trpm8 with Th, satellite glia and endothelial (PEP_Trpm8 + others). In each set, the first cell type listed is the detection target, and its percentage varies from 1 to 10%. The other cell types split the remaining proportion evenly. The red dashed lines mark the detection rate of 90%. The rates were computed based on n = 100 independent simulations. Bars in the boxplots mark the 1.5× interquartile ranges.
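
The detection rule in panel c translates directly into a short calculation: an estimate counts as a positive detection when it exceeds 0.5%, and the rate is the fraction of simulations with a positive detection. The sketch below assumes estimates are stored as fractions (so 0.5% is 0.005); the function name `detection_rate` and the example estimates are hypothetical.

```python
def detection_rate(estimated_props, threshold=0.005):
    """Fraction of simulated spots where the target cell type is detected.

    estimated_props: inferred proportion of the target cell type in each
    simulated spot, as fractions between 0 and 1.
    threshold: detection cutoff; 0.005 corresponds to the 0.5% in the caption.
    """
    if not estimated_props:
        raise ValueError("need at least one simulation")
    # Strictly greater than the cutoff, matching "greater than 0.5%".
    detected = sum(1 for p in estimated_props if p > threshold)
    return detected / len(estimated_props)


# Hypothetical estimates of a rare cell type across four simulated spots.
rate = detection_rate([0.0, 0.004, 0.006, 0.02])
```

Computing this rate at each true percentage from 1 to 10% gives one curve per method, which can then be compared against the 90% line marked in the panel.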

**Fig. 6. Applications to real bulk RNA-Seq data and mouse brain spatial transcriptome data.**
a The deconvoluted cell compositions in the real bulk RNA-Seq data of human islets are highly reproducible for the repeated samples from the same donor. b AdRoit's estimates of the cell type proportions agree with the RNA-FISH measurements. c AdRoit-inferred beta-cell proportions in type 2 diabetes patients (n = 13 distinct subjects) are significantly lower than those in healthy subjects (n = 26 distinct subjects). Bars in the boxplots represent the 1.5× interquartile ranges. In addition, the estimated proportions have a significant negative linear association with the HbA1C levels (n = 36 distinct donors with valid HbA1C measurements). All statistical metrics were derived based on. d The spatial mapping of four mouse brain cell types is consistent with the locations of four region-specific markers shown on the ISH images obtained from the Allen Mouse Brain Atlas. The four genes, *Spink8*, *C1ql2*, *Clic6*, and *Synpo2*, were identified by Zeisel et al. as markers of the hippocampal field CA1, dentate gyrus, choroid plexus, and thalamus, respectively.
