Bioinformatics. 2023 May 4;39(5):btad278.
doi: 10.1093/bioinformatics/btad278.

ICAT: a novel algorithm to robustly identify cell states following perturbations in single-cell transcriptomes

Dakota Y Hawkins et al. Bioinformatics. 2023.

Abstract

Motivation: The detection of distinct cellular identities is central to the analysis of single-cell RNA sequencing (scRNA-seq) experiments. However, in perturbation experiments, current methods typically fail to correctly match cell states between conditions or erroneously remove population substructure. Here, we present the novel, unsupervised algorithm Identify Cell states Across Treatments (ICAT) that employs self-supervised feature weighting and control-guided clustering to accurately resolve cell states across heterogeneous conditions.

Results: Using simulated and real datasets, we show ICAT is superior in identifying and resolving cell states compared with current integration workflows. While requiring no a priori knowledge of extant cell states or discriminatory marker genes, ICAT is robust to low signal strength, high perturbation severity, and disparate cell type proportions. We empirically validate ICAT in a developmental model and find that only ICAT identifies a perturbation-unique cellular response. Taken together, our results demonstrate that ICAT offers a significant improvement in defining cellular responses to perturbation in scRNA-seq data.

Availability and implementation: https://github.com/BradhamLab/icat.

Conflict of interest statement

None declared.

Figures

Figure 1.
Overview of the ICAT algorithm. (A) The schematic illustrates the ICAT_C implementation of ICAT. To identify cell states across treatments, ICAT first performs self-supervised feature weighting to find genes that discriminate cell identities among control cells alone, then performs semisupervised clustering on the newly transformed expression matrix. To learn feature weights, ICAT clusters control cells using Louvain community detection, then passes the resulting cluster labels to neighborhood component feature selection (NCFS) to weight genes by their predictiveness. After applying the learned gene weights to the original gene expression matrix, ICAT clusters treated and control cells together using a semisupervised implementation of Louvain community detection. During this process, ICAT holds the previously identified cluster labels for the control cells immutable. (B) The schematic illustrates the ICAT_C+T implementation, which extends feature weighting to treated cells to identify asymmetrical populations between treatments. Cells are split by treatment and independently clustered using the Louvain method; the cluster labels are then used to learn gene weights with NCFS in each treatment set independently. To retain asymmetrically informative genes, the weights for each treatment are concatenated row-wise and then reduced to the maximum weight per gene using a row-wise maxpool function. The reduced weight vector is then used to transform the original count matrix.
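The weight-pooling step described in panel (B) can be sketched in a few lines. This is a minimal illustration in plain Python, not the ICAT implementation: the function names `maxpool_weights` and `apply_weights` and the toy numbers are hypothetical, the per-treatment weights are assumed to have been learned already (NCFS itself is not shown), and the transformation is assumed to be a per-gene scaling of a cells x genes matrix.

```python
def maxpool_weights(weights_by_treatment):
    """Reduce per-treatment gene weights to one weight per gene (the maximum)."""
    n_genes = {len(w) for w in weights_by_treatment}
    if len(n_genes) != 1:
        raise ValueError("every treatment needs one weight per gene")
    # zip(*...) walks gene by gene across the stacked treatment weight vectors.
    return [max(gene_weights) for gene_weights in zip(*weights_by_treatment)]


def apply_weights(expression, weights):
    """Scale each gene (column) of a cells x genes matrix by its weight.

    Assumption: the transformation is a simple per-gene multiplication.
    """
    return [[value * w for value, w in zip(cell, weights)] for cell in expression]


# Toy example: 3 genes, weights learned separately in control and treated cells.
control_weights = [0.9, 0.0, 0.1]  # gene 0 separates control clusters
treated_weights = [0.2, 0.8, 0.1]  # gene 1 is informative only after treatment
pooled = maxpool_weights([control_weights, treated_weights])

expression = [[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]]     # 2 cells x 3 genes
weighted = apply_weights(expression, pooled)
```

Taking the maximum rather than the mean means a gene that is informative in only one treatment keeps its full weight, which is the stated purpose of retaining asymmetrically informative genes.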
Figure 2.
ICAT correctly identifies cell states in distinct experimental compositions. (A) UMAP projections of different cellular compositions in simulated datasets. Each marker represents a cell: circles are control cells and crosses are treated cells. Cells are colored by ground truth identity (left column), by cluster label produced by Louvain community detection on the raw count matrix (middle column), and by cluster label produced by ICAT (right column). (B) Average agreement between ground truth labels and cluster labels produced by clustering the raw data (blue) and by ICAT (orange), as measured by the adjusted Rand index (ARI). Error bars represent the 95% CIs for the mean ARI for each method. Five cellular composition conditions were simulated: “All same,” both control and treated cells share the same two cell states; “Rx Unique,” treated cells contain a treatment-unique cell state; “Control Unique,” control cells contain a unique cell state; “Both Unique,” both treated and control cells contain treatment-specific cell states; and “None Same,” no shared cell states between treated and control cells. Each condition was simulated 15 times (n = 15). Simulations were evaluated using the ICAT_C implementation.
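For readers unfamiliar with the ARI used in panel (B), the sketch below computes it from two label vectors using only the Python standard library. It follows the standard pair-counting definition of the adjusted Rand index; the function name and toy labels are illustrative and are not taken from the ICAT code base.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # zero or one cell: the labelings trivially agree

    # Count pairs of cells grouped together in both, in truth, and in prediction.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs  # chance agreement
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
perfect = adjusted_rand_index(truth, ["b", "b", "a", "a"])  # same partition, renamed
scrambled = adjusted_rand_index(truth, [0, 1, 0, 1])        # worse than chance
```

Because the index depends only on which cells are grouped together, cluster names do not need to match the ground truth names; values near 1 indicate agreement and values near 0 indicate chance-level agreement.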
Figure 3.
ICAT outperforms current methods for cell state identification and is robust to experimental conditions. (A) UMAP projections of raw (left), ICAT_C processed (middle), and ICAT_C+T processed (right) count matrices for the simulated data. Projections show that ICAT_C and ICAT_C+T correctly mix shared populations (red and green), whereas only ICAT_C+T isolates asymmetrical populations (purple, yellow). ICAT performance for simulated data was further evaluated using the ICAT_C+T implementation only. (B) Select marker and perturbed gene expression patterns are displayed as violin plots for three simulated control cell types (1–3) under normal (C) and perturbed (P) conditions. P(C1)+ is a stimulated and perturbed version of cell type 1; perturbation-specific cell types P4 and P5 express distinct marker genes. (C) The percentage of perturbed genes used to assess robustness to perturbation severity is shown; values range from 1% to 25%. (D) The average number of marker genes per cell identity used to assess robustness to signal strength is shown; values range from 10 to 105 (0.7%–7% of total genes). (E) The set of cell identity proportions used to test the ability to identify rare cell states is shown. The number of cells per treatment-label pair ranges from 50 to 175. (F) Method performance is compared as the fraction of perturbed genes increases (F1) and as the average number of marker genes per population increases (F2). Left panels display results for stand-alone methods: ICAT, No Int. Seurat, Scanorama, and Harmony; right panels show results for ICAT-extended workflows: ICAT_Seurat, ICAT_Scan, and ICAT_Harm. Results are depicted as averages with 95% CIs shown by shading. (G) Method performance is compared as the proportion of cell types is varied. The Gini coefficient reflects the degree of population size disproportion among cell states. Left panel: ICAT, Seurat, Scanorama, and Harmony. Right panel: ICAT_Seurat, ICAT_Scan, and ICAT_Harm.
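The Gini coefficient in panel (G) summarizes how unequal the population sizes are. The sketch below uses the common mean-absolute-difference form of the coefficient; the caption does not say which variant the authors used, so this form, the function name `gini`, and the example sizes are assumptions for illustration only.

```python
def gini(sizes):
    """Gini coefficient of population sizes: 0 when all sizes are equal."""
    n = len(sizes)
    total = sum(sizes)
    if n == 0 or total <= 0 or any(s < 0 for s in sizes):
        raise ValueError("sizes must be non-negative with a positive sum")
    # Sum of absolute differences over all ordered pairs of populations.
    abs_diff_sum = sum(abs(a - b) for a in sizes for b in sizes)
    return abs_diff_sum / (2 * n * total)


balanced = gini([100, 100, 100, 100])  # equally sized populations
skewed = gini([0, 0, 0, 10])           # all cells in one population
mixed = gini([50, 175])                # the extreme sizes quoted in panel (E)
```

With this form, four equally sized populations give 0 and a single dominant population among four gives 0.75, so larger values correspond to rarer cell states in the simulated mixtures.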
Figure 4.
ICAT outperforms current integration methods in identifying cell states across treatments in real datasets. scRNA-seq data from the Tian, Kagohara, and Kang studies are compared. (A) Spider plots compare the performance of each algorithm within each dataset for the ARI, LISI, and DB quality metrics. (B) Lollipop plots highlight the differences in the metrics for each method across the same datasets. (C) Spider plots compare Seurat to ICAT_Seurat, Scanorama to ICAT_Scan, and Harmony to ICAT_Harm, respectively. All metrics are scaled from 0 to 1, where 1 is best (closest to the apex of each corner). DB, Davies–Bouldin metric; other abbreviations as in Fig. 2. ICAT performance for the Tian dataset was evaluated using the ICAT_C+T implementation, whereas the Kagohara and Kang datasets were evaluated using the ICAT_C implementation.
Figure 5.
ICAT most accurately defines subpopulation response to perturbation. SMART-seq2 was performed on cells isolated from controls or from embryos treated with the perturbants chlorate or MK-886. (A) ICAT predicts five clusters from the combined data (A1), with treatment-dependent subpopulation compositions (G-test likelihood test, FDR < 0.05; ad hoc pairwise Fisher’s Exact test, FDR < 0.05) (A2). MK-induced differences are defined by reciprocal expression patterns of ICAT-identified genes, sm50 and pks2, in subpopulations 3 (sm50+/pks2−) and 4 (sm50−/pks2+) (A3). (B) ICAT predictions for control (B1, B3) and MK-886 (B2, B4) are validated by HCR FISH analysis (B5, B6). Automated detection and segmentation of PMCs in vivo from 3D image data are shown as various colors; each color represents an individual cell (1–2). The expression of sm50 (red) and pks2 (cyan) transcripts is shown in the same embryos; arrowheads indicate PMCs that express only sm50 or only pks2 (3–4). MK-886 treated embryos exhibit statistically significantly fewer sm50+/pks2− PMCs (B5) and more sm50−/pks2+ PMCs (B6) than controls, consistent with the predictions from ICAT (A). Dot plot error bars denote 95% CIs of the mean, with each dot representing an individual embryo [n_DMSO = 26, n_MK886 = 37; B5: Binomial GLM, μ_DMSO = 3.59%, μ_MK886 = 1.91%; β_MK886 = −0.71, P < 10^−3; B6: Binomial GLM, μ_DMSO = 0.19%, μ_MK886 = 2.71%; β_MK886 = 2.60, P < 10^−4]. (C) Seurat (C1) fails to find any treatment effect on PMC subpopulation composition, while Scanorama (C2) and Harmony (C3) predict one and two disrupted populations, respectively (G-test likelihood test, FDR < 0.05; ad hoc pairwise Fisher’s Exact test, FDR < 0.05).
