. 2020 Jun;16(6):e9389.

doi: 10.15252/msb.20199389.

scClassify: sample size estimation and multiscale classification of cells using single and multiple reference

Yingxin Lin^{1

2}, Yue Cao^{1

2}, Hani Jieun Kim^{1

2

3}, Agus Salim^{4

5

6}, Terence P Speed⁶, David M Lin⁷, Pengyi Yang^{1

2

3}, Jean Yee Hwa Yang^{1

2}

Affiliations

¹ School of Mathematics and Statistics, University of Sydney, Sydney, NSW, Australia.
² Charles Perkins Centre, University of Sydney, Sydney, NSW, Australia.
³ Computational Systems Biology Group, Children's Medical Research Institute, University of Sydney, Westmead, NSW, Australia.
⁴ Department of Mathematics and Statistics, La Trobe University, Bundoora, VIC, Australia.
⁵ Baker Heart and Diabetes Institute, Melbourne, VIC, Australia.
⁶ Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia.
⁷ Department of Biomedical Sciences, Cornell University, Ithaca, NY, USA.

PMID: 32567229
PMCID: PMC7306901
DOI: 10.15252/msb.20199389

scClassify: sample size estimation and multiscale classification of cells using single and multiple reference

Yingxin Lin et al. Mol Syst Biol. 2020 Jun.

. 2020 Jun;16(6):e9389.

doi: 10.15252/msb.20199389.

Authors

Yingxin Lin^{1

2}, Yue Cao^{1

2}, Hani Jieun Kim^{1

2

3}, Agus Salim^{4

5

6}, Terence P Speed⁶, David M Lin⁷, Pengyi Yang^{1

2

3}, Jean Yee Hwa Yang^{1

2}

Affiliations

¹ School of Mathematics and Statistics, University of Sydney, Sydney, NSW, Australia.
² Charles Perkins Centre, University of Sydney, Sydney, NSW, Australia.
³ Computational Systems Biology Group, Children's Medical Research Institute, University of Sydney, Westmead, NSW, Australia.
⁴ Department of Mathematics and Statistics, La Trobe University, Bundoora, VIC, Australia.
⁵ Baker Heart and Diabetes Institute, Melbourne, VIC, Australia.
⁶ Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia.
⁷ Department of Biomedical Sciences, Cornell University, Ithaca, NY, USA.

PMID: 32567229
PMCID: PMC7306901
DOI: 10.15252/msb.20199389

Abstract

Automated cell type identification is a key computational challenge in single-cell RNA-sequencing (scRNA-seq) data. To capitalise on the large collection of well-annotated scRNA-seq datasets, we developed scClassify, a multiscale classification framework based on ensemble learning and cell type hierarchies constructed from single or multiple annotated datasets as references. scClassify enables the estimation of sample size required for accurate classification of cell types in a cell type hierarchy and allows joint classification of cells when multiple references are available. We show that scClassify consistently performs better than other supervised cell type classification methods across 114 pairs of reference and testing data, representing a diverse combination of sizes, technologies and levels of complexity, and further demonstrate the unique components of scClassify through simulations and compendia of experimental datasets. Finally, we demonstrate the scalability of scClassify on large single-cell atlases and highlight a novel application of identifying subpopulations of cells from the Tabula Muris data that were unidentified in the original publication. Together, scClassify represents state-of-the-art methodology in automated cell type identification from scRNA-seq data.

Keywords: cell type hierarchy; cell type identification; multiscale classification; sample size estimation; single-cell.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no conflict of interest.

Figures

**Figure 1. scClassify framework and ensemble model construction (see also Fig EV1)**
Schematic illustration of the scClassify framework. Gene selections: DE, differentially expressed; DD, differentially distributed; DV, differentially variable; BD, bimodally distributed; DP, differentially expressed proportions. Similarity metrics: P, Pearson's correlation; S, Spearman's correlation; K, Kendall's correlation; J, Jaccard distance; C, cosine distance; W, weighted rank correlation.
Schematic illustration of the joint classification using multiple reference datasets.
Classification accuracy of all pairs of reference and test datasets was calculated using all combinations of six similarity metrics and five gene selection methods.
Improvement in classification accuracy after applying an ensemble learning model over the best single model (i.e. weighted kNN + Pearson + DE).

**Figure EV1. scClassify evaluation framework and data collections. Related to Fig 1**
Evaluation framework used in this study. Predictions are classified into “correctly classified”, “misclassified”, “intermediate” (either correct or incorrect), “incorrectly unassigned”, “incorrectly assigned” or “correctly unassigned”.
All datasets used in this study, including two collections for benchmarking and five collections for case study in sample size learning and rare cell type identification.

**Figure 2. Benchmarking scClassify against alternative methods on cell type classification of the pancreas and PBMC data collection (see also Fig EV2)**
Performance evaluation for 16 methods on 30 training and test data pairs from the pancreas data collection. Each boxplot ranges from the first to third quartile of classification accuracy for each method with the median of classification accuracy as the horizontal line. The lower and higher whiskers of boxplot are extended to the first quartile minus 1.5 interquartile range and the third quartile plus 1.5 interquartile, respectively.
Performance evaluation for 16 methods on 84 training and test data pairs from the PBMC data collection. Each boxplot ranges from the first to third quartile of classification accuracy for each method with the median of classification accuracy as the horizontal line. The lower and higher whiskers of boxplot are extended to the first quartile minus 1.5 interquartile range and the third quartile plus 1.5 interquartile, respectively.
A 1‐by‐3 panel of dot plots indicating the rankings of each method in 114 pairs of reference and testing data pairs. The x‐axis refers to the different methods, and the y‐axis refers to the reference and testing data pairs. The dots are coloured by rank, and the size of the dots indicates the degree of accuracy.

**Figure EV2. Benchmark results for 16 different methods and sensitivity analysis of hyperparameters of scClassify. Related to Fig 2**
Benchmarking results for 16 different methods. Each bar indicates the composition of predicted categories of the average performance in a collection of reference–testing pairs. We divided reference–test pairs into four groups: pancreas (easy), pancreas (hard), PBMC (level 1) and PBMC (level 2).
Each box indicates the classification accuracy by subsampling 80% of training data, repeated 10 times, using 30 training and test data pairs from the pancreas data collection. The red dots indicate the classification accuracy of using the full training data. Each boxplot ranges from the first to third quartile of classification accuracy with the median as the horizontal line. The lower and higher whiskers of boxplot are extended to the first quartile minus 1.5 interquartile range and the third quartile plus 1.5 interquartile, respectively.
Each box indicates the classification accuracy with different number of nearest neighbours (k = 5, 10, 15 and 20) to be considered in the weighted kNN, using 30 training and test data pairs from the pancreas data collection, with 16 easy cases (top panel) and 14 hard cases (bottom panel). Each boxplot ranges from the first to third quartile of classification accuracy with the median as the horizontal line. The lower and higher whiskers of boxplot are extended to the first quartile minus 1.5 interquartile range and the third quartile plus 1.5 interquartile, respectively.
Each box indicates the classification accuracy with different correlation thresholds determined, either dynamic or pre‐defined (ranging from 0 to 0.8), using 30 training and test data pairs from the pancreas data collection, with 16 easy cases (top panel) and 14 hard cases (bottom panel). Each boxplot ranges from the first to third quartile of classification accuracy with the median as the horizontal line. The lower and higher whiskers of boxplot are extended to the first quartile minus 1.5 interquartile range and the third quartile plus 1.5 interquartile, respectively.

**Figure 3. Sample size estimation for cell type classification (see also Fig EV3)**
Schematic illustration of the scClassify sample size learning framework.
Scatter plot of sample size estimation based on the pilot data (horizontal axis) compared with accuracy results from the *in silico* experiments (vertical axis).
Sample size estimation from the PBMC data collection. Sample size learning curve with the horizontal axis representing sample size (N) and the vertical axis representing classification accuracy. The learning curves for the different datasets provide estimates of the sample size required to identify cell types at the top (top panel) and second (bottom panel) levels of the cell type hierarchical tree.

**Figure EV3. Sample size estimation results. Related to Fig 3**
A 2‐by‐2 panel of collections of boxplots demonstrating the validation of the sample size calculation using the PBMC10k dataset. The x‐axis indicates the sample size (N), and the y‐axis indicates the accuracy rate. The left panel indicates the results for the pilot data (20% of the full dataset), and the right panel indicates the results for the reference–test data (the remaining 80% data), representing the data that would be obtained in a follow‐up experiment. The top panel indicates the results of predicting PBMC at the top level of the cell type tree, while the bottom panel indicates the results of cell type prediction at the second level of the cell type tree. Each boxplot ranges from the first to third quartile of classification accuracy with the median as the horizontal line. The lower and higher whiskers of boxplot are extended to the first quartile minus 1.5 interquartile range and the third quartile plus 1.5 interquartile, respectively.
The fitted learning curves on the same data where red solid lines indicate the learning curves by fitting mean accuracy rate of pilot data; red dashed lines are the learning curves obtained by fitting the learning curves to the upper (75%) and lower (25%) quartile of accuracy rate of pilot data. The blue lines indicate the learning curves by fitting the mean of the accuracy rate for the follow‐up reference and test dataset.
Down‐sampling of the PBMC10k data using DECENT's beta‐binomial capture model (Ye *et al*, 2019). Boxplots indicate the accuracy rates of the cell predictions from down‐sampled data for the top (red) and second level (blue) of the cell type tree. Each boxplot ranges from the first to third quartile of classification accuracy with the median as the horizontal line. The lower and higher whiskers of boxplot are extended to the first quartile minus 1.5 interquartile range and the third quartile plus 1.5 interquartile, respectively. The x‐axis indicates the down‐sampling parameter of the beta‐binomial distribution (that is, the ratio of capture efficiency in the down‐sampled dataset relative to the original dataset), and the y‐axis denotes the accuracy rate.

**Figure 4. *Post hoc* clustering of unassigned cells and joint classification of cell types using multiple reference datasets. (see also Fig EV4)**
Left panel shows cell types based on the original publication by Muraro *et al* (2016), Data ref: Muraro *et al* (2016). Middle panel shows the predicted cell types from scClassify trained on the reference dataset by Xin *et al* (2016), Data ref: Xin *et al* (2016). Note that the reference dataset does not contain the cell types acinar, ductal and stellate cells. Right panel shows *post hoc* clustering and cell typing results for cells that remained unassigned in the scClassify prediction.
Joint classification on the PBMC data collection. Classifying query datasets using the joint prediction from multiple reference datasets (red circle). Classification accuracy as well as unassigned and intermediate rate of the joint prediction is compared to that obtained from using single reference datasets (other colours).

**Figure EV4. *Post hoc* clustering and validation by marker genes. Related to Fig 4**
Heatmap of the top 20 differentially expressed genes from each of the five cell type clusters generated through *post hoc* clustering of the Xin‐Muraro data pair. Here, Xin *et al* data are used as the reference dataset and Muraro *et al* data as the query dataset. The heatmap is coloured by the log‐transformed expression values. The red rectangles indicate markers that are consistent with those found in the original study.
A 1‐by‐3 panel of tSNE plots of Wang *et al* from the human pancreas data collection colour‐coded by original cell types given in Wang *et al* (2016) (left panel), the scClassify label generated using Xin *et al* as the reference dataset (middle panel) and the scClassify predicted cell types after performing *post hoc* clustering (right panel).
Heatmap of the top 20 differentially expressed genes from each of the two cell type clusters generated from *post hoc* clustering of the Xin‐Wang data pair. The heatmap is colour‐coded by the log‐transformed expression level. The red rectangles indicate markers that are consistent with those found in the original study.

**Figure 5. Cross‐platform classification of cell types using scClassify**
tSNE visualisation of the Tabula Muris Microfluidic dataset (The Tabula Muris Consortium, 2018; Data ref: The Tabula Muris Consortium 2018). Cell typing is based on either the original publication (left panel) or scClassify prediction (right panel). scClassify was applied to the Tabula Muris Microfluidic data collection with the Tabula Muris FACS dataset as reference data for model training.
Bar plot indicating the predicted cell types organised by tissue types when the Tabula Muris Microfluidic dataset was used as query and the Tabula Muris FACS dataset was used as reference.
Heatmap of data in (A) comparing the original cell types given in the Tabula Muris Microfluidic data (rows) against the scClassify predicted cell types (columns) generated using the Tabula Muris FACS data as the reference dataset.
scClassify prediction results of cells in lung tissue type in the Tabula Muris FACS data by using Cohen *et al* dataset as reference (Cohen *et al*, 2018; Data ref: Cohen *et al*, 2018). The large tSNE plot is the full data coloured by scClassify predicted cell types, while the smaller tSNE panels show two subsets of cells from the large plot, each coloured to highlight four marker genes, where the lighter yellow colour represents higher gene expression.

See this image and copyright information in PMC

References

1. Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders M, Mahfouz A (2019) A comparison of automatic cell identification methods for single‐cell RNA sequencing data. Genome Biol 20: 194 - PMC - PubMed
1. Alquicira‐Hernandez J, Sathe A, Ji HP, Nguyen Q, Powell JE (2019) scPred: accurate supervised method for cell‐type classification from single‐cell RNA‐seq data. Genome Biol 20: 264 - PMC - PubMed
1. Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, Chak S, Naikawadi RP, Wolters PJ, Abate AR et al (2019) Reference‐based analysis of lung single‐cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol 20: 163 - PMC - PubMed
1. Bakken T, Cowell L, Aevermann BD, Novotny M, Hodge R, Miller JA, Lee A, Chang I, McCorrison J, Pulendran B et al (2017) Cell type discovery and representation in the era of high‐content single cell phenotyping. BMC Bioinformatics 18: 559 - PMC - PubMed
1. Baron M, Veres A, Wolock SL, Faust AL, Gaujoux R, Vetere A, Ryu JH, Wagner BK, Shen‐Orr SS, Klein AM et al (2016) A single‐cell transcriptomic map of the human and mouse pancreas reveals inter‐and intra‐cell population structure. Cell Syst. 3: 346–360 - PMC - PubMed

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Associated data

Actions
- Search in PubMed
- Search in GEO
Actions
- Search in PubMed
- Search in GEO
Actions
- Search in PubMed
- Search in GEO
Actions
- Search in PubMed
- Search in GEO
Actions
- Search in PubMed
- Search in GEO
Actions
- Search in PubMed
- Search in GEO

Grants and funding

R21 DC015107/DC/NIDCD NIH HHS/United States

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

scClassify: sample size estimation and multiscale classification of cells using single and multiple reference

Affiliations

scClassify: sample size estimation and multiscale classification of cells using single and multiple reference

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Associated data

Grants and funding

LinkOut - more resources

Full Text Sources