iScience. 2020 Dec 8;24(1):101913.
doi: 10.1016/j.isci.2020.101913. eCollection 2021 Jan 22.

CellO: comprehensive and hierarchical cell type classification of human cells with the Cell Ontology

Matthew N Bernstein et al. iScience.

Abstract

Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. In this work, we present CellO, a machine learning-based tool for annotating human RNA-seq data with the Cell Ontology. CellO enables accurate and standardized cell type classification of cell clusters by considering the rich hierarchical structure of known cell types. Furthermore, CellO comes pre-trained on a comprehensive data set of human, healthy, untreated primary samples in the Sequence Read Archive. CellO's comprehensive training set enables it to run out of the box on diverse cell types and achieves competitive or even superior performance when compared to existing state-of-the-art methods. Lastly, CellO's linear models are easily interpreted, thereby enabling exploration of cell-type-specific expression signatures across the ontology. To this end, we also present the CellO Viewer: a web application for exploring CellO's models across the ontology.

Keywords: Classification of Bioinformatical Subject; Genomic Analysis; Genomics.

Conflict of interest statement

The authors declare no competing interests.

Figures

Graphical abstract
Figure 1
Overview of CellO. (A) A schematic overview of CellO's hierarchical classification approach. CellO performs hierarchical classification with the Cell Ontology. Given a gene expression profile, CellO annotates the cell with a set of cell types (gray nodes) that are consistent with the hierarchical structure of the Cell Ontology. (B) We compare CellO to eight recent cell type annotation methods regarding the criteria we surmise are desirable in a cell type classification approach: whether the method (1) arrives pre-trained and can run out of the box, (2) incorporates a hierarchy of cell types, (3) specifically uses the Cell Ontology as its hierarchy, (4) requires cell-type-specific marker genes, (5) uses a model that can be interrogated to better understand how it arrived at its decision, and (6) operates on clusters or single cells. We compare CellO to scMatch (Hou et al., 2019), SingleR (Aran et al., 2019), scCatch (Shao et al., 2020), CHETAH (de Kanter et al., 2019), Garnett (Pliner et al., 2019), CellAssign (Zhang et al., 2019a), ACTINN (Ma and Pellegrini, 2020), scPred (Alquicira-Hernandez et al., 2019), CaSTLe (Lieberman et al., 2018), and SingleCellNet (Tan and Cahan, 2019). CellO meets more desirable criteria than existing methods. (C) Euler diagrams of the cell types within the bulk RNA-seq expression profiles used to train CellO. This training set comprises most primary cell bulk RNA-seq samples within the SRA and consists of diverse cell types spanning various tissues, developmental stages, and stages of differentiation. These diagrams were created with nVenn (Pérez-Silva et al., 2018).
Figure 2
Overview of analyses and CellO's algorithm. (A) A schematic illustration of the data sets and analyses performed in this study. Initial candidate bulk RNA-seq samples were selected from the SRA via the MetaSRA, filtered for errors, and quantified using the kallisto algorithm (Bray et al., 2016), which resulted in a comprehensive bulk RNA-seq training set consisting of healthy, human primary cells. This training set was split into a pre-training and validation set for tuning the parameters of the binary classifiers, as well as for evaluating the graph correction methods (Transparent Methods). The full bulk RNA-seq data set was then used to train the final models that were then evaluated on three sets of scRNA-seq data. The first set consisted of an aggregation of diverse non-droplet-based data sets from the SRA. The second data set consisted of FAC-sorted PBMCs from Zheng et al. (2017). The third set consisted of primary lung tumor cells from Laughney et al. (2020). (B) A schematic illustration of CellO's classification procedure. First, for a given sample, the raw classifier probabilities are corrected with the Cell Ontology using IR (if CLR is used, this step is not necessary). We illustrate one edge of the graph whose incident nodes have probabilities that are logically inconsistent with the hierarchy and thus require correction because the child node has a higher probability than the parent. Once corrected, cell types whose raw probabilities meet their respective decision threshold are selected. Among these, the most specific cell types (i.e., lowest in the ontology) are examined and the cell type with the highest output probability is selected. CellO outputs this final selected cell type along with all ancestor terms.
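The selection procedure described in panel (B) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CellO's implementation: the correction step here is a simple stand-in that caps each probability at the minimum of its ancestors' probabilities (it is not the IR method named in the caption), thresholds are applied to the corrected values, and the ontology fragment, probabilities, thresholds, and function names are all hypothetical.

```python
from typing import Dict, List, Set


def ancestors(node: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return all ancestors of `node` in the ontology DAG (node itself excluded)."""
    seen: Set[str] = set()
    stack = list(parents.get(node, []))
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen


def clamp_to_hierarchy(probs: Dict[str, float],
                       parents: Dict[str, List[str]]) -> Dict[str, float]:
    """Stand-in correction (an assumption, NOT the IR method): cap each
    probability at the minimum probability among the node's ancestors, so
    that no child ends up more probable than any of its ancestors."""
    return {
        node: min([p] + [probs[a] for a in ancestors(node, parents)])
        for node, p in probs.items()
    }


def select_cell_types(probs: Dict[str, float],
                      parents: Dict[str, List[str]],
                      thresholds: Dict[str, float]) -> Set[str]:
    """Pick the most specific, highest-probability cell type and its ancestors."""
    corrected = clamp_to_hierarchy(probs, parents)
    # Keep cell types that meet their own decision threshold.
    selected = {n for n, p in corrected.items() if p >= thresholds[n]}
    if not selected:
        return set()
    # A selected term is "most specific" if it is not an ancestor of
    # another selected term.
    covered = set().union(*(ancestors(n, parents) for n in selected))
    most_specific = sorted(selected - covered)
    best = max(most_specific, key=lambda n: corrected[n])
    return {best} | ancestors(best, parents)


# Hypothetical ontology fragment (child -> parents) and invented probabilities.
parents = {
    "lymphocyte": ["cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "monocyte": ["cell"],
}
raw = {"cell": 0.99, "lymphocyte": 0.6, "T cell": 0.9, "B cell": 0.2, "monocyte": 0.3}
thresholds = {name: 0.5 for name in raw}

print(select_cell_types(raw, parents, thresholds))
```

In this toy input the raw probability of "T cell" (0.9) exceeds that of its parent "lymphocyte" (0.6), which is the kind of inconsistent edge the caption describes; the stand-in correction lowers it to 0.6 before thresholding.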
Figure 3
Reconciling the outputs of independent classifiers with a hierarchy. (A) Average precision scores across all cell types for the independent classifiers (Ind.), as well as for IR, TPR, and BNC on the validation set. (B) Each paired sample and cell type prediction was considered independently. The set of all such predictions was ordered according to their prediction probability and the corresponding precision-recall curve was constructed for the independent classifiers, IR, TPR, and BNC.
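The pooled precision-recall construction in panel (B) can be illustrated as follows. This is a rough sketch, assuming each (sample, cell type) prediction is reduced to a probability and a true/false label; the function name and the input values are hypothetical, and ties between equal probabilities are not treated specially.

```python
from typing import List, Tuple


def pooled_precision_recall(
    predictions: List[Tuple[float, bool]]
) -> List[Tuple[float, float]]:
    """Return (recall, precision) points for pooled predictions.

    Each input pair is (predicted probability, whether the label is correct)
    for one sample and cell type combination.
    """
    # Order all predictions from most to least confident.
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    total_positive = sum(1 for _, is_true in ranked if is_true)
    if total_positive == 0:
        raise ValueError("at least one positive label is required")

    curve = []
    true_positives = 0
    for rank, (_, is_true) in enumerate(ranked, start=1):
        if is_true:
            true_positives += 1
        recall = true_positives / total_positive
        precision = true_positives / rank
        curve.append((recall, precision))
    return curve


# Invented example: three pooled (probability, is_correct) pairs.
for recall, precision in pooled_precision_recall([(0.7, True), (0.9, True), (0.8, False)]):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```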
Figure 4
Results on non-droplet-based single-cell data. CellO's performance on the 4,936 non-droplet-based cells considering only cells whose cell types are present in the bulk RNA-seq training set. We compare the distributions of (A) F1-score, (B) precision, and (C) average precision across all such cell types. (D) The subgraph spanning the non-droplet-based cells where each cell type is colored according to CellO's (IR) F1-score (top) as well as by average precision (bottom).
Figure 5
Comparison of CellO to existing approaches on non-droplet-based single-cell data. Evaluating CellO, SingleR, and scMatch on the non-droplet-based cells. (A) The fraction of cell types in the single-cell test data set that are also present in each method's training set. IR and CLR are not shown separately because they share the same training set. We evaluate SingleR's built-in reference sets from the Human Primary Cell Atlas (HPCA) and Blueprint + ENCODE (BE). (B) The distribution of both F1-scores (left) and precisions (right) for only those cell types that are in each method's training set. We compare CellO to scMatch, SingleR with the Human Primary Cell Atlas (HPCA), and SingleR with the Blueprint + ENCODE reference (BE). Note that each distribution evaluates a different set of cell types, depending on the particular subset of cell types present in each method's training set.
Figure 6
Results on 10x PBMC data. (A and B) The subgraph of the Cell Ontology spanning the 10x PBMC data set from Zheng et al. (2017), with each cell type colored according to CellO's (IR) (A) F1-score and (B) average precision. (C) UMAP plots of the single-cell data set where cells are colored by their true cell type (top), as well as the most specific predicted cell type (i.e., lowest in the ontology) as output by CellO (bottom). (D) Boxplots displaying the distribution of F1-scores across all cell types for IR, CLR, 1NN, scMatch, SingleR with the Human Primary Cell Atlas (HPCA), SingleR with the Blueprint + ENCODE reference (BE), and SingleR with the Monaco et al. reference (M).
Figure 7
Examination of CellO's performance on difficult data sets. (A) UMAP plots of all healthy cells in Segerstolpe et al. (2016), including cells whose specific cell types are not present in CellO's bulk RNA-seq training set. Cells are colored according to their true cell type (left) and (IR) predicted cell type (right). Highlighted are CellO's predictions made on pancreatic acinar cells (top right ovals), as well as a subset of uncharacterized pancreatic epithelial cells predicted as A cells (center ovals). (B) UMAP plots of human, embryonic neural cells from La Manno et al. (2016). Cells are colored according to their true cell type (left) and predicted cell type (right). Highlighted are CellO's predictions made on both the microglial and glial cells. Note that CellO annotates these cells using terms that are higher in the ontology's graph than their true terms.
Figure 8
Examination of CellO on diseased cells. (A) UMAP plots of lung adenocarcinoma tumor LX675 from Laughney et al. (2020) colored by CellO's output using IR and a Leiden resolution parameter of 1.0 (left) and by the original cell type labels provided by the authors (right). We highlight four subpopulations comprising putative CD1C+ myeloid dendritic cells (top left), endothelial cells (top right), plasma cells (bottom left), and mast cells (bottom right). (B) The legend for coloring cells in (A). (C) UMAP plots of cells colored by their expression, in units of log(TPM+1), of CD1C, a marker for CD1C+ myeloid dendritic cells; PECAM1, a marker for endothelial cells; SDC1, a marker for plasma cells; and KIT, a marker for mast cells.
Figure 9
The CellO Viewer. Screenshots of the CellO Viewer web application for enabling the exploration of cell-type-specific expression signatures across the Cell Ontology. (A) Comparing the top ten genes between CD4+ T cells and CD8+ T cells (red nodes in the Graph View) ranked by the magnitude of their coefficients in their corresponding models. Genes that are shared between the two lists are highlighted with the same color. The CellO Viewer displays genes whose expressions are both positively correlated (green) and negatively correlated (red) with the selected cell types. (B) A screenshot of the gene-centric mode of the CellO Viewer with GFAP, an astrocyte marker, selected. For a given selected gene, the CellO Viewer will display the cell types within the DAG (top) and in list form (bottom) for which the selected gene appears within the top ten genes ranked by each model's coefficients.
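The ranking described in panel (A) amounts to sorting a linear model's genes by the magnitude of their coefficients. The sketch below shows one way to do that and to find genes shared between two ranked lists. It is illustrative only and not the CellO Viewer's code: the function names, placeholder gene names, and coefficient values are all invented.

```python
from typing import Dict, List, Tuple


def top_genes(coefficients: Dict[str, float], k: int = 10) -> List[Tuple[str, str]]:
    """Rank genes by absolute coefficient and report the sign of each.

    The sign mirrors the caption's positive (green) versus negative (red)
    association with the selected cell type.
    """
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
    return [(gene, "positive" if weight > 0 else "negative") for gene, weight in ranked[:k]]


def shared_genes(first: List[Tuple[str, str]], second: List[Tuple[str, str]]) -> List[str]:
    """Return genes that appear in both ranked lists, in the order of `first`."""
    in_second = {gene for gene, _ in second}
    return [gene for gene, _ in first if gene in in_second]


# Invented coefficients for two hypothetical cell type models.
model_a = {"GENE_A": 2.0, "GENE_B": -1.5, "GENE_C": 0.1, "GENE_D": 0.7}
model_b = {"GENE_B": 1.8, "GENE_E": -2.2, "GENE_A": 0.05, "GENE_D": -0.9}

top_a = top_genes(model_a, k=3)
top_b = top_genes(model_b, k=3)
print(top_a)
print(top_b)
print(shared_genes(top_a, top_b))
```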

References

    1. Abdelaal T., Michielsen L., Cats D., Hoogduin D., Mei H., Reinders M.J.T., Mahfouz A. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 2019;20:194.
    2. Alquicira-Hernandez J., Sathe A., Ji H.P., Nguyen Q., Powell J.E. scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 2019;20:264.
    3. Aran D., Hu Z., Butte A.J. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 2017;18:220.
    4. Aran D., Looney A.P., Liu L., Wu E., Fong V., Hsu A., Chak S., Naikawadi R.P., Wolters P.J., Abate A.R. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 2019;20:163–172.
    5. Arendt D., Musser J.M., Baker C.V.H., Bergman A., Cepko C., Erwin D.H., Pavlicev M., Schlosser G., Widder S., Laubichler M.D. The origin and evolution of cell types. Nat. Rev. Genet. 2016;17:744–757.
