Hum Mol Genet. 2018 May 1;27(R1):R40-R47.
doi: 10.1093/hmg/ddy100.

Cell type discovery using single-cell transcriptomics: implications for ontological representation

Brian D Aevermann et al. Hum Mol Genet. 2018.

Abstract

Cells are the fundamental functional units of multicellular organisms, with different cell types playing distinct physiological roles in the body. The recent advent of single-cell transcriptional profiling using RNA sequencing is producing 'big data', enabling the identification of novel human cell types at an unprecedented rate. In this review, we summarize recent work characterizing cell types in the human central nervous and immune systems using single-cell and single-nuclei RNA sequencing, and discuss the implications that these discoveries are having for the representation of cell types in the reference Cell Ontology (CL). We propose a method, based on random forest machine learning, for identifying sets of necessary and sufficient marker genes, which can be used to assemble consistent and reproducible cell type definitions for incorporation into the CL. Representing defined cell type classes and their relationships in the CL using this strategy will make the cell type classes being identified by high-throughput/high-content technologies findable, accessible, interoperable and reusable (FAIR), allowing the CL to serve as a reference knowledgebase of information about the role that distinct cellular phenotypes play in human health and disease.

Figures

Figure 1.
Identification of necessary and sufficient marker genes using NSforest. (A) A typical single-cell/single-nuclei RNA sequencing workflow, in which a tissue specimen is obtained, single cells/nuclei are isolated by fluorescence-activated cell sorting, amplified cDNA is sequenced and cell types are identified by clustering the resultant transcriptional profiles. (B) The NSforest approach takes a data matrix of expression values (e.g. transcripts per million reads) of genes (rows) in single cell/nuclei samples (columns) grouped by cell type cluster membership. In the first step, the expression levels of genes are used as features in a random forest machine learning procedure that trains classification models comparing single cell/nuclei expression data in one cell type cluster against the data in all other clusters, for every cell type cluster separately, using a random forest learner such as that of KNIME v3.1.2. Each cell type cluster classification model is constructed from a collection of trees (e.g. 1000 trees) using information gain ratio as the splitting criterion, where each decision tree is generated using specific bagging parameters (e.g. a feature sample equal in size to the square root of the number of features and a bootstrap of samples equal to the training set size). For each cell type cluster classification model, the method outputs usage statistics, including how often each gene is used as a branching criterion and the number of times it was a candidate across all random decision trees. By summing, over the first three branching levels, how often each gene was used when it was available as a candidate feature, the genes can be ranked by their usefulness in distinguishing one cell type cluster from the other clusters. In the second step, single decision trees are constructed using the first gene from the ranked list, the first two genes, the first three genes, etc. Each individual tree is then assessed for classification accuracy and tree topology using the training data.
Given the objective of determining the necessary and sufficient marker genes, we apply additional criteria in scoring the trees: we restrict each gene to being used in only one branch per tree, and find the optimal classification for the target cluster only, rather than the overall classification score. The addition of genes from the ranked list is stopped when an optimal classification or stable tree topology is achieved. The minimum number of genes used to produce this optimal result corresponds to the set of necessary and sufficient marker genes required to define the cell type cluster.
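The incremental selection described in this caption can be illustrated with a minimal Python sketch. It is not the NSforest implementation, and it rests on several assumptions. The random forest ranking is replaced by a much simpler stand-in: each gene is ranked by the information gain ratio of its best single threshold split. Each "tree" is reduced to a chain of one threshold test per gene. The F1 score for the target cluster is an assumed stand-in for "optimal classification for the target cluster only". All function names and the gene names GENE_A to GENE_C are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of booleans (True = target cluster)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def gain_ratio(values, labels, threshold):
    """Information gain divided by split information for `value > threshold`."""
    above = [l for v, l in zip(values, labels) if v > threshold]
    below = [l for v, l in zip(values, labels) if v <= threshold]
    if not above or not below:
        return 0.0
    w = len(above) / len(labels)
    gain = entropy(labels) - w * entropy(above) - (1 - w) * entropy(below)
    split_info = -(w * math.log2(w) + (1 - w) * math.log2(1 - w))
    return gain / split_info

def best_split(values, labels):
    """Return (gain ratio, threshold) of the best midpoint threshold."""
    uniq = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    scored = [(gain_ratio(values, labels, t), t) for t in candidates]
    return max(scored, default=(0.0, uniq[0]))

def f1_for_target(pred, labels):
    """F1 score computed for the target cluster only (assumed metric)."""
    tp = sum(p and l for p, l in zip(pred, labels))
    fp = sum(p and not l for p, l in zip(pred, labels))
    fn = sum(l and not p for p, l in zip(pred, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def select_markers(expr, labels, max_genes=5):
    """expr maps gene -> expression per cell; labels marks target-cluster cells.

    Genes are added in ranked order, each used in exactly one test, until the
    target-cluster F1 score stops improving.
    """
    splits = {g: best_split(v, labels) for g, v in expr.items()}
    # Stand-in ranking (assumption): best single-split gain ratio, not forest usage.
    ranked = sorted(expr, key=lambda g: splits[g][0], reverse=True)
    rules, best_rules, best_score = [], [], 0.0
    for gene in ranked[:max_genes]:
        thr = splits[gene][1]
        above = [v > thr for v in expr[gene]]
        target_above = sum(a and l for a, l in zip(above, labels))
        high = 2 * target_above >= sum(labels)  # is the target mostly above thr?
        rules.append((gene, thr, high))
        pred = [all((expr[g][i] > t) == h for g, t, h in rules)
                for i in range(len(labels))]
        score = f1_for_target(pred, labels)
        if score <= best_score:
            break  # no further improvement: keep the smaller gene set
        best_score, best_rules = score, list(rules)
    return best_rules, best_score

# Toy data: 8 cells, the first 3 belong to the target cluster.
toy_labels = [True, True, True, False, False, False, False, False]
toy_expr = {
    "GENE_A": [9, 8, 7, 8, 0, 0, 1, 0],
    "GENE_B": [5, 6, 5, 0, 6, 0, 0, 0],
    "GENE_C": [1, 2, 1, 2, 1, 2, 1, 2],
}
print(select_markers(toy_expr, toy_labels))
```

On this toy input neither GENE_A nor GENE_B separates the target cells on its own, but the two together do, so the sketch keeps both and stops before adding GENE_C. A real analysis would use the forest-derived ranking and full decision trees described in the caption.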
Figure 2.
Marker gene expression patterns in single nuclei grouped by cluster. A heatmap of expression levels for the necessary and sufficient marker genes identified for all 16 clusters across all single nuclei grouped by cell type cluster is shown, including 1 excitatory (e1), 11 inhibitory (i1–i11) and 4 glial (g1–g4) cell type clusters. In total, 49 marker genes were selected as being necessary and sufficient to distinguish these 16 different cell type clusters from cortical layer 1/2 of the human middle temporal gyrus (MTG).
Figure 3.
Formal rosehip neuron definition using logical axioms. A set of logical axioms about the anatomical location of the cell body (soma), the functional capacity and the necessary and sufficient marker gene expression is combined to construct an equivalent class cell type definition for the rosehip neuron, interneuron cluster i5 (see for more information about how this cell type was characterized).
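As a rough illustration of what an equivalent class definition implies, the Python sketch below treats a definition as a set of axioms that must all hold: a cell belongs to the class if and only if it satisfies every one. This is an assumed simplification rather than the CL's OWL representation, and the class, field names and placeholder values (REGION_X, CAPABILITY_Y, GENE_A, GENE_B) are hypothetical, not the actual rosehip neuron axioms.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CellTypeDefinition:
    """Hypothetical container for the three kinds of axioms in the caption."""
    name: str
    soma_location: str        # anatomical location of the cell body
    capabilities: frozenset   # functional capacities
    marker_genes: frozenset   # necessary and sufficient marker genes

    def matches(self, cell):
        """True only if the cell satisfies every axiom of the definition."""
        return (
            cell["soma_location"] == self.soma_location
            and self.capabilities <= cell["capabilities"]
            and self.marker_genes <= cell["expressed_genes"]
        )

# Placeholder values only; they do not describe any real cell type.
example_type = CellTypeDefinition(
    name="example cell type",
    soma_location="REGION_X",
    capabilities=frozenset({"CAPABILITY_Y"}),
    marker_genes=frozenset({"GENE_A", "GENE_B"}),
)

cell = {
    "soma_location": "REGION_X",
    "capabilities": {"CAPABILITY_Y"},
    "expressed_genes": {"GENE_A", "GENE_B", "GENE_C"},
}
print(example_type.matches(cell))
```

A cell that lacks any one marker gene, or whose soma lies elsewhere, fails the check, which mirrors the "necessary" half of the definition.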

