Stem Cell Reports. 2023 Jan 10;18(1):113-130.

doi: 10.1016/j.stemcr.2022.10.015. Epub 2022 Nov 17.

A set-theoretic definition of cell types with an algebraic structure on gene regulatory networks and application in annotation of RNA-seq data

Yuji Okano¹, Yoshitaka Kase¹, Hideyuki Okano²

Affiliations

¹ Department of Physiology, Keio University School of Medicine, 35, Shinanomachi, Shinjuku-ku, Tokyo 160-8582, Japan.
² Department of Physiology, Keio University School of Medicine, 35, Shinanomachi, Shinjuku-ku, Tokyo 160-8582, Japan. Electronic address: hidokano@keio.jp.

PMID: 36400029
PMCID: PMC9859932
DOI: 10.1016/j.stemcr.2022.10.015

Yuji Okano et al. Stem Cell Reports. 2023.


Abstract

The emergence of single-cell RNA sequencing (RNA-seq) has radically changed how cellular diversity is observed. Although annotating RNA-seq data requires properties that are preserved among cells of the same identity, conventional annotation methods have not been able to capture the universal characteristics of a cell type. Analysis of expression levels alone cannot annotate cells accurately, because differences in transcription do not necessarily explain biological characteristics in terms of cellular functions and because the data themselves do not provide the correct mapping between cell types and genes. Hence, in this study, we developed a new representation of cellular identities that can be compared across different datasets while preserving nontrivial biological semantics. To generalize the notion of cell types, we developed a new framework to manage cellular identities in terms of set theory. We provided further insights into cells by introducing mathematical descriptions of cell biology. We also performed experiments that correspond to practical applications in the annotation of RNA-seq data.

Keywords: annotation; cell type; cellular state; mathematical model; scRNA-seq; set theory; transcriptome.

Conflict of interest statement

H.O. is a compensated scientific consultant for San Bio Co., Ltd.; RMiC; and K Pharma, Inc.

Figures

**Figure 1**
The GEN as a universal model of causality for gene expression and the possessed cascades as observed relationships of the genes. (A) Schematic of the GEN. The vertexes denote genes, and the edges denote causality. A latent variable $ε$ for extrinsic factors is present in the center of the illustration. (B) $ε$ as the Cartesian product of the exogenous variables for the corresponding vertex in the causal graphical model. (C) Causality blockade altering the relationship between two variables. When the variable y is unobserved, the path between x and z remains open (i.e., x and z are mutually dependent). In contrast, when an observation fixes the value of y, the path between x and z is blocked (i.e., the two variables are mutually independent). (D) Graphical explanation of the GEN and a GRN.
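The blockade described in panel (C) can be checked numerically. The following is a minimal sketch, assuming a binary chain x → y → z (the caption does not state the edge directions) and entirely hypothetical conditional probabilities; it is not the authors' model, only an illustration of how fixing y removes the dependence of z on x.

```python
from itertools import product

# Hypothetical conditional probabilities for an assumed binary chain x -> y -> z.
P_X = {0: 0.5, 1: 0.5}
P_Y_GIVEN_X = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_Z_GIVEN_Y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}


def joint(x, y, z):
    """Joint probability under the chain factorization P(x) P(y|x) P(z|y)."""
    return P_X[x] * P_Y_GIVEN_X[x][y] * P_Z_GIVEN_Y[y][z]


def prob_z1(x, y=None):
    """P(z = 1 | x) when y is None (unobserved), else P(z = 1 | x, y)."""
    ys = (0, 1) if y is None else (y,)
    numerator = sum(joint(x, yy, 1) for yy in ys)
    denominator = sum(joint(x, yy, zz) for yy, zz in product(ys, (0, 1)))
    return numerator / denominator


# With y unobserved, changing x shifts the distribution of z (path open).
unobserved_gap = abs(prob_z1(1) - prob_z1(0))

# With y fixed, changing x no longer affects z (path blocked).
observed_gap = max(abs(prob_z1(1, y) - prob_z1(0, y)) for y in (0, 1))
```

Under these assumed numbers, `unobserved_gap` is clearly nonzero while `observed_gap` vanishes up to floating-point error, which matches the dependent and independent cases in the caption.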

**Figure 2**
Mathematical operations required to simplify the GRN. (A) A conceptual illustration of excluding genes from the GRN. (B) Graphs of the genes of interest are induced subgraphs of the graph featuring all genes. (C) An operation to remove genes from GRNs, which is equivalent to treating the genes as exogenous variables. (D) Cascades and vertexes treated as exogenous variables. Shaded areas are neglected in the simplified GRN. (E) Technical difficulties of data-driven causal exploration. (F) Effect of conversion from directed to undirected edges. (G) Comparison of cellular identities using undirected GRNs.

**Figure 3**
Topological spaces of cellular similarity. (A) Standard workflow of scRNA-seq analysis. (B) Using the same metric to compare single cells and cell classes, which is nontrivial. (C) Morphism that maps single cells to the corresponding GRNs. (D) Structural similarity of single cells’ GRNs, which is equivalent to the Hamming distance for strings of 0 and 1. (E–G) Quotient spaces of cell classes. The shaded areas in $S$ are cell classes, and those in $P(C(F))$ are images of $φ$ pertaining to the cell classes. (H) Graphical explanation of the statement regarding the choice of representative GRNs from a cell class. (I) The GRN of a cell class, approximately regarded as the intersection GRN for all single cells in the cell class. (J) The inclusion relations of eigen-cascades, indicating the structural similarity of GRNs in the case of cell classes.
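Panels (D) and (I) can be illustrated with a small sketch. It assumes each undirected GRN is encoded as a 0/1 indicator over all unordered gene pairs, so that structural dissimilarity becomes a Hamming distance, and it takes the class-level GRN to be the set of edges shared by every cell. The gene names, edge lists, and function names are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def edge_vector(genes, edges):
    """Encode an undirected GRN as a 0/1 tuple over all unordered gene pairs."""
    present = {frozenset(edge) for edge in edges}
    return tuple(int(frozenset(pair) in present) for pair in combinations(genes, 2))


def hamming(u, v):
    """Count the positions at which two equal-length 0/1 vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def intersection_grn(edge_lists):
    """Return the edges present in every GRN of a cell class."""
    edge_sets = [{frozenset(edge) for edge in edges} for edges in edge_lists]
    return set.intersection(*edge_sets) if edge_sets else set()


# Hypothetical toy data: three genes and the GRNs of two single cells.
genes = ["g1", "g2", "g3"]
cell_a = [("g1", "g2"), ("g2", "g3")]
cell_b = [("g1", "g2"), ("g1", "g3")]

vec_a = edge_vector(genes, cell_a)
vec_b = edge_vector(genes, cell_b)
distance = hamming(vec_a, vec_b)
shared = intersection_grn([cell_a, cell_b])
```

Storing edges as `frozenset` pairs makes the comparison independent of edge direction, which is in keeping with the undirected GRNs described for Figure 2.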

**Figure 4**
Overview of analyses used in this research. (A) Workflow of GRN-based cell-class annotation, using FA and an ML model. (B) Standard workflow of the conventional annotation using DEGs. (C) Overview of data splitting. (D) Schematics of the three GRN-based annotation methods: inference, labeling, and estimation.

**Figure 5**
A GBDT model to identify features with which to classify cell types in m1_10x. (A) A scatterplot of the uniform manifold approximation and projection (UMAP) manifold on expression patterns of the 90 marker genes in m1_10x. The marker colors denote cell types given in the metadata. (B) Sample ratio transition during resampling and data splitting. (C and D) SD among the group-wise mean values of gene expression. (E) Subclass information given in the metadata of m1_10x. (F) Top 10 genes (in terms of the median of FI) during 5-fold hyperparameter tuning. (G and H) ROC curve and PR curve of the GBDT model with the top 1,000 genes (in terms of SD among the group-wise means). AUC, AP, and the macro or micro averages thereof were also calculated. (I and J) The evaluation metrics for the GBDT model with *GAD1* and *GRIP1*.

**Figure 6**
and k-means in GSE165388. (A–D) The parallel analyses used to determine the number of factors in gw9–gw12. (E–H) Heatmaps of factor loadings after elimination of factors with maximum loadings smaller than 0.5. Quartimin rotation was performed for all models. (I–L) Silhouette plots for optimal k values. (M–P) Results of k-means clustering in UMAP.

**Figure 7**
GRN-based annotation in GSE165388. (A) GRNs of cell types in m1_10x to be used in labeling (as the referential data). (B–E) GRNs of all clusters in gw9–gw12. All vertexes and edges used in labeling or estimation are indicated. (F–H) “Planet plots” showing the labeling results. The referential cell classes are shown in the center, and the circles’ radii denote values of $d^{*}$. The cell classes on the innermost circles are considered to be most similar to those in the center. (I) Comparison of DEG- and GRN-based annotation methods.

Cited by

Multi-organ frailty is enhanced by periodontitis-induced inflammaging.
Kase Y, Morikawa S, Okano Y, Hosoi T, Yasui T, Taki-Miyashita Y, Yakabe M, Goto M, Ishihara K, Ogawa S, Nakagawa T, Okano H. Inflamm Regen. 2025 Feb 3;45(1):3. doi: 10.1186/s41232-025-00366-5. PMID: 39894806. Free PMC article.

