Nucleic Acids Res. 2022 Jan 11;50(1):46-56.
doi: 10.1093/nar/gkab1132.

MarkovHC: Markov hierarchical clustering for the topological structure of high-dimensional single-cell omics data with transition pathway and critical point detection

Zhenyi Wang et al. Nucleic Acids Res. 2022.

Abstract

Clustering cells and depicting the lineage relationships among cell subpopulations are fundamental tasks in single-cell omics studies. However, existing analytical methods face challenges in stratifying cells, tracking cellular trajectories, and identifying critical points of cell transitions. To overcome these challenges, we proposed a novel Markov hierarchical clustering algorithm (MarkovHC), a topological clustering method that leverages the metastability of exponentially perturbed Markov chains to systematically reconstruct the cellular landscape. Briefly, MarkovHC starts with local connectivity and density derived from the input and outputs a hierarchical structure for the data. We first benchmarked MarkovHC on five simulated datasets and ten public single-cell datasets with known labels. Then, we used MarkovHC to investigate the multi-level architectures and transition processes during human embryo preimplantation development and gastric cancer progression. MarkovHC found heterogeneous cell states and sub-cell types in lineage-specific progenitor cells and revealed the most probable transition paths and critical points in the cellular processes. These results demonstrated MarkovHC's effectiveness in facilitating the stratification of cells, identification of cell populations, and characterization of cellular trajectories and critical points.

Figures

Figure 1.
Overview of MarkovHC. (A) MarkovHC simultaneously performs hierarchical clustering, transition path tracking, and critical point detection. (B) The intuitive idea behind MarkovHC. (C) The workflow of MarkovHC: (1) The input is a gene-by-cell matrix. (2) We calculate sNN (shared nearest neighbour) similarities among cells to obtain the cell-by-cell similarity matrix. We then construct a cellular network from the similarity matrix and calculate each cell's degree (D score) in the network. (3) The Markov transition matrix is calculated from the similarity matrix and the D scores. (4) The pseudo-energy matrix is calculated from the Markov transition matrix. (5) The hierarchical structure is constructed from the attractors, basins, and critical points on each level.
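Steps (2) to (4) of the workflow in panel (C) can be illustrated with a small pure-Python sketch. It is not the MarkovHC implementation: the caption does not give the formulas, so the Jaccard form of the sNN similarity, the edge weight (similarity times the destination cell's D score), and the pseudo-energy taken as the negative log of the transition probability are all assumptions made for illustration, as are the function names.

```python
import math

def knn_sets(points, k):
    """For each point, the set holding itself and its k nearest neighbours (Euclidean)."""
    result = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append({i} | {j for _, j in ranked[:k]})
    return result

def snn_similarity(points, k):
    """Step 2 (assumed form): Jaccard overlap of the neighbour sets of two cells."""
    nbrs = knn_sets(points, k)
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = len(nbrs[i] & nbrs[j]) / len(nbrs[i] | nbrs[j])
    return sim

def degree_scores(sim):
    """Step 2: weighted degree of each cell in the similarity network (D score)."""
    return [sum(row) for row in sim]

def transition_matrix(sim, deg):
    """Step 3 (assumed form): weight i -> j by sim[i][j] * deg[j], then row-normalize."""
    n = len(sim)
    P = []
    for i in range(n):
        weights = [sim[i][j] * deg[j] for j in range(n)]
        total = sum(weights)
        if total == 0.0:
            # Isolated cell: keep it in place with a self-loop.
            row = [1.0 if j == i else 0.0 for j in range(n)]
        else:
            row = [w / total for w in weights]
        P.append(row)
    return P

def pseudo_energy(P):
    """Step 4 (assumed form): -log of the transition probability, inf if impossible."""
    return [[-math.log(p) if p > 0.0 else math.inf for p in row] for row in P]

# Toy data: two well-separated groups of three "cells" in two dimensions.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
S = snn_similarity(cells, k=2)
D = degree_scores(S)
P = transition_matrix(S, D)
E = pseudo_energy(P)
```

In this toy example, cells in different groups share no neighbours, so their similarity is zero and the pseudo-energy between them is infinite, while transitions inside a group are cheap. That is the intuition behind basins separated by energy barriers; the published method defines these quantities in its own way.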
Figure 2.
MarkovHC stratifies and clusters cells in agreement with known identities. (A) The 2-dimensional basins and attractors (red) found by MarkovHC are consistent with the topology. (B) The hierarchy of the basins in (A) from Lv.24 to Lv.27. The sizes of the basins represent the number of samples, and the colors indicate different basins. (C) Data of 1000 cells × 5000 genes were projected into 3-dimensional space by principal component analysis. Three basins were clustered (brown, green, and blue; the purple points are critical points; the yellow arrow shows the path from basin 1 to basin 3). (D) scRNA-Seq data (40) of 1018 human ES cell-derived lineage-specific progenitors were projected into 2-dimensional space by phateR. (E) Basins from Lv.5 to Lv.8 reveal known cell types and sub-basins in the neuronal progenitor cells and H1/H9 ES cells. (F) From the bottom to the top, levels of the hierarchical structure correspond to cell types, cell states, and cell lineages. (G) ARI (Adjusted Rand Index) and NMI (Normalized Mutual Information) show that MarkovHC performed as well as or better than the compared methods in clustering.
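The two scores in panel (G) compare a predicted clustering with known labels. The following is a minimal pure-Python sketch of both, included only to make the definitions concrete; the function names are illustrative, and the NMI here normalizes by the arithmetic mean of the two entropies, which is one common convention and may differ from the one used in the paper.

```python
import math
from collections import Counter

def adjusted_rand_index(true_labels, pred_labels):
    """ARI: pair-counting agreement between two labelings, corrected for chance."""
    n = len(true_labels)
    pairs = lambda m: m * (m - 1) // 2
    contingency = Counter(zip(true_labels, pred_labels))
    index = sum(pairs(c) for c in contingency.values())
    a = sum(pairs(c) for c in Counter(true_labels).values())
    b = sum(pairs(c) for c in Counter(pred_labels).values())
    expected = a * b / pairs(n) if n > 1 else 0.0
    maximum = (a + b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put every cell in one cluster.
        return 1.0
    return (index - expected) / (maximum - expected)

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(true_labels, pred_labels):
    """NMI: mutual information divided by the mean of the two entropies."""
    n = len(true_labels)
    count_t = Counter(true_labels)
    count_p = Counter(pred_labels)
    mi = 0.0
    for (t, p), c in Counter(zip(true_labels, pred_labels)).items():
        mi += (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
    mean_h = (entropy(true_labels) + entropy(pred_labels)) / 2
    if mean_h == 0.0:
        return 1.0
    return mi / mean_h

# Cluster names do not matter: a relabeled but identical partition scores 1.
same_ari = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
same_nmi = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that cuts across the true one scores at or below chance level.
cross_ari = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
cross_nmi = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Both scores are invariant to how clusters are named, which is why they are suitable for comparing clustering output against known cell identities.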
Figure 3.
MarkovHC revealed transition paths and critical points in human preimplantation embryo development. (A) The scRNA-Seq data (63) of 1529 human preimplantation embryo cells from the E3 stage to the E7 stage were projected into 3-dimensional space by phateR. E3–E7 indicate the embryonic day. E4.late and E5.early indicate cells picked 4–6 hours later than the E4 stage and 4–6 hours earlier than the E5 stage, respectively. (B) Ten basins on Lv.12 correspond to ten asynchronous development stages in human preimplantation embryo cells. (C) The cellular hierarchy from Lv.12 to Lv.20. (D) The heatmap of the top 50 DEGs and enriched GO terms per basin. (E) Four main cell types with sub-populations, namely 8-cell embryo, morula cell, ICM (inner cell mass), and TE (trophectoderm), were identified according to marker gene expression. (F) The transition path (yellow arrow) from the 8-cell embryo to ICM was tracked. The yellow points indicate cells along the path, and the purple points indicate the critical points from morula cells to ICM. (G) DEGs along the transition path in (F). (H, I) Important marker genes show increasing and decreasing ‘gene-flow’ trends along the path. Gene expression varies dramatically around critical points (purple points). (J) The inferred development hierarchy is consistent with the ground truth of the development hierarchy (in the lower right corner).
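One generic way to think about a most probable transition path such as the one in panel (F) is as a shortest path in which each step costs the negative log of its transition probability, since maximizing a product of probabilities is the same as minimizing the sum of their negative logs. The sketch below applies Dijkstra's algorithm to a toy transition matrix. It is an illustration of that general idea under this assumption, not the path-tracking procedure of MarkovHC, and the function name and toy matrix are hypothetical.

```python
import heapq
import math

def most_probable_path(P, source, target):
    """Return (path, probability) maximizing the product of step probabilities.

    P is a row-stochastic matrix given as a list of lists. Returns (None, 0.0)
    if the target cannot be reached from the source.
    """
    n = len(P)
    cost = [math.inf] * n      # accumulated -log probability
    prev = [None] * n          # predecessor on the best path found so far
    cost[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        c, u = heapq.heappop(heap)
        if c > cost[u]:
            continue           # stale heap entry
        if u == target:
            break
        for v, p in enumerate(P[u]):
            if v == u or p <= 0.0:
                continue
            new_cost = c - math.log(p)
            if new_cost < cost[v]:
                cost[v] = new_cost
                prev[v] = u
                heapq.heappush(heap, (new_cost, v))
    if math.isinf(cost[target]):
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-cost[target])

# Toy chain: going 0 -> 1 -> 2 (0.9 * 1.0) beats the direct jump 0 -> 2 (0.1).
toy_P = [
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
path, prob = most_probable_path(toy_P, 0, 2)
```

In such a picture, the hardest step along the path (the one with the lowest probability) plays a role loosely analogous to a barrier between basins, which is where the caption places the critical points.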
Figure 4.
MarkovHC detected critical points from MSCs to gastric cancer cells. (A) 831 mesenchymal stem cells (MSCs) and 695 MSC-origin early gastric cancer cells (EGCs) (67) were projected into 2-dimensional space by UMAP. (B) MarkovHC found two basins on Lv.21. (C) Five basins were clustered, and two transition paths from MSCs to EGCs were inferred by MarkovHC. Purple points indicate critical points on the transition paths. (D) The inferred transitions among basins in (C). (E) The heatmap and enriched GO terms of the top 50 DEGs per basin in (C). (F, G) DEGs along Path1 and Path2 in (C). (H, I) OLFM4 and CEACAM6 showed opposite ‘gene-flow’ trends along Path1 (H). SOX4 and NEAT1 showed opposite ‘gene-flow’ trends along Path2 (I). The expression values of these genes changed dramatically around the critical points.

References

    1. Butler A., Hoffman P., Smibert P., Papalexi E., Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 2018; 36:411–420.
    2. Kiselev V.Y., Kirschner K., Schaub M.T., Andrews T., Yiu A., Chandra T., Natarajan K.N., Reik W., Barahona M., Green A.R. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods. 2017; 14:483–486.
    3. Wang B., Zhu J., Pierson E., Ramazzotti D., Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods. 2017; 14:414–416.
    4. Lin P., Troup M., Ho J.W. CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol. 2017; 18:59.
    5. Zurauskiene J., Yau C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics. 2016; 17:140.
