Brief Bioinform. 2023 Mar 19;24(2):bbad042.
doi: 10.1093/bib/bbad042.

A cofunctional grouping-based approach for non-redundant feature gene selection in unannotated single-cell RNA-seq analysis

Tao Deng et al. Brief Bioinform.

Abstract

Feature gene selection has a significant impact on the performance of cell clustering in single-cell RNA sequencing (scRNA-seq) analysis. A well-rounded feature selection (FS) method should consider the relevance, redundancy and complementarity of the features. Yet most existing FS methods focus on the relevance of genes to cell types and neglect redundancy and complementarity, which undermines cell clustering performance. We develop a novel computational method, GeneClust, to select feature genes for scRNA-seq cell clustering. GeneClust groups genes based on their expression profiles, then selects genes with the aim of maximizing relevance, minimizing redundancy and preserving complementarity. It can work as a plug-in FS tool with any existing cell clustering method. Extensive benchmark results demonstrate that GeneClust significantly improves clustering performance. Moreover, GeneClust can group cofunctional genes from the same biological process or pathway into clusters, thus providing a means of investigating gene interactions and identifying genes potentially relevant to the biological characteristics of the dataset. GeneClust is freely available at https://github.com/ToryDeng/scGeneClust.

Keywords: cofunctional genes; feature relevance; feature selection; gene clustering; redundancy and complementarity; single-cell RNA-seq.

Figures

Figure 1
The workflow of the two versions of GeneClust: GeneClust-ps and GeneClust-fast. The workflow includes two modules. (A) Pseudo-labelling of highly confident cells. This module is used only by GeneClust-ps and includes two steps. Step 1: GMM-based consensus clustering. This step outputs a consensus confidence index matrix W. Step 2: clustering and pseudo-labelling of highly confident cells. The Leiden algorithm is used to generate highly confident clusters from a weighted graph constructed on W. Cells from highly confident clusters are considered highly confident cells and are pseudo-labelled with their cluster index. (B) Gene clustering and feature selection. Both GeneClust-ps and GeneClust-fast include this module. In GeneClust-ps, it consists of three steps. Step 1: filtering of low-relevance genes. Step 2: gene clustering. Gene clusters are generated by partitioning the minimum spanning tree of a gene-gene graph whose edge weights equal the redundancy of gene pairs. Step 3: feature selection. The most relevant gene of each gene cluster is selected (triangles with red edges). In GeneClust-fast, it consists of two steps. Step 1: mbkmeans-based gene clustering. Step 2: feature selection. For each gene cluster, a representative gene (nearest to the cluster centroid) and a complementary gene (farthest from the cluster centroid) are selected (triangles with red edges).
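The GeneClust-fast selection step described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the dictionary-based inputs and the use of Euclidean distance are assumptions made for illustration, and the cluster labels are taken as given (in GeneClust-fast they would come from mbkmeans).

```python
import math
from collections import defaultdict


def select_representative_and_complementary(expression, cluster_labels):
    """Pick, per gene cluster, the gene nearest to and farthest from the centroid.

    expression: dict mapping gene name -> list of expression values (one per cell).
    cluster_labels: dict mapping gene name -> cluster id (assumed precomputed).
    Euclidean distance is an assumption; the caption does not name the metric.
    """
    clusters = defaultdict(list)
    for gene, label in cluster_labels.items():
        clusters[label].append(gene)

    selected = []
    for label in sorted(clusters):
        genes = clusters[label]
        n_cells = len(expression[genes[0]])
        # Centroid = per-cell mean over all genes in the cluster.
        centroid = [
            sum(expression[g][j] for g in genes) / len(genes)
            for j in range(n_cells)
        ]
        dist = {g: math.dist(expression[g], centroid) for g in genes}
        representative = min(genes, key=dist.get)  # nearest to the centroid
        complementary = max(genes, key=dist.get)   # farthest from the centroid
        selected.append(representative)
        if complementary != representative:        # single-gene clusters yield one gene
            selected.append(complementary)
    return selected


# Toy data with hypothetical gene names.
toy_expression = {
    "g1": [0.0, 0.0],
    "g2": [1.0, 0.0],
    "g3": [4.0, 0.0],
    "g4": [10.0, 10.0],
}
toy_labels = {"g1": 0, "g2": 0, "g3": 0, "g4": 1}
toy_selected = select_representative_and_complementary(toy_expression, toy_labels)
```

Selecting two genes per cluster means the number of features is driven by the number of gene clusters rather than fixed in advance.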
Figure 2
Comparison of GeneClust with competing FS methods in cell clustering on 12 scRNA-seq datasets, using Seurat V4 as the clustering tool. (A) ARI values of the clustering results for all methods on all datasets. The number of features selected by GeneClust is shown on top of each bar. All other FS methods select 2000 features. (B) Ranking of FS methods based on ARI values. In the heatmap on the left, the y-axis represents methods, the x-axis represents datasets, and a darker colour represents a higher ranking (better performance). The box plot on the right shows the distribution of each FS method's ranking across the 12 datasets.
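For readers unfamiliar with the metric, the following sketch shows how an adjusted Rand index (ARI) can be computed from two label vectors and how methods can then be ranked on one dataset. The function names and the method names in the toy example are hypothetical, and the tie-handling (plain ordinal ranks) is an assumption; the caption does not state how ties were treated.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two partitions of the same cells (pair-counting form)."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def rank_methods(ari_by_method):
    """Rank methods on one dataset; rank 1 is the highest ARI."""
    ordered = sorted(ari_by_method, key=ari_by_method.get, reverse=True)
    return {method: rank for rank, method in enumerate(ordered, start=1)}


# Toy example with hypothetical method names and made-up ARI values.
toy_ranks = rank_methods({"method_a": 0.8, "method_b": 0.5, "method_c": 0.9})
```

ARI is invariant to how cluster labels are named, so a clustering that matches the reference up to relabelling still scores 1.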
Figure 3
GO enrichment and cofunctional analyses of the top relevant gene clusters generated by GeneClust-ps on the PBMC dataset collected from SLE patients (PBMC-ctrl). (A) The 20 most significant GOBPs enriched in the experimental group (genes from top relevant gene clusters) and the control group (randomly selected genes) are represented by circles and triangles, respectively. The x-axis represents the negative logarithm of the adjusted P-values of the GOBPs' enrichment significance. The y-axis represents the AUC values of the 20 GOBPs in the gene cofunction analysis. Blue indicates immune system/SLE-related GOBPs and red indicates immune system/SLE-unrelated GOBPs. (B) The enrichment map of the 20 most significantly enriched GOBPs in the experimental (left-hand side) and control (right-hand side) groups. The connectivity of the network indicates functional associations among GOBPs. The node colour represents the significance (negative logarithm of the adjusted P-value) of the GOBP in the GO enrichment analysis, where a darker colour corresponds to a less significant GOBP. The node size represents the number of genes involved in the GOBP.
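The x-axis quantity in panel (A) can be illustrated with a small sketch. Two things here are assumptions rather than facts from the caption: the multiple-testing correction (Benjamini-Hochberg is used below as a common choice) and the base of the logarithm (base 10). The function names are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def neg_log10(p):
    """Negative base-10 logarithm; larger values mean stronger significance."""
    return -math.log10(p)


# Made-up raw P-values for four hypothetical terms.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
toy_axis_values = [neg_log10(p) for p in toy_adjusted]
```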
Figure 4
KEGG pathway enrichment and cofunction analyses of the top relevant gene clusters generated by GeneClust-ps on the PBMC dataset collected from SLE patients (PBMC-ctrl). (A) The 20 most significant KEGG pathways enriched in the experimental group (genes from top relevant gene clusters) and the control group (randomly selected genes) are represented by circles and triangles, respectively. The x-axis represents the negative logarithm of the adjusted P-values of the pathways' enrichment significance. The y-axis represents the AUC values of the 20 pathways in the gene cofunction analysis. Green indicates SLE-related pathways and yellow indicates SLE-unrelated pathways. (B) The enrichment map of the 20 most significantly enriched pathways in the experimental (left-hand side) and control (right-hand side) groups. The connectivity of the network indicates functional associations among pathways. The node colour represents the significance (negative logarithm of the adjusted P-value) of the pathway in the KEGG enrichment analysis, where a darker colour corresponds to a less significant pathway. The node size represents the number of genes involved in the pathway.
Figure 5
Cell-type-specific expression changes of interferon-β-induced DEGs in the interferon-β-treated PBMC dataset collected from SLE patients (PBMC-stim). Each row represents a DEG in the top relevant gene clusters formed by GeneClust-ps (left figure) or a DEG among randomly selected genes (right figure). Each column represents a PBMC cell type: natural killer cells (NK), CD8+ T cells (Tc), CD4+ T cells (Th), B cells (B), dendritic cells (DC), CD14+CD16+ monocytes (ncMono), CD14+CD16- monocytes (cMono) and megakaryocytes (Mkc). The colours of the squares represent the log2 fold change (log2 FC) of a DEG in a cell type in response to interferon-β stimulation, with red indicating significant upregulation (log2 FC > 1, FDR < 0.05) and blue indicating significant downregulation (log2 FC < −1, FDR < 0.05). DEG names (listed on the right side of the figures) highlighted in purple are involved in well-documented pathways relevant to cell-type-specific responses to interferon-β stimulation. DEG names highlighted in green are not involved in the above pathways but are reported to be interferon-β responsive.
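The colour rule stated in this caption amounts to a simple threshold test, sketched below. The thresholds come from the caption; the function name, the return labels and the treatment of all other cases as "not significant" are assumptions for illustration.

```python
def classify_deg(log2_fc, fdr, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Classify one gene in one cell type using the caption's thresholds.

    Returns "up" (red), "down" (blue) or "not significant" (assumed label
    for everything that does not pass both cutoffs).
    """
    if fdr < fdr_cutoff and log2_fc > fc_cutoff:
        return "up"
    if fdr < fdr_cutoff and log2_fc < -fc_cutoff:
        return "down"
    return "not significant"


# Made-up (log2 FC, FDR) pairs for illustration only.
toy_calls = [classify_deg(fc, q) for fc, q in [(2.3, 0.001), (-1.5, 0.01), (1.8, 0.2), (0.4, 0.001)]]
```

Both inequalities are strict, so a gene with log2 FC exactly 1 or FDR exactly 0.05 is not called significant under this reading.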
