Brief Bioinform. 2023 Sep 22;24(6):bbad370.

doi: 10.1093/bib/bbad370.

Robust discovery of gene regulatory networks from single-cell gene expression data by Causal Inference Using Composition of Transactions

Abbas Shojaee¹, Shao-Shan Carol Huang¹

PMID: 37897702
PMCID: PMC10612495
DOI: 10.1093/bib/bbad370

Abbas Shojaee et al. Brief Bioinform. 2023.

Affiliation

¹ Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY 10003, USA.

Abstract

Gene regulatory networks (GRNs) drive organism structure and functions, so the discovery and characterization of GRNs is a major goal in biological research. However, accurate identification of causal regulatory connections and inference of GRNs using gene expression datasets, more recently from single-cell RNA-seq (scRNA-seq), has been challenging. Here we employ the innovative method of Causal Inference Using Composition of Transactions (CICT) to uncover GRNs from scRNA-seq data. The basis of CICT is that if all gene expressions were random, a non-random regulatory gene should induce its targets at levels different from the background random process, resulting in distinct patterns in the whole relevance network of gene-gene associations. CICT proposes novel network features derived from a relevance network, which enable any machine learning algorithm to predict causal regulatory edges and infer GRNs. We evaluated CICT using simulated and experimental scRNA-seq data in a well-established benchmarking pipeline and showed that CICT outperformed existing network inference methods representing diverse approaches with many-fold higher accuracy. Furthermore, we demonstrated that GRN inference with CICT was robust to different levels of sparsity in scRNA-seq data, the characteristics of data and ground truth, the choice of association measure and the complexity of the supervised machine learning algorithm. Our results suggest aiming at directly predicting causality to recover regulatory relationships in complex biological networks substantially improves accuracy in GRN inference.

Keywords: causal inference; gene regulatory network; network inference; single-cell RNA sequencing; systems inference.

Figures

**Figure 1**
Constructing CICT features. (A) A gene expression matrix is converted to a gene–gene association matrix. (B) CICT calculates confidence and contribution values for each edge. (C) CICT constructs the confidence and contribution zones for the source node and target node of an edge. (D) Mean and median are shown as examples of F1 functions on the empirical distribution of the contribution zone of the source node. (E) CICT computes characteristics of the source and target distribution zones as edge features. F0 features are the Z-score of the source–target edge in each of the distribution zones. F1 features are calculated from distributions surrounding the source and target nodes of each edge. For each new distributional feature from F0 and F1 operations, CICT calculates the Z-score of each edge across all edges to make F2 features. All edge features built from F0, F1 and F2 operations are concatenated to form a final feature matrix for all gene pairs.
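The feature construction in this caption can be illustrated with a small sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: absolute Pearson correlation stands in for the association measure, confidence and contribution are taken to be row-normalized and column-normalized edge weights (the caption does not define them), and the function names `safe_z` and `cict_like_features` are hypothetical.

```python
import numpy as np

def safe_z(value, zone):
    """Z-score of `value` against the empirical distribution `zone`."""
    sd = zone.std()
    return (value - zone.mean()) / sd if sd > 0 else 0.0

def cict_like_features(expr):
    """expr: genes x cells matrix. Returns (edge list, edge feature matrix)."""
    n = expr.shape[0]
    # (A) gene-gene association matrix (assumed: absolute Pearson correlation)
    assoc = np.abs(np.corrcoef(expr))
    np.fill_diagonal(assoc, 0.0)
    # (B) assumed definitions: confidence is the edge's share of the source's
    #     total weight, contribution is its share of the target's total weight
    conf = assoc / assoc.sum(axis=1, keepdims=True)
    contrib = assoc / assoc.sum(axis=0, keepdims=True)

    edges, rows = [], []
    for s in range(n):
        for t in range(n):
            if s == t:
                continue
            # (C) zones: value distributions around the source and target nodes
            src_zone = np.delete(contrib[s, :], s)
            tgt_zone = np.delete(conf[:, t], t)
            # F0: Z-score of this edge inside each zone
            f0 = [safe_z(contrib[s, t], src_zone), safe_z(conf[s, t], tgt_zone)]
            # (D) F1: summary functions of each zone (mean and median here)
            f1 = [src_zone.mean(), np.median(src_zone),
                  tgt_zone.mean(), np.median(tgt_zone)]
            edges.append((s, t))
            rows.append(f0 + f1)

    base = np.array(rows)
    # (E) F2: Z-score of every F0/F1 feature across all edges
    sd = base.std(axis=0)
    sd[sd == 0] = 1.0
    f2 = (base - base.mean(axis=0)) / sd
    # Concatenate F0, F1 and F2 into one feature matrix for all gene pairs
    return edges, np.hstack([base, f2])

# Toy input: 6 genes, 200 cells of simulated counts (not real data)
rng = np.random.default_rng(0)
expr = rng.poisson(5.0, size=(6, 200)).astype(float)
edges, features = cict_like_features(expr)
```

The resulting matrix has one row per ordered gene pair, so it could be passed to any supervised classifier together with edge labels from a ground-truth network.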

**Figure 2**
Pipeline of benchmarking CICT and metrics of performance evaluation. (A) Benchmarking steps using SERGIO simulated scRNA-seq data for results presented in Figure 3. (B) Benchmarking steps using experimental scRNA-seq datasets. For each dataset, a small subset of ground truth was used to create the learning set, which was then used to train and test the CICT supervised model (steps in the orange path on the left). The trained models were applied to predict all edges (steps in the red path in the middle), producing CICT results for performance evaluation in Figure 4 and time and memory requirements in Table S6. Bootstrapping of up to 50 000 unseen samples was used for sensitivity analysis (steps in the dark blue path on the right) reported in Figure 5. (C) Contingency matrix and the definition of precision, recall and specificity. In imbalanced data scenarios with a large true negative class, precision is more informative than specificity for evaluating binary classifiers [78]. (D) Definition of pAUPR as an integral of precision values over a range of recall values up to a specified recall threshold. (E) Example of a precision-recall curve of a CICT predicted network. The highlighted area shows the pAUPR for up to 20% recall, which usually captures the highest precision of the classifier.
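The metrics in panels C and D can be sketched as follows. This is a minimal illustration rather than the benchmarking pipeline's code: the function names are hypothetical, the integral is approximated with the trapezoid rule, and the area is left unnormalized, which is an assumption because the caption does not say whether pAUPR is rescaled by the recall threshold.

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Precision, recall and specificity from binary labels and predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, recall, specificity

def partial_aupr(y_true, scores, max_recall=0.2):
    """Area under the precision-recall curve for recall <= max_recall."""
    y_true = np.asarray(y_true, dtype=float)
    # Rank edges from highest to lowest predicted score
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.cumsum(y_true[order])
    precision = hits / np.arange(1, len(hits) + 1)
    recall = hits / y_true.sum()
    keep = recall <= max_recall
    # Start the curve at recall 0 with the precision of the top-ranked edge
    r = np.concatenate([[0.0], recall[keep]])
    p = np.concatenate([[precision[0]], precision[keep]])
    # Trapezoid rule over the retained part of the curve
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2))

# Imbalanced toy case: 10 true edges among 1000 candidates,
# 50 predicted positives of which 5 are correct
y_true = np.zeros(1000, dtype=bool)
y_true[:10] = True
y_pred = np.zeros(1000, dtype=bool)
y_pred[:5] = True
y_pred[10:55] = True
precision, recall, specificity = confusion_metrics(y_true, y_pred)
```

In this toy case specificity stays above 0.95 while precision is only 0.1, which illustrates why the caption prefers precision when the true negative class is large.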

**Figure 3**
Benchmarking CICT and GRN inference methods with SERGIO simulated scRNA-seq datasets. The four horizontal sections (I, II, III, IV) indicate the number of cell types and differentiation trajectories simulated. In each section, each row of the boxplots and dot plots shows the results of one benchmarked method on 15 simulated expression data instances. For each dataset, the rpAUPR of all edges (A) and the rEPR of TF outgoing edges (B) are shown for the networks inferred from complete datasets or from datasets with dropouts. See Figure S1 and Table S3 for complete results.

**Figure 4**
Benchmarking CICT and GRN inference methods with experimental scRNA-seq datasets. The networks inferred by each algorithm were evaluated on the available ground-truth networks, and the performance measures were compared to those of a random classifier to calculate rpAUPR of all potential edges (A) and rEPR of TF outgoing edges (B). The three sections, from left to right, correspond to the three levels of evaluation created by filtering the input datasets. Each row in the heatmaps corresponds to one scRNA-seq dataset and the specified type of ground truth. The four columns to the left of the heatmaps report network and expression data statistics. The next nine columns report the ratio of performance measures to the random classifier for the three supervised methods (CICT, DEEPDRIM and Inferelator-Prior) and the best performing unsupervised methods in alphabetical order. See Figure S3 and Table S3 for complete results.
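The idea of reporting a measure as a ratio to a random classifier can be sketched as below. The caption does not give the exact definitions, so this example makes two labeled assumptions: early precision is computed over the top k ranked edges with k equal to the number of true edges, and the random classifier's expected precision is taken to be the density of true edges among all candidates. The function names are hypothetical.

```python
import numpy as np

def early_precision(y_true, scores):
    """Fraction of true edges among the top-k scores, k = number of true edges."""
    y_true = np.asarray(y_true, dtype=bool)
    k = int(y_true.sum())
    if k == 0:
        raise ValueError("ground truth contains no positive edges")
    top = np.argsort(-np.asarray(scores, dtype=float), kind="stable")[:k]
    return float(y_true[top].mean())

def early_precision_ratio(y_true, scores):
    """Early precision divided by the expected precision of random ranking."""
    y_true = np.asarray(y_true, dtype=bool)
    density = y_true.mean()  # assumed baseline: share of true edges overall
    return early_precision(y_true, scores) / density

# Toy ranking: 100 candidate edges, 10 true, 5 of them ranked in the top 10
y_true = np.zeros(100, dtype=bool)
y_true[:5] = True        # true edges with high scores
y_true[50:55] = True     # true edges with low scores
scores = np.arange(100, 0, -1)
ratio = early_precision_ratio(y_true, scores)
```

A ratio of 1 corresponds to random ranking under this baseline, and larger values indicate that true edges are concentrated at the top of the ranked list.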

**Figure 5**
Analysis of CICT sensitivity to preprocessing, modeling parameters and dataset characteristics using L2 datasets. Each panel shows a forest plot of the mean and 95% confidence interval (CI) of standardized pAUPR in the first two columns and a forest plot of the mean and 95% CI of rpAUPR in the third and fourth columns. The results reported were from a sample of up to 50 000 unseen edges that were not used in training or validation (see Methods and Figure 2B). (A) The effect of the number of true positive edges in the learning set sampled from the cell-type-specific ChIP-seq ground truth in L2 benchmarking datasets (L2_cs). (B) Comparison of directed versus undirected ground truth for L2_cs. (C) Comparison between types of ground truths. (D) The effect of using different gene-gene association measures across L2_cs datasets with 250 directed edges for training. (E) Assessment of the stability of CICT results in three runs of 50 independent random samples of 250 true positive edges for training and up to 50 000 random edges for testing in L2_cs datasets. (F) Aggregated results of benchmarked GRN inference methods on L2_cs.
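Each point in such a forest plot summarizes repeated measurements as a mean with a 95% CI. A minimal sketch of one common way to compute this is shown here, assuming a normal approximation with the standard error of the mean; the caption does not state which CI procedure was used, and the function name and input values are hypothetical.

```python
import numpy as np

def mean_ci95(values):
    """Mean and normal-approximation 95% confidence interval of the mean."""
    v = np.asarray(values, dtype=float)
    m = v.mean()
    # Standard error of the mean from the sample standard deviation
    se = v.std(ddof=1) / np.sqrt(len(v))
    half_width = 1.96 * se
    return m, m - half_width, m + half_width

# Hypothetical metric values from five repeated runs (not results from the paper)
m, lower, upper = mean_ci95([1.0, 2.0, 3.0, 4.0, 5.0])
```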

**Figure 6**
Characterizing CICT decisions for predicting regulatory edges using L2 evaluation of the mESC dataset with the lof/gof ground truth. (A) The top five CICT predictor features of regulatory relationships in the mESC network and their relative importance. (B) Density plot of values of a top CICT feature (in log2 scale) from the regulatory edges and random edges. (C) Violin plot of values of the top five CICT features for regulatory edges (red violins on left) and random edges (blue violins on right). Values plotted are log2 frequency of the binned original values. (D) Parallel graph of the standardized median of the top five CICT features for true regulatory edges (red), random edges (blue) and reverse of regulatory edges (gray). The brown dashed line shows that a decision stump (a one-node decision tree) can discriminate true regulatory edges from random edges and reverse regulatory edges, and the dark blue dashed line shows a second decision stump that discriminates reverse of regulatory edges from random edges.
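A decision stump of the kind mentioned in panel D is a single threshold on a single feature. The sketch below fits one by exhaustive search over candidate thresholds, choosing the split with the highest accuracy. It is an illustration of the general technique, not the model used in the paper; the function name and the toy feature values are hypothetical.

```python
import numpy as np

def fit_stump(x, y):
    """Return (accuracy, threshold, greater_is_positive) for the best split."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=bool)
    xs = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = (xs[:-1] + xs[1:]) / 2
    best = (0.0, None, None)
    for thr in candidates:
        for greater_is_positive in (True, False):
            pred = (x > thr) if greater_is_positive else (x <= thr)
            acc = float(np.mean(pred == y))
            if acc > best[0]:
                best = (acc, float(thr), greater_is_positive)
    return best

# Toy feature values: regulatory edges (True) sit above random edges (False)
feature = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]
is_regulatory = [False, False, False, True, True, True]
accuracy, threshold, greater_is_positive = fit_stump(feature, is_regulatory)
```

When one feature separates the two classes cleanly, as in this toy case, a single threshold is enough, which is the situation the dashed lines in panel D depict.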

**Figure 7**
Analysis of CICT-inferred network for the mESC dataset. (A) The 20 genes with the largest number of predicted targets in CICT-inferred network (left) and in the random forest–inferred network (right) were selected, and the predicted targets for each gene were evaluated for enrichment in the GO term “stem cell population maintenance” (GO:0019827). For each gene, the bar plot shows the enrichment P-values of its predicted targets and the color of the bar indicates whether the gene itself is annotated in this GO term. (B, C) Predicted regulators of Pou5f1 (B) and Sox2 (C) in the CICT-inferred network. The distance between nodes is inversely proportional to CICT-predicted edge weights, while the thickness of an edge is proportional to the predicted edge weight. Edges are colored by whether their direction was found in the ground truth (none were in the learning set), confirmed in the recent literature or remained inconclusive. Evidence from literature is presented in Table S5.

Grants and funding

R35 GM138143/GM/NIGMS NIH HHS/United States
