PLoS Comput Biol. 2023 Oct 2;19(10):e1011476.

doi: 10.1371/journal.pcbi.1011476. eCollection 2023 Oct.

XA4C: eXplainable representation learning via Autoencoders revealing Critical genes

Qing Li¹, Yang Yu², Pathum Kossinna¹, Theodore Lun¹, Wenyuan Liao², Qingrun Zhang¹,²,³,⁴

Affiliations

¹ Department of Biochemistry & Molecular Biology, University of Calgary, Calgary, Canada.
² Department of Mathematics and Statistics, University of Calgary, Calgary, Canada.
³ Alberta Children's Hospital Research Institute, University of Calgary, Calgary, Canada.
⁴ Arnie Charbonneau Cancer Institute, University of Calgary, Calgary, Canada.

PMID: 37782668
PMCID: PMC10569512
DOI: 10.1371/journal.pcbi.1011476

Qing Li et al. PLoS Comput Biol. 2023.

Abstract

Machine Learning models have been frequently used in transcriptome analyses. In particular, Representation Learning (RL) models, e.g., autoencoders, are effective in learning critical representations from noisy data. However, learned representations, e.g., the "latent variables" in an autoencoder, are difficult to interpret, let alone use to prioritize essential genes for functional follow-up. In contrast, traditional analyses identify important genes such as Differentially Expressed (DiffEx), Differentially Co-Expressed (DiffCoEx), and Hub genes. Intuitively, complex gene-gene interactions may be beyond the reach of marginal effects (DiffEx) or correlations (DiffCoEx and Hub), indicating the need for powerful RL models. However, the lack of interpretability and of individual target genes is an obstacle to RL's broad use in practice. To facilitate interpretable analysis and gene identification using RL, we propose "Critical genes", defined as genes that contribute highly to learned representations (e.g., latent variables in an autoencoder). As a proof of concept, supported by eXplainable Artificial Intelligence (XAI), we implemented eXplainable Autoencoder for Critical genes (XA4C), which quantifies each gene's contribution to latent variables, based on which Critical genes are prioritized. Applying XA4C to gene expression data in six cancers showed that Critical genes capture essential pathways underlying cancers. Remarkably, Critical genes have little overlap with Hub or DiffEx genes but show higher enrichment in a comprehensive disease gene database (DisGeNET) and a cancer-specific database (COSMIC), evidencing their potential to reveal substantial unknown biology. As an example, we discovered five Critical genes sitting at the center of the Lysine degradation (hsa00310) pathway, displaying distinct interaction patterns in tumor and normal tissues. In conclusion, XA4C facilitates explainable analysis using RL, and Critical genes discovered by explainable RL empower the study of complex interactions.

Copyright: © 2023 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. The XA4C model and potential downstream analysis.**
(A) An autoencoder is constructed to learn representations (i.e., latent variables) of input gene expression profiles. (B) XGBoost and TreeSHAP are utilized to evaluate SHAP values and Critical indexes for all genes. (C) Critical genes are the ones with the top 1% Critical indexes. (D) KEGG pathway enrichment identifies sensible pathways overrepresented by prioritized genes with SHAP values. (E) Connectivity analysis discloses interaction patterns among genes centered by Critical genes in pathways.
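
Steps (B) and (C) can be illustrated with a minimal sketch. It assumes that SHAP values of every gene for every latent variable have already been computed and are stored as nested lists indexed by latent variable, sample, and gene. The aggregation used here (mean absolute SHAP value across samples, summed over latent variables) and the names `critical_indexes` and `top_critical_genes` are illustrative assumptions, not the authors' implementation.

```python
import math

def critical_indexes(shap_values, genes):
    """Aggregate SHAP values into one score per gene.

    Assumed layout: shap_values[l][s][g] is the SHAP value of gene g
    for latent variable l in sample s.
    """
    scores = [0.0] * len(genes)
    for per_latent in shap_values:
        n_samples = len(per_latent)
        for g in range(len(genes)):
            # Mean absolute contribution of gene g to this latent variable.
            mean_abs = sum(abs(row[g]) for row in per_latent) / n_samples
            scores[g] += mean_abs
    return dict(zip(genes, scores))

def top_critical_genes(index_by_gene, fraction=0.01):
    """Return the genes in the top `fraction` of indexes (at least one gene)."""
    n_top = max(1, math.ceil(fraction * len(index_by_gene)))
    ranked = sorted(index_by_gene, key=index_by_gene.get, reverse=True)
    return ranked[:n_top]

# Toy input with made-up numbers: 2 latent variables, 2 samples, 4 genes.
genes = ["G1", "G2", "G3", "G4"]
shap_values = [
    [[0.5, 0.0, -0.1, 0.0], [0.3, 0.0, 0.1, 0.0]],
    [[-0.2, 0.0, 0.4, 0.0], [0.2, 0.0, -0.2, 0.0]],
]
indexes = critical_indexes(shap_values, genes)
critical = top_critical_genes(indexes, fraction=0.25)
```

With a whole transcriptome the default `fraction=0.01` corresponds to the top 1% cutoff in panel (C); the toy call uses 25% only because it has four genes.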

**Fig 2. Whole transcriptome Critical indexes of genes in six cancers.**
(A) The 30 genes with the largest Critical indexes, summarized across all latent variables and averaged across samples. (B) Distribution of whole transcriptome Critical indexes for all genes. (C) Distribution of whole transcriptome Critical indexes for Critical genes.

**Fig 3. Pathway enrichment of whole-transcriptome genes.**
(A) Top 20 KEGG pathways enriched by genes with non-zero Critical indexes. The p-values are listed in S2 Table. (B) Comparison of pathway enrichment for genes prioritized by XA4C, DiffEx analysis, and DiffCoEx analysis.
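
Pathway over-representation of this kind is commonly scored with a one-sided hypergeometric test. The sketch below shows that calculation under the assumption that enrichment is tested this way; the caption does not state which test was used, and the function name `enrichment_pvalue` and the toy counts are hypothetical.

```python
from math import comb

def enrichment_pvalue(n_background, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_selected:   prioritized genes (e.g., non-zero Critical index)
    n_overlap:    prioritized genes that fall in the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy counts: 20 background genes, 5 in the pathway, 4 prioritized, 3 overlapping.
p = enrichment_pvalue(20, 5, 4, 3)
```

In a real analysis the p-values across all pathways would also need multiple-testing correction, which this sketch omits.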

**Fig 4. Generation and analysis of within-pathway Critical genes.**
(A) Distribution of R² (in testing samples) of pathway AEs in six cancers. (B) Overlaps between Critical genes and Hub genes (identified by WGCNA). (C) Overlaps between Critical genes and DiffEx genes. (D) Numbers of Critical, Hub, and DiffEx genes validated by DisGeNET. (E) Percentage of Critical, Hub, and DiffEx genes validated by DisGeNET.
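
The comparisons in panels (B) to (E) reduce to set operations on gene lists. A minimal sketch, assuming each method yields a list of gene identifiers and that a reference database supplies a set of disease-associated genes; the function name `overlap_summary` and the placeholder gene names are hypothetical.

```python
def overlap_summary(critical, hub, diffex, validated):
    """Count overlaps with Critical genes and hits in a reference gene set."""
    sets = {"Critical": set(critical), "Hub": set(hub), "DiffEx": set(diffex)}
    reference = set(validated)
    summary = {
        "critical_and_hub": len(sets["Critical"] & sets["Hub"]),
        "critical_and_diffex": len(sets["Critical"] & sets["DiffEx"]),
    }
    for name, genes in sets.items():
        hits = len(genes & reference)
        # Percentage of this gene set that appears in the reference set.
        percent = 100.0 * hits / len(genes) if genes else 0.0
        summary[name] = (hits, percent)
    return summary

# Placeholder gene names, not real results.
summary = overlap_summary(
    critical=["G1", "G2", "G3", "G4"],
    hub=["G4", "G5"],
    diffex=["G6", "G7", "G8"],
    validated=["G1", "G2", "G4", "G6"],
)
```

Reporting both the count and the percentage matters because the three gene sets can differ in size, as panels (D) and (E) distinguish.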

**Fig 5. Critical genes show distinct co-expression networks in tumor and normal tissues.**
The Lysine degradation pathway (hsa00310) is used. Critical genes (light blue) are located at the core of the network, surrounded by additional genes from the same pathway (gray). Pearson's correlation coefficients range from +0.8 (red) to -0.8 (blue). Boxplots show the distributions of the two sets of correlations (tumor vs. normal), together with the P-value of the Kolmogorov-Smirnov test, whose null hypothesis is that the two samples were drawn from the same distribution. The Critical genes shown in this figure are novel in that they were not identified by traditional analyses that search for Hub or DiffEx genes.
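
The two quantities behind this figure, Pearson's correlation between gene pairs and the two-sample Kolmogorov-Smirnov statistic comparing tumor and normal correlations, can be sketched in plain Python. The function names and toy numbers are illustrative assumptions. The sketch returns only the KS statistic; a P-value would normally come from a statistics library such as `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Made-up expression values for one Critical gene and one pathway gene.
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])

# Made-up correlation sets for tumor and normal tissue.
tumor_corrs = [0.70, 0.80, 0.60, 0.75]
normal_corrs = [0.10, -0.20, 0.00, 0.15]
d = ks_statistic(tumor_corrs, normal_corrs)
```

A statistic near 1 means the two correlation distributions barely overlap, which is the pattern the boxplots are meant to show.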
