Mol Med. 2023 Jan 24;29(1):12.
doi: 10.1186/s10020-023-00603-y.

Gene targeting in amyotrophic lateral sclerosis using causality-based feature selection and machine learning

Kyriaki Founta et al. Mol Med.

Abstract

Background: Amyotrophic lateral sclerosis (ALS) is a rare, progressive neurodegenerative disease that affects upper and lower motor neurons. Because the molecular basis of the disease remains elusive, high-throughput sequencing technologies, combined with data mining and machine learning methods, could help identify pathogenetic mechanisms. High dimensionality is a major problem when applying machine learning to biomedical data, because the number of available features far exceeds the limited number of samples. The aim of this study was to develop a methodology for training interpretable machine learning models that classify ALS and ALS-subtype samples from gene expression datasets.

Methods: We reduced the dimensionality of gene expression data with a semi-automated, systematic preprocessing and gene selection procedure built on Statistically Equivalent Signature (SES), a causality-based feature selection algorithm, and then trained machine learning classifiers with Boosted Regression Trees (XGBoost) and Random Forest. SHapley Additive exPlanations (SHAP) values were used to interpret the classifiers. The methodology was developed and tested using two distinct publicly available ALS RNA-seq datasets. We evaluated the performance of SES as a dimensionality reduction method against (a) Least Absolute Shrinkage and Selection Operator (LASSO) and (b) Local Outlier Factor (LOF).
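The select-then-classify workflow described in the Methods can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it uses synthetic data from scikit-learn in place of RNA-seq counts, an L1-penalised linear model (LASSO-style) as a stand-in for SES, which the authors ran via the R package MXM, and Random Forest impurity importances as a simpler stand-in for SHAP values. All sample sizes, feature counts, and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for an expression matrix: few samples, many "genes",
# three classes (e.g. two disease subtypes and controls). Sizes are assumed.
X, y = make_classification(
    n_samples=90, n_features=2000, n_informative=15, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # L1-penalised selector (LASSO-style). This is NOT the SES algorithm;
    # it only plays the same role of shrinking the feature set.
    ("select", SelectFromModel(
        LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000,
                  random_state=0),
        threshold=-np.inf,   # rank by |coefficient| only
        max_features=50,     # keep the 50 highest-ranked features
    )),
    ("forest", RandomForestClassifier(n_estimators=300, random_state=0)),
])

# Fitting the whole pipeline on the training split keeps feature selection
# from seeing the held-out samples.
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)

# Map the forest's importances back to column indices of the original matrix.
selected = pipeline.named_steps["select"].get_support(indices=True)
importances = pipeline.named_steps["forest"].feature_importances_
top = selected[np.argsort(importances)[::-1][:5]]

print(f"held-out accuracy: {accuracy:.4f}")
print("selected features:", len(selected), "of", X.shape[1])
print("top-ranked feature indices:", top.tolist())
```

Wrapping selection and classification in a single pipeline is a deliberate choice here: with far more features than samples, selecting features on the full dataset before splitting would leak information into the test estimate.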

Results: The proposed methodology achieved 85.18% accuracy for the classification of cerebellum or frontal cortex samples as C9orf72-related familial ALS, sporadic ALS or healthy samples. Importantly, the genes identified as the most determinative have also been reported as disease-associated in the ALS literature. When tested on the evaluation dataset, the methodology achieved 88.89% accuracy for the classification of sporadic ALS motor neuron samples. When LASSO was used as the feature selection method instead of SES, the accuracy of the machine learning classifiers ranged from 74.07% to 96.30%, depending on the tissue assessed, while LOF underperformed significantly (77.78% accuracy for the classification of pooled cerebellum and frontal cortex samples).

Conclusions: Using SES, we addressed the challenge of high dimensionality in gene expression data analysis, and we trained accurate machine learning ALS classifiers, specific for the gene expression patterns of different disease subtypes and tissue samples, while identifying disease-associated genes.

Keywords: Causality-based feature selection; Dimensionality reduction; Gene expression; Machine learning.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1. Schematic of the developed methodology
Fig. 2. Differential expression levels (Log2 fold-change values) of the SES-selected genes in the sporadic ALS differential expression dataset (top panel), and the genes that were not selected by SES in the sporadic ALS differential expression dataset (bottom panel)
Fig. 3. Differential expression levels (Log2 fold-change values) of the SES-selected genes in the C9orf72-related familial ALS differential expression dataset (top panel), and the genes that were not selected by SES in the C9orf72-related familial ALS differential expression dataset (bottom panel)
