Int J Mol Sci. 2025 Feb 27;26(5):2085.
doi: 10.3390/ijms26052085.

Machine Learning Methods for Classifying Multiple Sclerosis and Alzheimer's Disease Using Genomic Data

Magdalena Arnal Segura et al. Int J Mol Sci.

Abstract

Complex diseases pose challenges in prediction due to their multifactorial and polygenic nature. This study employed machine learning (ML) to analyze genomic data from the UK Biobank, aiming to predict the genomic predisposition to complex diseases like multiple sclerosis (MS) and Alzheimer's disease (AD). We tested logistic regression (LR), ensemble tree methods, and deep learning models for this purpose. LR displayed remarkable stability across various subsets of data, outshining deep learning approaches, which showed greater variability in performance. Additionally, ML methods demonstrated an ability to maintain optimal performance despite correlated genomic features due to linkage disequilibrium. When comparing the performance of polygenic risk score (PRS) with ML methods, PRS consistently performed at an average level. By employing explainability tools in the ML models of MS, we found that the results confirmed the polygenicity of this disease. The highest-prioritized genomic variants in MS were identified as expression or splicing quantitative trait loci located in non-coding regions within or near genes associated with the immune response, with a prevalence of human leukocyte antigen (HLA) gene annotations. Our findings shed light on both the potential and the challenges of employing ML to capture complex genomic patterns, paving the way for improved predictive models.

Keywords: Alzheimer’s disease; deep learning; extremely randomized trees; gradient-boosted decision trees; logistic regression; machine learning; multiple sclerosis; polygenic risk score.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Design of the study. MS: Multiple Sclerosis; AD: Alzheimer’s Disease; UKB: UK Biobank; SNVs: Single Nucleotide Variants; HLA: Human Leukocyte Antigen; CV: Cross-Validation; GB: Gradient-Boosted Decision Trees; ET: Extremely Randomized Trees; RF: Random Forest; LR: Logistic Regression; FFN: Feedforward Neural Networks; CNN: Convolutional Neural Networks; PRS: Polygenic Risk Score; RFE: Recursive Feature Elimination; RFECV: Recursive Feature Elimination with Cross-Validation; IMSGC: International Multiple Sclerosis Genetics Consortium; ADNI: Alzheimer’s Disease Neuroimaging Initiative. More details can be found in Section 4 and in the Supplementary Materials.
Figure 2
The values of evaluation metrics across the five folds of the outer loop in the nested cross-validation, obtained by training the models on the UK Biobank (UKB) cohort and testing them on different datasets. (a) The balanced accuracy obtained when testing on the UKB cohort. (b) The balanced accuracy values for the Alzheimer’s disease (AD) models tested on the UKB and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohorts. (c) The sensitivity values for the multiple sclerosis (MS) models tested on the UKB and the International Multiple Sclerosis Genetics Consortium (IMSGC) MS cohorts. (d) The sensitivity values for the UKB and IMSGC MSRD cohorts. GB: Gradient-Boosted Decision Trees; ET: Extremely Randomized Trees; RF: Random Forest; LR: Logistic Regression; FFN: Feedforward Neural Networks; CNN: Convolutional Neural Networks.
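As a rough illustration of how per-fold balanced accuracy from a nested cross-validation can be obtained, the sketch below uses scikit-learn with logistic regression. The data are synthetic (generated by `make_classification`), and the number of inner folds, the hyperparameter grid for `C`, and the random seeds are all assumptions made for this example; they are not the authors' settings or pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for a (samples x variants) matrix; NOT UK Biobank data.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Inner loop tunes hyperparameters; outer loop (five folds) estimates performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Hypothetical grid over the regularisation strength C.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},
    scoring="balanced_accuracy",
    cv=inner_cv,
)

# Each outer fold refits the whole inner search on its training part,
# so the held-out outer fold never influences hyperparameter selection.
outer_scores = cross_val_score(
    search, X, y, scoring="balanced_accuracy", cv=outer_cv
)
print(outer_scores)
```

The five values in `outer_scores` correspond to the kind of per-fold spread that the box plots in a figure like this one summarise.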
Figure 3
Assessing the robustness of predictive models. Percentage of controls correctly classified by 0 to 6 machine learning methods in comparison with samples correctly classified as true positives (TP) or true negatives (TN), and incorrectly classified as false negatives (FN) or false positives (FP) by polygenic risk score (PRS) models. The total number of samples is indicated above the bars for the groups having the highest percentage in each comparison. Panels (a,b) show the classification of cases and controls in multiple sclerosis (MS) models, respectively. Panels (c,d) show the classification of cases and controls in Alzheimer’s disease (AD) models, respectively. (e–h) Sensitivity and specificity in the original models, and models after feature selection with Recursive Feature Elimination with Cross-Validation (RFECV) and Recursive Feature Elimination (RFE). Sensitivity and specificity in multiple sclerosis are represented in plots (e) and (f), respectively. Sensitivity and specificity in Alzheimer’s disease are represented in plots (g) and (h), respectively. GB: Gradient-Boosted Decision Trees; ET: Extremely Randomized Trees; RF: Random Forest; LR: Logistic Regression.
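The comparison in panels (a) to (d) can be pictured with a small sketch: for each sample, count how many of the six machine learning methods predicted the true label, and separately label the PRS prediction as TP, TN, FN, or FP. The function names, the 0/1 label encoding (1 = case, 0 = control), and the example predictions are hypothetical and chosen only for illustration.

```python
def count_correct_methods(true_label, predictions):
    """Return how many methods predicted the true label (0 to len(predictions))."""
    return sum(1 for predicted in predictions.values() if predicted == true_label)


def prs_outcome(true_label, prs_label):
    """Label a PRS prediction as TP, FN, TN, or FP (1 = case, 0 = control)."""
    if true_label == 1:
        return "TP" if prs_label == 1 else "FN"
    return "TN" if prs_label == 0 else "FP"


# Hypothetical predictions for a single case (true label 1); not real results.
sample_predictions = {"GB": 1, "ET": 1, "RF": 0, "LR": 1, "FFN": 0, "CNN": 1}

n_correct = count_correct_methods(1, sample_predictions)
outcome = prs_outcome(1, 0)
print(n_correct, outcome)  # 4 FN
```

Tabulating `n_correct` against `outcome` over all samples, and converting the counts to percentages, would give bars of the kind the caption describes.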
Figure 4
From left to right: (a) the allele frequency (AF) of the rs429358 (C) minor allele; (b) the percentage of individuals with the heterozygous form of the allele (C;T); (c) the percentage of individuals with the two copies of the minor allele (C;C). The x-axis represents controls, individuals with Alzheimer’s disease (AD), AD correctly classified by the six machine learning methods (TP across ML), and AD correctly classified by the six machine learning methods and the polygenic risk score (TP across ML and PRS).
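The three quantities plotted here can be computed from per-individual genotypes. The sketch below assumes each genotype is encoded as the number of copies of the minor allele C (0 for T;T, 1 for C;T, 2 for C;C); the function name and the toy genotype list are hypothetical and do not come from the study.

```python
def summarize_genotypes(minor_allele_counts):
    """Summarise genotypes encoded as minor-allele copies per individual (0, 1, or 2)."""
    n = len(minor_allele_counts)
    if n == 0:
        raise ValueError("at least one genotype is required")
    # Each individual carries two alleles, hence the 2 * n denominator.
    allele_frequency = sum(minor_allele_counts) / (2 * n)
    pct_heterozygous = 100 * minor_allele_counts.count(1) / n
    pct_homozygous_minor = 100 * minor_allele_counts.count(2) / n
    return allele_frequency, pct_heterozygous, pct_homozygous_minor


# Toy group of ten individuals; values are made up for illustration.
toy_genotypes = [0, 1, 2, 0, 1, 0, 0, 1, 0, 0]
af, pct_het, pct_hom = summarize_genotypes(toy_genotypes)
print(af, pct_het, pct_hom)  # 0.25 30.0 10.0
```

Running the same function on each group along the x-axis (controls, AD, TP across ML, TP across ML and PRS) would give one value per group for each panel.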
Figure 5
Circos plot representing all the genomic features used in the multiple sclerosis models distributed across the genome. The heatmap indicates the ranks of the features as assigned by each machine learning method, with values close to 1 in red indicating higher importance. The variants that were prioritized by at least one method are indicated with their names. The names of the single nucleotide variants (SNVs) are colored in purple if they are annotated as an expression quantitative trait locus (eQTL) or splicing quantitative trait locus (sQTL) in at least one tissue in the Genotype-Tissue Expression Portal (GTEx). The labels of missense SNVs with annotated QTLs are colored in green. The labels of chromosome 6 are excluded due to the high density of prioritized genomic variants in this chromosome. GB: Gradient-Boosted Decision Trees; ET: Extremely Randomized Trees; RF: Random Forest; LR: Logistic Regression; FFN: Feedforward Neural Networks; CNN: Convolutional Neural Networks.
Figure 6
The heatmap on the left represents the ranks of all the features on chromosome 6 as assigned by each machine learning method, with values close to 1 in red indicating higher importance. The top 10 best-ranked genomic variants on this chromosome are labeled with their corresponding names. Labels in purple indicate the presence of an expression quantitative trait locus (eQTL) or splicing quantitative trait locus (sQTL) in at least one tissue in the Genotype-Tissue Expression Portal (GTEx). The heatmap on the right indicates the presence and strength of linkage disequilibrium between pairs of genomic variants. GB: Gradient-Boosted Decision Trees; ET: Extremely Randomized Trees; RF: Random Forest; LR: Logistic Regression; FFN: Feedforward Neural Networks; CNN: Convolutional Neural Networks.
