BMC Bioinformatics. 2022 Aug 7;23(1):325.
doi: 10.1186/s12859-022-04870-0.

Risk score prediction model based on single nucleotide polymorphism for predicting malaria: a machine learning approach

Kah Yee Tai et al. BMC Bioinformatics.

Abstract

Background: Malaria risk prediction is currently limited to advanced statistical methods, such as time series and cluster analysis, applied to epidemiological data. Machine learning models have also been explored to study the complexity of malaria through blood smear images and environmental data. However, to the best of our knowledge, no study has analysed the contribution of single nucleotide polymorphisms (SNPs) to malaria using a machine learning model. This study aims to quantify an individual's susceptibility to developing malaria using risk scores derived from the cumulative effects of SNPs, known as weighted genetic risk scores (wGRS).
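As an illustration of the "cumulative effects of SNPs" idea, the following minimal sketch computes a weighted genetic risk score under a commonly used convention: each SNP contributes its risk-allele count (0, 1 or 2) multiplied by the natural log of its odds ratio. The abstract does not state the exact weighting used in the study, so this formula, the function name `weighted_genetic_risk_score`, the SNP identifiers and the odds ratios are all assumptions made for illustration only.

```python
import math


def weighted_genetic_risk_score(risk_allele_counts, odds_ratios):
    """Sum risk-allele counts weighted by log odds ratio (assumed convention).

    risk_allele_counts: dict mapping SNP id -> 0, 1 or 2 risk alleles
    odds_ratios:        dict mapping SNP id -> odds ratio (> 0)
    """
    if risk_allele_counts.keys() != odds_ratios.keys():
        raise ValueError("both inputs must cover the same SNPs")
    score = 0.0
    for snp, count in risk_allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"invalid allele count for {snp}: {count}")
        # The weight of a SNP is the log of its odds ratio.
        score += math.log(odds_ratios[snp]) * count
    return score


# Hypothetical individual and made-up odds ratios (not taken from the study).
individual = {"snp_a": 1, "snp_b": 2, "snp_c": 0}
example_odds_ratios = {"snp_a": 2.0, "snp_b": 1.5, "snp_c": 1.2}

score = weighted_genetic_risk_score(individual, example_odds_ratios)
print(round(score, 4))
```

Under this convention a SNP with an odds ratio below 1 receives a negative weight, so protective variants lower the score rather than raise it.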

Results: We proposed an SNP-based feature extraction algorithm that incorporates information on an individual's susceptibility to malaria to generate the feature set. Because learning from many SNPs can be computationally expensive for a machine learning model, we reduced the feature set using Logistic Regression with Recursive Feature Elimination (LR-RFE) to select the SNPs that improve the efficacy of our model. Next, we calculated the wGRS of the selected feature set, which serves as the model's target variable. To provide a comparison for the wGRS-only model, we also calculated and evaluated the combination of wGRS with genotype frequency (wGRS + GF). Finally, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), and Ridge regression algorithms were used to build the machine learning models for malaria risk prediction.
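The LR-RFE step can be sketched with scikit-learn's `RFE` wrapped around `LogisticRegression`. This is not the authors' pipeline: the genotype matrix below is synthetic, the case/control labels are simulated, and the numbers of samples, SNPs and retained features are arbitrary assumptions chosen only to show the mechanics of recursive elimination.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic genotype matrix: rows are individuals, columns are SNPs,
# entries are risk-allele counts (0, 1 or 2). Purely illustrative.
n_samples, n_snps = 200, 20
X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)

# Simulated case/control labels influenced by two of the SNPs (assumption).
logit = 1.5 * X[:, 0] - 1.0 * X[:, 3] - 0.5
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Recursive feature elimination: repeatedly fit logistic regression and
# drop the SNP with the smallest absolute coefficient, one per iteration.
selector = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    step=1,
)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # column indices kept
ranking = selector.ranking_                   # 1 = kept, larger = dropped earlier
print("selected SNP columns:", selected)
```

The `ranking_` attribute gives an elimination order for all SNPs, which is one way a feature ranking could be derived; how the study computes its reported importance scores is not specified in the abstract.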

Conclusions: Our proposed approach identified SNP rs334 as the most contributing feature, with an importance score of 6.224, compared with an importance score of 1.1314 for the baseline. This result is important because prior studies have established rs334 as a major genetic risk factor for malaria. Comparing the three machine learning models showed that LightGBM achieved the best performance, with a Mean Absolute Error (MAE) of 0.0373. Furthermore, with wGRS + GF, all models performed significantly better than with wGRS alone, and LightGBM again performed best (MAE of 0.0033).
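For readers unfamiliar with the metric, MAE is the average absolute difference between predicted and true values, so lower is better. The sketch below computes it in plain Python; the function name and the example risk scores are made up for illustration and are unrelated to the study's data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted risk scores for three individuals.
true_scores = [0.5, 1.0, 1.5]
predicted_scores = [0.5, 1.5, 1.0]

mae = mean_absolute_error(true_scores, predicted_scores)
print(round(mae, 4))  # absolute errors are 0, 0.5 and 0.5, so the mean is 1/3
```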

Keywords: Feature extraction algorithm; Genetic risk factors; Machine learning; Malaria; Single nucleotide polymorphisms; Weighted genetic risk score.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1. Machine learning pipeline for individual malaria risk score prediction
Fig. 2. Methodology flow chart
Fig. 3. Pseudocode of the proposed feature extraction algorithm
Fig. 4. Overview of genotype-pattern-frequency-based features
Fig. 5. High-level pseudocode of the feature extraction and selection stage
Fig. 6. Feature importance ranking of all 104 SNPs, computed using the proposed feature extraction algorithm with LR-RFE
Fig. 7. Feature importance ranking of all 104 SNPs, computed using the benchmark feature extraction algorithm with LR-RFE
Fig. 8. Comparison of feature importance scores using different feature extraction algorithms with LR-RFE: (1) proposed algorithm and (2) benchmark algorithm
Fig. 9. Performance analysis of the wGRS-based and wGRS + GF-based models with respect to MAE scores and feature sets
