ACS Omega. 2023 Nov 7;8(46):43813-43826.
doi: 10.1021/acsomega.3c05664. eCollection 2023 Nov 21.

Machine Learning Approaches Identify Chemical Features for Stage-Specific Antimalarial Compounds

Ashleigh van Heerden et al. ACS Omega. 2023.

Abstract

Efficacy data from diverse chemical libraries, screened against the various stages of the malaria parasite Plasmodium falciparum, including asexual blood stage (ABS) parasites and transmissible gametocytes, serve as a valuable reservoir of information on the chemical space of compounds that are active or inactive against the parasite. We postulated that these data can be mined to define chemical features associated with ABS activity alone and/or those that provide additional life cycle activity profiles, such as gametocytocidal activity. Additionally, this information could provide chemical features associated with inactive compounds, which could eliminate unnecessary future screening of similar chemical analogs. Therefore, we aimed to use machine learning to identify the chemical space associated with stage-specific antimalarial activity. We collected data from various chemical libraries that were screened against the asexual (126 374 compounds) and sexual (gametocyte) stages of the parasite (93 941 compounds), calculated the compounds' molecular fingerprints, and trained machine learning models to recognize stage-specific active and inactive compounds. We were able to build several models that predict compound activity against ABS and dual activity against ABS and gametocytes, with Support Vector Machines (SVM) showing superior abilities with high recall (90 and 66%) and low false-positive predictions (15 and 1%). This allowed the identification of chemical features enriched in active and inactive populations, an important outcome that could be mined for essential chemical features to streamline hit-to-lead optimization strategies of antimalarial candidates. The predictive capabilities of the models held true in diverse chemical spaces, indicating that the ML models are robust and can serve as a prioritization tool to drive and guide phenotypic screening and medicinal chemistry programs.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
ABS and dual-active database assembly and preprocessing for model building. (A) The pipeline used for data assembly, curation, processing, chemical featurization and model building. Data from phenotypic screening of chemical libraries against ABS and/or gametocytes were used for binary definition of active and inactive compounds. Class imbalance was addressed via cluster-based undersampling, and the compounds were converted into ECFP or MACCS molecular descriptors to allow model building for compound activity prediction (created with http://biorender.com/). (B,E) Class imbalance in the ABS (B) and dual-active (E) data sets after binary classification of activity vs inactivity based on criteria as specified in the original screens. (C,F) UMAP projection of the chemical space in the databases before (left-hand image) and after (right-hand image) cluster-based undersampling on inactive compounds for each database. (D,G) Distribution of active vs inactive compounds for the ABS (D) and dual-active (G) data sets after cluster-based undersampling.
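As a rough illustration of the cluster-based undersampling mentioned in this caption, the sketch below clusters the inactive (majority) class and keeps one representative per cluster. The clustering algorithm used by the authors is not stated here, so the greedy leader clustering, the Tanimoto threshold, the toy fingerprints, and all function names are assumptions made only for illustration. Fingerprints are represented as sets of on-bit indices.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leader_clusters(fingerprints, threshold):
    """Greedy leader clustering (an assumed, illustrative choice).

    Each fingerprint joins the first existing leader whose similarity is at
    least `threshold`; otherwise it becomes the leader of a new cluster.
    """
    leaders = []  # index of each cluster's representative
    members = []  # members[i] holds the indices assigned to leaders[i]
    for idx, fp in enumerate(fingerprints):
        for c, lead in enumerate(leaders):
            if tanimoto(fp, fingerprints[lead]) >= threshold:
                members[c].append(idx)
                break
        else:
            leaders.append(idx)
            members.append([idx])
    return leaders, members


def undersample_inactives(inactive_fps, threshold=0.5):
    """Keep one representative (the leader) per cluster of inactive compounds."""
    leaders, _ = leader_clusters(inactive_fps, threshold)
    return leaders


# Toy fingerprints, invented for illustration (not data from the paper).
toy_inactives = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
kept = undersample_inactives(toy_inactives, threshold=0.5)
```

Only the inactive class is reduced in this sketch; the active compounds would be left untouched, which narrows the class imbalance while still sampling each region of inactive chemical space.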
Figure 2
Performance of different ML algorithms in identifying compounds with ABS activity. (A) ROC–AUC curves showing performance of different ML algorithms in predicting compounds with ABS activity when trained on the ECFP of compounds after fivefold cross-validation. Inset indicates AUC mean values ± standard deviation. (B) ROC–AUC curves showing the performance of different ML algorithms on the imbalanced test set. (C) Model performance metrics associated with the performance of the different models in predicting the imbalanced test set data. The F1-score evaluated model performance on imbalanced data, whereas G-mean scores determined how well models could optimize sensitivity and specificity. Recall and precision indicated accuracy of activity predictions, whereas false-positive rate (FPR) indicated error within predictions.
Figure 3
Performance of different ML algorithms in identifying compounds with dual activity. (A) ROC–AUC curves showing fivefold cross-validation performance of different ML algorithms in predicting compounds with dual activity when trained on the ECFP of compounds. Inset indicates AUC mean values ± standard deviation. (B) ROC–AUC curves showing performance of different ML algorithms on the imbalanced test set. (C) Model performance metrics associated with the performance of the different models in predicting imbalanced test set data. The F1-score evaluated model performance on imbalanced data, whereas G-mean scores determined how well models were able to optimize sensitivity and specificity. Recall and precision indicated accuracy of activity predictions, whereas false-positive rate (FPR) indicated error within predictions.
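The metrics named in the captions of Figures 2 and 3 can all be derived from a confusion matrix. The sketch below uses the standard textbook definitions (G-mean taken as the geometric mean of sensitivity and specificity); the function name and the counts are illustrative assumptions, not values from the paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Recall, precision, FPR, F1 and G-mean from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # sensitivity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0         # false-positive rate
    specificity = 1.0 - fpr
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    g_mean = math.sqrt(recall * specificity)
    return {
        "recall": recall,
        "precision": precision,
        "fpr": fpr,
        "f1": f1,
        "g_mean": g_mean,
    }


# Invented counts for an imbalanced test set: 100 actives, 1000 inactives.
metrics = classification_metrics(tp=90, fp=150, tn=850, fn=10)
```

With these invented counts, recall is high (0.90) while precision is much lower (0.375), because even a 15% false-positive rate on a large inactive class produces more false positives than there are true actives. This is why the captions report F1 and G-mean alongside recall when the test set is imbalanced.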
Figure 4
Enriched ECFP features within inactive and active compounds for stage-specific antiplasmodial action. (A) The proportion of active/inactive compounds against ABS containing a specific ECFP feature is plotted as circles, with the size of the circles corresponding to the RF permutation score of the ECFP feature. Enrichment of a feature toward active compounds compared to inactive compounds is indicated by the p-value color obtained from the Z-test on two proportions. The top 100 enriched ECFP features within active (white) and the top 67 enriched ECFP features within inactive (black) compounds were selected according to the RF score and p-value. (B) The proportion of dual-active/inactive compounds containing a specific ECFP feature is plotted as circles with the size corresponding to the RF permutation score of the ECFP feature. Enrichment of a feature toward dual-active compounds compared to inactive compounds is indicated by the p-value color. The top 100 enriched ECFP features within dual-active (white) and the top 52 enriched ECFP features within inactive (black) compounds were selected according to RF score and p-value. (C) Comparison of the unique ECFP features associated with activity against ABS (52) or dual stages (52). For the top unique ECFP features, structural elements are indicated, with all features summarized in File S1.
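The Z-test on two proportions referred to in this caption can be sketched as follows. This is the standard pooled two-proportion test with a two-sided p-value; whether the authors used a one- or two-sided test, or any multiple-testing correction, is not stated here, so those choices and the example counts are assumptions.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion Z-test; returns (z, two-sided p-value).

    hits_a / n_a: compounds containing the feature among, e.g., actives.
    hits_b / n_b: compounds containing the feature among, e.g., inactives.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Feature present in all or in none of the compounds: no evidence.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


# Invented counts: a feature found in 120 of 1000 actives and 40 of 1000 inactives.
z_stat, p_val = two_proportion_z_test(120, 1000, 40, 1000)
```

A positive z here means the feature is more frequent in the first group (actives in this example); a negative z would indicate enrichment among inactives.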
Figure 5
Performance of the top models against unseen chemical matter. To evaluate model robustness, models were exposed to extreme data sets from the Pandemic Response Box (PRB box) (A) and Pathogen Box (D) that were individually distinct, chemically diverse (displayed within the context of the launched drug chemical space (available on StarDrop v 7.3.0), with heatbars indicating potency) and had differential activity against ABS and gametocytes. ABS and dual-activity models trained on ECFP descriptors were evaluated for their activity predictions within the PRB box (B) and the Pathogen Box (E) for F1-scores (model performance exposed to imbalanced data) and G-mean scores (ability to optimize sensitivity and specificity), recall, precision, and false-positive rate (FPR). The hit rate of the best performing model for predicting ABS and/or dual activity within these chemical spaces (C,F) was compared to random selection. The enrichment factor (EF) of models was also calculated for the top 10 and top 50 compounds to determine how effective models were in prioritizing active compounds.
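The hit rate and enrichment factor described in this caption can be illustrated with a short sketch. It assumes the common definition of EF at top N: the hit rate among the N highest-scoring compounds divided by the hit rate of the whole set (the expectation under random selection). The exact formula used by the authors is not given here, and the scores and labels below are invented.

```python
def hit_rate(labels):
    """Fraction of compounds that are active (labels are 1 = active, 0 = inactive)."""
    return sum(labels) / len(labels)


def enrichment_factor(scores, labels, top_n):
    """Hit rate in the top_n highest-scoring compounds relative to the overall hit rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_labels = [label for _, label in ranked[:top_n]]
    baseline = hit_rate(labels)
    if baseline == 0:
        return float("nan")  # no actives at all: EF is undefined
    return hit_rate(top_labels) / baseline


# Invented model scores and true activity labels for ten compounds.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]

ef_top2 = enrichment_factor(scores, labels, top_n=2)
ef_top5 = enrichment_factor(scores, labels, top_n=5)
```

An EF above 1 means that picking compounds by model score finds actives more often than random selection from the same library would.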