EBioMedicine. 2022 Jan:75:103800.
doi: 10.1016/j.ebiom.2021.103800. Epub 2022 Jan 10.

Functional coding haplotypes and machine-learning feature elimination identifies predictors of Methotrexate Response in Rheumatoid Arthritis patients

Ashley J W Lim et al. EBioMedicine. 2022 Jan.

Abstract

Background: Major challenges in large-scale genetic association studies include not only identifying causative single nucleotide polymorphisms (SNPs) but also accounting for SNP-SNP interactions. This study proposes a novel feature engineering approach that integrates potentially functional coding haplotypes (pfcHap) with machine-learning (ML) feature selection. The aim is to identify biologically meaningful, possibly causative genetic factors that take into account potential SNP-SNP interactions within the pfcHap and that best predict methotrexate (MTX) response in rheumatoid arthritis (RA) patients.
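To illustrate why a haplotype can carry SNP-SNP interaction information that single SNPs do not, the following minimal sketch turns phased alleles into per-patient haplotype counts. Everything in it is an assumption made for illustration: the patient names, the three SNPs, the alleles, and the 0/1/2 dosage encoding are hypothetical and are not taken from the study.

```python
from collections import Counter

# Hypothetical phased genotypes: for each patient, two chromosome copies,
# each listing the allele carried at three coding SNPs of one gene.
phased = {
    "patient_1": [("A", "G", "T"), ("A", "G", "T")],
    "patient_2": [("A", "G", "T"), ("C", "G", "C")],
    "patient_3": [("C", "A", "C"), ("C", "G", "C")],
}


def haplotype_dosage(phased_genotypes):
    """Map each patient to {haplotype string: number of copies carried}."""
    return {
        patient: dict(Counter("".join(copy) for copy in copies))
        for patient, copies in phased_genotypes.items()
    }


dosage = haplotype_dosage(phased)

# Column order: every haplotype observed in the cohort, sorted for stability.
haplotypes = sorted({h for counts in dosage.values() for h in counts})

# One row per patient, one column per haplotype, values in {0, 1, 2}.
matrix = [[dosage[p].get(h, 0) for h in haplotypes] for p in phased]

print(haplotypes)
for patient, row in zip(phased, matrix):
    print(patient, row)
```

In this toy encoding, the haplotypes CAC and CGC differ only at the second SNP, so each column represents a specific combination of alleles on one chromosome rather than three independent SNP effects.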

Methods: Exome sequencing data from 349 RA patients were analysed, with the patients split into a training set and an unseen test set. Inferred pfcHaps were combined with 30 non-genetic features and subjected to ML recursive feature elimination with cross-validation on the training set. The predictive capacity and robustness of the selected features were assessed with six popular machine learning models, using cross-validation within the training set and evaluation on the unseen test set.
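The assessment step (one feature set, six model families, cross-validation within the training set) might look roughly like the scikit-learn sketch below. It is not the authors' code. The data are synthetic, and the specific estimators and hyperparameters are assumptions: boosted trees are represented by `GradientBoostingClassifier`, and the elastic net by an elastic-net-penalised logistic regression.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training set restricted to the selected features.
X_train, y_train = make_classification(
    n_samples=279, n_features=20, n_informative=5, random_state=0
)

# Six model families; scale-sensitive models are wrapped with a scaler.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "Boosted Trees": GradientBoostingClassifier(random_state=0),
    "Elastic Net": make_pipeline(
        StandardScaler(),
        LogisticRegression(
            penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
        ),
    ),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
}

# Stratified 5-fold cross-validation, scored by area under the ROC curve.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, auc in cv_auc.items():
    print(f"{name}: mean CV AUC = {auc:.3f}")
```

Using the same fold object for every model keeps the comparison on identical data partitions, which is one reasonable way to read "robustness across models".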

Findings: Notably, a set of 100 features (95 pfcHaps and 5 non-genetic factors) showed good performance in predicting MTX response in RA patients across all six ML models in the unseen test dataset (AUC: 0.776-0.828; Sensitivity: 0.656-0.813; Specificity: 0.684-0.868).
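For readers less familiar with the three reported metrics, this small worked example computes AUC, sensitivity, and specificity from predicted scores. The labels and scores are made up, responders are assumed to be coded as 1, and the 0.5 decision threshold is an assumption; the abstract does not state which threshold was used.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up test-set labels (1 = responder, 0 = non-responder) and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# AUC is threshold-free: the fraction of (responder, non-responder) pairs
# in which the responder receives the higher score.
auc = roc_auc_score(y_true, y_score)

# Sensitivity and specificity need hard predictions, hence a threshold.
threshold = 0.5  # assumed, not taken from the source
y_pred = [int(score >= threshold) for score in y_score]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # responders correctly identified
specificity = tn / (tn + fp)  # non-responders correctly identified

print(f"AUC={auc:.4f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
# prints: AUC=0.8125 sensitivity=0.75 specificity=0.75
```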

Interpretation: The majority of the predictive pfcHap SNPs were predicted to be potentially functional, and some of the genes in which the pfcHaps reside were found to be associated with previously reported MTX/RA pathways.

Funding: Singapore Ministry of Health's National Medical Research Council (NMRC) [NMRC/CBRG/0095/2015; CG12Aug17; CGAug16M012; NMRC/CG/017/2013]; National Cancer Center Research Fund and block funding Duke-NUS Medical School; Singapore Ministry of Education Academic Research Fund Tier 2 grant MOE2019-T2-1-138.

Keywords: Feature selection; Genetic polymorphism; Haplotypes; Machine learning; Methotrexate; Rheumatoid Arthritis.

Conflict of interest statement

Declaration of Competing Interest: CGL, KPL, CCK, SSC, AJWL, and LJL declare that they have a pending Coversheet IP application.

Figures

Figure 1
Pipeline employed to identify the predictors of MTX response. The 349 patient samples were first divided into training (n=279, 70%) and test (n=70, 30%) sets using a stratified split, so that each set contains proportions of responders and non-responders representative of the original dataset. The training set was then further split into eight subsets of different sample sizes, ranging from 30% to 100% of the samples in the training set. Within each subset, the important features (coding haplotypes alone, or coding haplotypes integrated with non-genetic features) were selected using recursive feature elimination with cross-validation (RFECV), with a Random Forest Classifier as the estimator. The features identified as important in all eight subsets were then shortlisted as the set of important features predictive of MTX response. The predictive performance of these features was assessed in six different machine learning models, using cross-validation within the training set and the unseen test dataset.
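The selection part of this pipeline can be sketched with scikit-learn as follows. This is an illustrative reading of the caption on synthetic data, not the authors' implementation. The set sizes use the counts given in the caption (279 and 70); the even 10% steps between subset sizes, the stratified subsampling, the 3-fold inner cross-validation, the AUC scoring, and all hyperparameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for 349 patients with 20 candidate features.
X, y = make_classification(
    n_samples=349, n_features=20, n_informative=5, random_state=0
)

# Stratified split that keeps the class balance in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=70, stratify=y, random_state=0
)

selected_per_subset = []
for pct in range(30, 101, 10):  # eight subset sizes: 30%, 40%, ..., 100%
    if pct == 100:
        X_sub, y_sub = X_train, y_train
    else:
        # Stratified subsample of the training set (assumed sampling scheme).
        X_sub, _, y_sub, _ = train_test_split(
            X_train, y_train, train_size=pct / 100,
            stratify=y_train, random_state=0,
        )
    # RFECV drops the least important features step by step and keeps the
    # feature count with the best cross-validated score.
    selector = RFECV(
        estimator=RandomForestClassifier(n_estimators=25, random_state=0),
        step=4,
        cv=StratifiedKFold(n_splits=3),
        scoring="roc_auc",
        min_features_to_select=1,
    )
    selector.fit(X_sub, y_sub)
    selected_per_subset.append(set(np.flatnonzero(selector.support_).tolist()))

# Shortlist only the features selected in every one of the eight subsets.
consensus = set.intersection(*selected_per_subset)
print("features selected in all eight subsets:", sorted(consensus))
```

Requiring a feature to survive in every subset is a conservative stability filter: features that are selected only at particular sample sizes are discarded, and the consensus set can be small or even empty on noisy data.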
Figure 2
Predictive performance of 120 haplotypes (from Haplotype-only analysis) in the training set using 5-fold cross-validation. ROC curves of 120 haplotypes using (a) Random Forest, (b) Logistic Regression, (c) Support Vector Machine, (d) Boosted Trees, (e) Elastic Net, and (f) Neural Network.
Figure 3
Number of important features (coding haplotypes and non-genetic factors) identified in eight training subsets of variable sample sizes. Columns represent the different training subsets and each row represents a feature. The intensity of red represents the importance of the feature in each subset (greater intensity indicates greater importance); black represents features that were not found to be important in the respective subset.
Figure 4
Predictive performance of 95 haplotypes and 5 non-genetic factors in the training set using 5-fold cross-validation. ROC curves of 95 haplotypes and 5 non-genetic factors using (a) Random Forest, (b) Logistic Regression, (c) Support Vector Machine, (d) Boosted Trees, (e) Elastic Net, and (f) Neural Network.
Figure 5
Predictive performance of 95 haplotypes and 5 non-genetic factors in the unseen test set. ROC curves of 95 haplotypes and 5 non-genetic factors using (a) Random Forest, (b) Logistic Regression, (c) Support Vector Machine, (d) Boosted Trees, (e) Elastic Net, and (f) Neural Network.
