Nat Commun. 2024 Apr 1;15(1):2817.
doi: 10.1038/s41467-024-46663-4.

Data-driven identification of predictive risk biomarkers for subgroups of osteoarthritis using interpretable machine learning

Rikke Linnemann Nielsen et al. Nat Commun.

Abstract

Osteoarthritis (OA) is increasing in prevalence and has a severe impact on patients' lives. However, our understanding of biomarkers driving OA risk remains limited. We developed a model predicting the five-year risk of OA diagnosis, integrating retrospective clinical, lifestyle and biomarker data from the UK Biobank (19,120 patients with OA, ROC-AUC: 0.72, 95%CI (0.71-0.73)). Higher age, BMI and prescription of non-steroidal anti-inflammatory drugs contributed most to increased OA risk prediction ahead of diagnosis. We identified 14 subgroups of OA risk profiles. These subgroups were validated in an independent set of patients evaluating the 11-year OA risk, with 88% of patients being uniquely assigned to one of the 14 subgroups. Individual OA risk profiles were characterised by personalised biomarkers. Omics integration demonstrated the predictive importance of key OA genes and pathways (e.g., GDF5 and TGF-β signalling) and OA-specific biomarkers (e.g., CRTAC1 and COL9A1). In summary, this work identifies opportunities for personalised OA prevention and insights into its underlying pathogenesis.

Conflict of interest statement

R.L.N., T.M., R.R.K., L.G.L., A.T.H.S., F.S.G., C.S., L.S., M.W., A.A.T., Z.M., and R.G. are employed by Novo Nordisk and own minor company stock. L.E. is currently employed by Nordic Bioscience A/S and was working on this manuscript while being employed by Novo Nordisk A/S. M.H. declares no competing interests.

Figures

Fig. 1. Study design and population characteristics.
A Overview of study design including example of date matching cases and controls (longitudinal patient data). For patients (cases) diagnosed with osteoarthritis (OA), the OA diagnosis date was identified and a data capture period of 5 years prior to diagnosis created. For individuals not diagnosed with OA (controls), a matched index date, equivalent to the OA diagnosis date for the case used for matching, was identified. For controls, a data capture period of 5 years prior to the index date was created. For both cases and controls, longitudinal electronic health record (EHR) data and data from the UK Biobank assessment centre were captured in the 5-year data capture period. B Upset plot of joints affected that could be extracted from the OA diagnosis codes. It was not possible to map all OA diagnoses to a specific joint (marked as unspecified OA). These groups were used for stratification in the prediction models. The set size represents the full number of patients that could be identified. When a patient with OA had multiple joints affected, both diagnoses were included in this joint mapping and hence the set size reflects the 19,120 OA cases identified. The intersection sizes represent the number of patients with at least the given set of diagnoses. C OA risk factors summarised across OA cases (OA) and matched non-OA controls (No OA) used for modelling in the study. P-values (uncorrected for multiple testing) were generated by two-sided Welch’s t-test for continuous features and chi-squared test of independence for sex (F = Female, M = Male).
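The date matching in panel A can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the example dates, and the choice to treat the window as start-inclusive and end-exclusive are all assumptions.

```python
from datetime import date
from typing import Tuple

CAPTURE_YEARS = 5  # length of the data capture period stated in the caption


def capture_window(index_date: date, years: int = CAPTURE_YEARS) -> Tuple[date, date]:
    """Return (start, end) of the capture period that ends at the index date."""
    try:
        start = index_date.replace(year=index_date.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        start = index_date.replace(year=index_date.year - years, day=28)
    return start, index_date


def in_window(event_date: date, window: Tuple[date, date]) -> bool:
    """True if a record falls in the window (assumed: start inclusive, end exclusive)."""
    start, end = window
    return start <= event_date < end


# Hypothetical case: the OA diagnosis date anchors the case's window.
case_diagnosis = date(2015, 6, 1)
case_window = capture_window(case_diagnosis)

# The matched control reuses the case's diagnosis date as its index date.
control_window = capture_window(case_diagnosis)

print(case_window)
print(in_window(date(2012, 3, 15), control_window))
```

Because the control inherits the case's date, both individuals contribute records from the same calendar period, which is the point of the matching described in the caption.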
Fig. 2. Identification of OA risk biomarkers using machine learning.
A Data integration strategy overview for multi-omics data. B Study design for predictive machine learning modelling setup for training and model validation. C Study design for identification of OA risk biomarkers at population, precision and personalised risk levels using interpretable AI approaches (SHAP).
Fig. 3. Clinical prediction model of osteoarthritis (OA) risk (Clin model).
A ROC curves of OA prediction models. B Precision-recall curves of OA prediction models. ‘OA all’ includes all cases of OA independent of specific joint subsets (Arm, Foot, Hip, Knee, Spine). An OA case can have multiple OA joints affected and a case is included per joint affected (meaning these can be repeated). C Performance metrics of Clin model on independent five-fold cross-validation test datasets. PPV positive predictive value, NPV negative predictive value. D Ranked feature importance of OA model by SHAP additive explanations for top 40 predictive features in the model. OA osteoarthritis, NSAIDs non-steroidal anti-inflammatory drugs, FM/FFM Fat mass/Fat-free mass.
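For readers less familiar with the metrics named in this caption, the sketch below computes ROC-AUC, PPV and NPV on a toy example. The labels, scores and the 0.5 decision threshold are made up for illustration and are not taken from the study; the pairwise AUC formula is the standard rank-based definition, not necessarily how the authors computed it.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the share of (case, control) pairs in which the case
    receives the higher score; tied pairs count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ppv_npv(y_true, scores, threshold=0.5):
    """Positive and negative predictive value at an assumed threshold."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return ppv, npv


# Toy data: 1 = OA case, 0 = control; scores are predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.6, 0.1]

print(roc_auc(y_true, scores))   # 12 of 15 case-control pairs ranked correctly
print(ppv_npv(y_true, scores))
```

ROC-AUC does not depend on a threshold, whereas PPV and NPV do, and both also shift with the case-to-control ratio of the dataset.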
Fig. 4. Osteoarthritis (OA) patient clustering and characterisation.
A Clusters obtained based on the SHAP values (Louvain clustering algorithm). B OA prediction probability per individual. C SHAP values used for clustering; the colour scheme allows visualisation of the importance of each feature in the prediction model as well as their impact on clustering. For (B) and (C), points were binned to increase readability; the average values within each of the bins are plotted. Clusters were obtained after dimensionality reduction of the SHAP data with a principal component analysis (10 PCs) and visualised after further dimensionality reduction with UMAP. D Average values of prediction probabilities, the top 6 most predictive features in the OA model, and circulating plasma biomarker levels for the most differentiated proteins between OA cases in each cluster. Categorical value encoding before averaging: NSAIDs (pre-1years): 1 and 0 represent patient taking or not taking the drug, respectively; walking paces were rescaled from 0 to 1 corresponding to a range from fastest to slowest; health ratings were rescaled from 0 to 1 corresponding to a range from healthiest to least healthy. Continuous value (including proteomics data) encoding before averaging: values were transformed into Z-scores for visualisation purposes. Plasma proteomics biomarkers have been identified using the OA cases in each cluster with available Olink data (N = 1723 total) and taking the top 2 most significantly up- or down-regulated proteins per cluster. Proteins can be a biomarker for multiple clusters, resulting in 19 biomarkers for 14 clusters; significant proteins are annotated with an asterisk (*) (adjusted p-value ≤ 0.05, logistic regression adjusted for sex; p-values Bonferroni-corrected per cluster. Full results and exact p-values provided in Supplementary Data 3, as well as cluster sample sizes). OA osteoarthritis, NSAIDs non-steroidal anti-inflammatory drugs.
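The encodings and the multiple-testing correction described for panel D can be illustrated with a short sketch. The category labels, the example values, and the use of the population standard deviation are assumptions made for this example; the caption does not specify them.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise continuous values (assumption: population standard deviation)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def rescale_ordinal(ordered_levels, value):
    """Map an ordered category onto [0, 1]: first level -> 0, last level -> 1."""
    return ordered_levels.index(value) / (len(ordered_levels) - 1)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical walking-pace labels, ordered from fastest to slowest.
PACE_LEVELS = ["brisk", "steady", "slow"]

print(rescale_ordinal(PACE_LEVELS, "steady"))
print(z_scores([1.0, 2.0, 3.0]))
# Hypothetical raw p-values for three proteins tested within one cluster.
print(bonferroni([0.01, 0.04, 0.5]))
```

Putting every feature on a comparable scale before averaging is what makes a single heatmap across clinical features and protein levels readable.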
Fig. 5. Cluster prediction metrics and defining rules in the osteoarthritis (OA) study population (time-window between data collection at the assessment centre and OA diagnosis (any OA): less than 5 years) and validation in independent hold-out population (time-window: 5 to 11 years).
Left (heatmap): each cluster is defined by prediction metrics, percentage of cases, cluster size (%). Middle (text): set of rules best defining each cluster, based on the model input values and generated by a decision tree model. Right (heatmap): percentages of cases and cluster size (%) in an independent population in which individuals were attributed to clusters according to their corresponding rules. OA osteoarthritis, NSAIDs non-steroidal anti-inflammatory drugs, Avg Pred Prob average prediction probability per cluster, PPV Positive predictive value.
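Attributing individuals in a hold-out population to clusters by rule can be sketched as below. The three rules, their thresholds, and the four example individuals are entirely hypothetical; the study's actual rules are those shown in the figure.

```python
def assign_clusters(person, rules):
    """Return the ids of every cluster whose rule the person satisfies."""
    return [cluster_id for cluster_id, rule in rules.items() if rule(person)]


def share_uniquely_assigned(people, rules):
    """Fraction of individuals that match exactly one cluster rule."""
    unique = sum(1 for p in people if len(assign_clusters(p, rules)) == 1)
    return unique / len(people)


# Hypothetical rules in the spirit of decision-tree paths (not from the paper).
RULES = {
    "A": lambda p: p["age"] >= 60 and p["bmi"] >= 30,
    "B": lambda p: p["age"] < 60 and p["nsaid"] == 1,
    "C": lambda p: p["bmi"] >= 30 and p["nsaid"] == 1,
}

people = [
    {"age": 65, "bmi": 32, "nsaid": 0},  # matches A only
    {"age": 50, "bmi": 25, "nsaid": 1},  # matches B only
    {"age": 66, "bmi": 31, "nsaid": 1},  # matches A and C
    {"age": 45, "bmi": 24, "nsaid": 0},  # matches no rule
]

print([assign_clusters(p, RULES) for p in people])
print(share_uniquely_assigned(people, RULES))
```

Rules extracted per cluster need not be mutually exclusive or exhaustive, which is why a share of uniquely assigned individuals is a natural quantity to report for a validation population.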
Fig. 6. Individual risk profiles.
For example patients from (A) cluster 1, (B) cluster 2, (C) cluster 12 and (D) cluster 9. Waterfall plots show the top 15 most important features for estimating the OA risk at the individual level. Yellow bars (positive SHAP value) indicate features that increased predicted OA risk; red bars (negative SHAP value) indicate features that decreased predicted OA risk. Numbers within bars represent the SHAP value for the feature; numbers on the y-axis represent the value for this feature. Both are specific to the individual shown and represent the magnitude of the effect of the risk biomarker on predicted OA risk. OA osteoarthritis, NSAIDs non-steroidal anti-inflammatory drugs.
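A waterfall plot rests on the additivity property of SHAP values: the model output for one individual equals a base value plus the sum of that individual's per-feature SHAP values. The sketch below shows this with made-up numbers; the feature names, the values, and the scale of the output (for example log-odds versus probability) are assumptions, since the scale depends on the model and explainer used.

```python
def top_features(shap_values, k=15):
    """Rank features by absolute SHAP value and keep the k largest."""
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]


def waterfall_steps(base_value, shap_values, k=15):
    """Return (feature, shap_value, running_total) rows, starting from the base value."""
    total = base_value
    rows = []
    for feature, value in top_features(shap_values, k):
        total += value
        rows.append((feature, value, total))
    return rows


# Hypothetical individual: a positive value raises the predicted risk score,
# a negative value lowers it.
base_value = 0.2
shap_values = {"age": 0.3, "bmi": -0.5, "sex": 0.1}

for row in waterfall_steps(base_value, shap_values):
    print(row)

# Additivity: base value plus all SHAP values reconstructs the model output.
model_output = base_value + sum(shap_values.values())
print(model_output)
```

When only the top 15 features are drawn, the last running total differs from the model output by the summed contribution of the features left out.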
Fig. 7. Multi-omics osteoarthritis (OA) risk models and biomarkers.
A Top ranked features from omics models in the context of multi-modal clinical features. The top five omics features that appeared important for prediction of OA based on the average marginal SHAP value ranking amongst top 40 predictive features. For ClinSNP, ClinGRS and ClinPath, several sensitivity checks were done for the genetic features including results marked with: (1): GRS obtained with proxy, (2): GRS obtained without proxy, and * Identified for models with genetic features corrected for population stratification (see Supplementary Table 3 for details). For the column Gene OA Risk Score, italic refers to the defined gene loci. B Ranked feature importance of ClinGRS (proxy) model by SHAP additive explanations for top 40 predictive features in the model. C Ranked feature importance of ClinPath (proxy) model by SHAP additive explanations for top 40 predictive features in the model. D Ranked feature importance of ClinMet model by SHAP additive explanations for top 40 predictive features in the model. E Ranked feature importance of ClinPro model by SHAP additive explanations for top 40 predictive features in the model. OA osteoarthritis, NSAIDs non-steroidal anti-inflammatory drugs, FM/FFM Fat mass/Fat-free mass.
Fig. 8. Osteoarthritis (OA) joint-specific models.
A Feature importance of the top 15 features ranked by SHAP additive explanations is provided for joint-specific models of people with diagnosis of OA in the arm, foot, spine, hip or knee. Detailed descriptions of the features listed are found in Supplementary Data 4. B ROC curves of knee-specific models trained using only patients diagnosed with knee OA, using BMI only (BMI), or similar features as in the Clin (5341 patients with knee OA, 19,252 controls), ClinGRS with proxy (5205 patients with knee OA, 18,779 controls), ClinPath with proxy (5205 patients with knee OA, 18,779 controls), ClinMet (1265 patients with knee OA, 4519 controls), or ClinPro (488 patients with knee OA, 1816 controls) models. C Ranked feature importance of ClinPro knee-specific model by SHAP additive explanations for top 40 predictive features in the model.