Front Physiol. 2025 Jun 25:16:1607276.
doi: 10.3389/fphys.2025.1607276. eCollection 2025.

Machine learning-based prediction of knee pain risk using lipid metabolism biomarkers: a prospective cohort study from CHARLS

Biao Guo et al. Front Physiol. 2025.

Abstract

Introduction: Knee pain significantly impairs health and quality of life among middle-aged and older adults. However, the predictive utility of lipid metabolism biomarkers for knee pain risk remains inadequately explored.

Methods: This study used data from the China Health and Retirement Longitudinal Study (CHARLS, 2011-2013) to investigate the association between lipid-related metabolic indicators and the risk of knee pain. Multiple lipid biomarkers and composite indices, including the lipid accumulation product (LAP), the triglyceride-glucose (TyG) index, and TyG-BMI, were incorporated. Five machine learning models were developed and evaluated for predictive performance. Model interpretation was conducted using SHAP (SHapley Additive exPlanations) to identify the most influential predictors.
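The abstract does not state how the composite indices were computed. As a minimal sketch, the following uses the definitions commonly found in the literature: TyG = ln(TG [mg/dL] × fasting glucose [mg/dL] / 2), TyG-BMI = TyG × BMI, and LAP = (waist circumference [cm] − 65) × TG [mmol/L] for men or (waist circumference [cm] − 58) × TG [mmol/L] for women. The function names, argument names, and unit choices are illustrative assumptions, not the authors' code.

```python
import math


def tyg_index(tg_mg_dl: float, glucose_mg_dl: float) -> float:
    """Triglyceride-glucose index: ln(TG [mg/dL] * fasting glucose [mg/dL] / 2)."""
    if tg_mg_dl <= 0 or glucose_mg_dl <= 0:
        raise ValueError("triglycerides and glucose must be positive")
    return math.log(tg_mg_dl * glucose_mg_dl / 2.0)


def tyg_bmi(tg_mg_dl: float, glucose_mg_dl: float, bmi: float) -> float:
    """TyG-BMI: the TyG index multiplied by body mass index (kg/m^2)."""
    return tyg_index(tg_mg_dl, glucose_mg_dl) * bmi


def lap(waist_cm: float, tg_mmol_l: float, sex: str) -> float:
    """Lipid accumulation product, with the commonly used sex-specific waist offsets.

    Assumed definition: (WC - 65) * TG for men, (WC - 58) * TG for women,
    with waist circumference in cm and triglycerides in mmol/L.
    """
    offsets = {"male": 65.0, "female": 58.0}
    if sex not in offsets:
        raise ValueError("sex must be 'male' or 'female'")
    return (waist_cm - offsets[sex]) * tg_mmol_l


if __name__ == "__main__":
    # Hypothetical participant values, for illustration only.
    print(round(tyg_index(150.0, 100.0), 4))
    print(round(tyg_bmi(150.0, 100.0, 24.0), 2))
    print(lap(90.0, 1.5, "male"))
```

Note that TyG takes triglycerides in mg/dL while LAP takes them in mmol/L, so a real pipeline would need an explicit unit conversion between the two.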

Results: A higher prevalence of knee pain was observed in high-altitude, cold regions such as Qinghai and Sichuan provinces. Composite metabolic indices (LAP, TyG, and TyG-BMI) exhibited stronger predictive power than traditional single lipid markers. Among the models, the Stacked Ensemble algorithm achieved the best performance, with an AUC of 0.85 and a Brier score of 0.13. SHAP analysis highlighted LAP and TyG-related indices as the top contributors to prediction outcomes.
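For readers unfamiliar with the two reported metrics, the sketch below shows how they are conventionally defined. AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count one half), and the Brier score is the mean squared difference between the predicted probability and the 0/1 outcome. The function names and the toy data are assumptions for illustration; they are not taken from the study.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes (lower is better)."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and predicted probabilities, not data from the study.
    labels = [0, 0, 1, 1]
    probs = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, probs))      # 3 of 4 positive/negative pairs are ordered correctly
    print(brier_score(labels, probs))
```

The pairwise AUC loop is quadratic in sample size and is meant only to make the definition concrete; a rank-based implementation would be preferable on a cohort-sized dataset.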

Discussion: These findings emphasize the importance of lipid metabolism indicators in the early identification of knee pain risk. The integration of interpretable machine learning approaches and composite metabolic indices offers a promising strategy for personalized prevention in aging populations.

Keywords: CHARLS; knee pain; lipid accumulation product; machine learning; metabolic biomarkers.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Flow chart of data processing.
FIGURE 2
Provincial distribution of knee pain case counts among middle-aged and older adults in China. The choropleth map illustrates the number of reported knee pain cases by province based on the CHARLS dataset. Darker shades indicate a higher concentration of knee pain cases, with Sichuan, Hunan, and Henan provinces showing the highest case counts.
FIGURE 3
Geographic distribution of knee pain prevalence across Chinese provinces. This map shows the provincial-level prevalence of knee pain among middle-aged and older adults in China, based on the CHARLS dataset. Darker blue areas represent provinces with higher prevalence rates, with Sichuan and adjacent regions showing the highest burden.
FIGURE 4
Comparison of normalized variable importance across different machine learning models. The bar chart presents the normalized relative importance of each lipid-related biomarker in the DNN, GBM, GLM, and Random Forest models. LAP consistently ranked as the most important feature across all models, followed by LDL, CTI, TyG, and HDL, with some variation in the contribution patterns among models. Feature importance values were scaled to the most influential variable within each model.
FIGURE 5
ROC curves for different machine learning models. Receiver Operating Characteristic (ROC) curves for five machine learning models: Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Random Forest (RF), Deep Neural Network (DNN), and Stacked Ensemble. The AUC values of the models were: GLM (0.536), DNN (0.536), GBM (0.835), RF (0.826), and Ensemble (0.850), with the Ensemble model achieving the highest discriminative performance.
FIGURE 6
Calibration curves and Brier scores of different machine learning models. Calibration curves comparing predicted probabilities with observed proportions for five machine learning models: DL, DRF, GBM, GLM, and Stacked Ensemble. The dashed diagonal line represents perfect calibration. The Stacked Ensemble model demonstrated the best overall calibration (Brier score = 0.13), followed by DRF and GBM (both Brier score = 0.15), GLM (0.18), and DL (0.21).
FIGURE 7
Confusion matrix heatmaps for five machine learning models in the test set. Each panel represents the classification performance of a model: GLM (generalized linear model), GBM (gradient boosting machine), RF (random forest), DL (deep learning), and Ensemble (stacked ensemble). The x-axis shows predicted labels, and the y-axis shows actual labels. Cell values represent the number of observations classified into each category. Darker shades indicate higher counts.
FIGURE 8
SHAP feature importance and beeswarm plots across models. Visualization of SHAP feature importance and beeswarm plots for five machine learning models (GLM, GBM, RF, DNN, and Stacked Ensemble). The top panel shows SHAP beeswarm plots, revealing the distribution and directionality of feature contributions. The bottom panel shows mean absolute SHAP values for each variable, indicating their global importance. Across all models, LAP, TyG, and TyG-BMI consistently rank among the top predictors of knee pain.
