Building a cancer risk and survival prediction model based on social determinants of health combined with machine learning: A NHANES 1999 to 2018 retrospective cohort study

Shiqi Zhang et al. Medicine (Baltimore). 2025 Feb 7;104(6):e41370. doi: 10.1097/MD.0000000000041370.
Abstract

The occurrence and progression of cancer are a major focus of research worldwide, and the disease often follows a prolonged course. Researchers have also identified that social determinants of health (SDOH), such as employment status, family income and poverty ratio, food security, education level, access to healthcare services, health insurance, housing conditions, and marital status, are associated with the progression of many chronic diseases. However, there is a paucity of research examining the influence of SDOH on cancer incidence risk and on the survival of cancer survivors. The aim of this study was to use SDOH as the primary predictive factors, integrated with machine learning models, to forecast both cancer risk and prognostic survival. The study is grounded in SDOH data derived from the National Health and Nutrition Examination Survey (NHANES) dataset spanning 1999 to 2018. It employs adaptive boosting (AdaBoost), gradient boosting machine (GBM), random forest (RF), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), support vector machine (SVM), and logistic regression to develop models for predicting cancer risk and prognostic survival. The hyperparameters of these models, specifically the number of estimators (100-200), maximum tree depth (10), learning rate (0.01-0.2), and regularization parameters, were optimized through grid search and cross-validation, followed by performance evaluation. Shapley Additive exPlanations (SHAP) plots were generated to visualize the influence of each feature. RF was the best model for predicting cancer risk (area under the curve: 0.92; accuracy: 0.84). Age, non-Hispanic White ethnicity, sex, and housing status were the 4 most important features in the RF model. Age, gender, employment status, and household income/poverty ratio were the 4 most important features in the GBM model.
The predictive models developed in this study exhibited strong performance in estimating cancer incidence risk and survival time, identifying several factors that significantly influence both cancer incidence risk and survival, thereby providing new evidence for cancer management. Despite the promising findings, this study acknowledges certain limitations, including the omission of risk factors in the cancer survivor survival model and potential biases inherent in the National Health and Nutrition Examination Survey dataset. Future research is warranted to further validate the model using external datasets.
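The grid-search-with-cross-validation tuning described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the study's actual pipeline: the data are synthetic stand-ins for NHANES variables, the 5-fold split is assumed, and the grid values are drawn from the ranges the abstract quotes.

```python
# Sketch of the tuning setup (assumed): grid search with cross-validation
# over the hyperparameter ranges quoted in the abstract. Data are synthetic
# placeholders, not NHANES variables.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {
    "n_estimators": [100, 150, 200],  # abstract: number of estimators 100-200
    "max_depth": [10],                # abstract: maximum tree depth 10
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # models in the study were compared by AUC
    cv=5,               # fold count assumed; the abstract does not specify it
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

The same pattern applies to the boosting models, with `learning_rate` values in the quoted 0.01 to 0.2 range added to the grid.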


Conflict of interest statement

The authors have no funding or conflicts of interest to disclose.

Figures

Figure 1.
ROC curve of the cancer risk model. ROC curves of various machine learning models used to predict cancer risk. The models included are AdaBoost, gradient boosting, LightGBM, random forest, SVM, XGBoost, and logistic regression. The AUC values for each model are presented in the legend, with random forest and gradient boosting achieving the highest AUCs of 0.92, indicating superior performance in predicting cancer risk. The diagonal dashed line represents the performance of a random classifier (AUC = 0.5) for comparison. AdaBoost = adaptive boosting, AUC = area under the curve, LightGBM = light gradient boosting machine, ROC = receiver operating characteristic, SVM = support vector machine, XGBoost = extreme gradient boosting.
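The AUC values in this figure have a direct probabilistic reading: AUC is the probability that a randomly chosen positive case is ranked above a randomly chosen negative one, which is why 0.5 corresponds to the random classifier shown as the diagonal. A minimal pairwise computation on toy scores (not the study's data) makes this concrete:

```python
# AUC as the probability that a random positive outranks a random negative;
# ties count as half a win. Toy labels and scores, not the study's data.
def pairwise_auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```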
Figure 2.
DCA curve of the cancer risk model. DCA curves for various machine learning models predicting cancer risk. The models compared include AdaBoost, gradient boosting, LightGBM, random forest, SVM, XGBoost, and logistic regression. The y-axis represents the net benefit, and the x-axis indicates the threshold probability. The “Treat all” (solid line) and “Treat none” (dotted line) strategies are included for reference, representing the net benefit of treating all or no patients, respectively, across different threshold probabilities. A model that achieves a higher net benefit across a range of threshold probabilities demonstrates better clinical usefulness. In this analysis, gradient boosting and random forest models show relatively higher net benefits at various threshold probabilities, indicating their potential utility in clinical decision-making for cancer risk assessment. AdaBoost = adaptive boosting, DCA = decision curve analysis, LightGBM = light gradient boosting machine, SVM = support vector machine, XGBoost = extreme gradient boosting.
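The net benefit plotted on the y-axis of a decision curve follows the standard formula NB = TP/N − (FP/N) × p_t/(1 − p_t), where p_t is the threshold probability; the "Treat all" reference line is this quantity with every subject treated. A minimal sketch with toy counts (not figures from the study):

```python
# Standard decision-curve net benefit at threshold probability p_t.
# The p_t/(1 - p_t) factor weights false positives by the odds at which
# a patient would accept treatment. Toy counts, not the study's data.
def net_benefit(tp, fp, n, p_t):
    return tp / n - (fp / n) * p_t / (1 - p_t)

# "Treat all" counts every subject: TP = all positives, FP = all negatives.
print(net_benefit(tp=30, fp=20, n=100, p_t=0.2))  # 0.30 - 0.20 * 0.25 = 0.25
```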
Figure 3.
Plot of SHAP values for cancer risk. SHAP values for the various features in the cancer risk prediction model, showing the effect of each feature on the model output. The y-axis lists features such as age, race, and sex, along with socioeconomic indicators such as insurance, housing, and food security. The x-axis represents SHAP values, indicating the magnitude and direction of each feature's impact on the cancer risk prediction. Each point represents a single observation and is colored by feature value (pink high, blue low). Features with larger absolute SHAP values have a greater impact on model predictions; age stands out as a prominent factor, with higher age contributing to higher predicted cancer risk. The figure reveals significant effects of age, race (especially non-Hispanic White), sex, and socioeconomic factors such as housing situation on cancer risk. SHAP = Shapley Additive exPlanations.
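The SHAP attributions in this figure are Shapley values from cooperative game theory. As a minimal, model-agnostic sketch (not the paper's pipeline, whose explainer is not specified), the exact Shapley value of each feature can be computed by brute force over feature coalitions, substituting a baseline value for features outside the coalition:

```python
# Brute-force Shapley values for a single prediction: average each feature's
# marginal contribution over all coalitions of the other features, with
# absent features replaced by a baseline. Exponential in feature count, so
# illustrative only; tree models use faster specialized algorithms.
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    n = len(x)
    def value(coalition):  # predict with features outside the coalition masked
        return f([x[j] if j in coalition else baseline[j] for j in range(n)])
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (value(set(S) | {i}) - value(set(S)))
        phis.append(phi)
    return phis

# For an additive model the attribution is exact: 2*(1-0)=2 and 3*(1-0)=3.
print(shapley_values(lambda z: 2 * z[0] + 3 * z[1], [1.0, 1.0], [0.0, 0.0]))
```

The attributions sum to the difference between the prediction and the baseline prediction, which is the "efficiency" property that makes SHAP plots additively interpretable.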
Figure 4.
ROC curves for predictive models at different time points. ROC curves for multiple predictive models were evaluated at 3 different time points: 43, 87, and 174 mo. The curves illustrate the models’ sensitivity and specificity in predicting outcomes over time. (A) ROC curves at 43 mo for the LightGBM (red), random survival forest (yellow), SVM (green), XGBoost (blue), and GBM (purple) models. The curves demonstrate the performance of each model, highlighting LightGBM’s sensitivity at lower specificity. (B) ROC curves at 87 mo, maintaining the same color coding for each model. (C) ROC curves at 174 mo, again using the same color scheme. GBM = gradient boosting machine, LightGBM = light gradient boosting machine, ROC = receiver operating characteristic, SVM = support vector machine, XGBoost = extreme gradient boosting.
Figure 5.
DCA for different models and time periods. DCA of various prediction models applied to survival data at different time points. Each figure illustrates the net benefit of the respective model compared to the all-or-nothing strategy over the threshold range. (A–C) DCA results for GBM at 43, 87, and 174 mo, respectively. The net benefit is represented in purple, showing the performance of the GBM model over time. (D–F) DCA results for LightGBM at 43, 87, and 174 mo, respectively. The net benefit is indicated in red, with the model’s efficacy evaluated across different risk thresholds. (G–I) DCA results for SVM at 43, 87, and 174 mo, respectively. The net benefit is depicted in blue, illustrating the model’s performance in predicting outcomes over time. DCA = decision curve analysis, GBM = gradient boosting machine, LightGBM = light gradient boosting machine, SVM = support vector machine.
Figure 6.
DCA for XGBoost and random survival forest models. DCA results for XGBoost and random survival forest models at various time intervals, evaluating their clinical utility in predicting survival outcomes. (A–C) DCA results for the XGBoost model at 43, 87, and 174 mo, respectively. The net benefit is shown in green, indicating the model’s performance across different high-risk thresholds. The “None” strategy is represented as a reference line for comparison. (D–F) DCA results for the random survival forest model at 43, 87, and 174 mo, respectively. The net benefit is depicted in yellow, demonstrating how the model performs over time against the “None” strategy and an “All” strategy for reference. DCA = decision curve analysis, XGBoost = extreme gradient boosting.
Figure 7.
SHAP values for features impacting cancer prognosis. SHAP values, which indicate the impact of different features on the model’s output regarding cancer prognosis. (A) The distribution of SHAP values for the feature “age.” The color gradient from blue to red signifies the feature value, where blue represents lower age values and red indicates higher age values. The x-axis represents the SHAP values, showing the extent to which age influences the model’s predictions, with higher values correlating with a more significant positive impact on cancer prognosis. (B) SHAP values for multiple features impacting cancer prognosis. BMI = body mass index, PIR = poverty-income ratio, SHAP = Shapley Additive exPlanations.
Figure 8.
Radar chart of social determinants of health by race. The radar chart compares the social determinants of health between Whites and non-Whites across various factors. Each axis represents a specific determinant, allowing for a visual assessment of disparities between the two groups. BMI = body mass index.
Figure 9.
Dose–response curve. The curve reflects the cumulative effect of adverse SDOH on cancer risk. SDOH = social determinants of health.

