BMC Med Inform Decis Mak. 2025 Feb 17;25(1):83.
doi: 10.1186/s12911-025-02903-1.

Prediction of depressive disorder using machine learning approaches: findings from the NHANES

Thien Vu et al. BMC Med Inform Decis Mak.

Abstract

Background: Depressive disorder, particularly major depressive disorder (MDD), significantly impacts individuals and society. Traditional analysis methods often suffer from subjectivity and may not capture complex, non-linear relationships between risk factors. Machine learning (ML) offers a data-driven approach to predicting and diagnosing depression more accurately by analyzing large and complex datasets.

Methods: This study utilized data from the National Health and Nutrition Examination Survey (NHANES) 2013-2014 to predict depression using six supervised ML models: Logistic Regression, Random Forest, Naive Bayes, Support Vector Machine (SVM), Extreme Gradient Boost (XGBoost), and Light Gradient Boosting Machine (LightGBM). Depression was assessed using the Patient Health Questionnaire (PHQ-9), with a score of 10 or higher indicating moderate to severe depression. The dataset was split into training and testing sets (80% and 20%, respectively), and model performance was evaluated using accuracy, sensitivity, specificity, precision, AUC, and F1 score. SHAP (SHapley Additive exPlanations) values were used to identify the critical risk factors and interpret the contributions of each feature to the prediction.
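The labeling and evaluation steps described in the Methods can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' code: the function names (`phq9_label`, `split_80_20`, `classification_metrics`, `roc_auc`) and the random seed are assumptions made for the example, and the abstract does not say whether the split was stratified or which software was used.

```python
import random

PHQ9_CUTOFF = 10  # from the Methods: a score of 10 or higher indicates moderate to severe depression


def phq9_label(total_score):
    """Return 1 (depressed) for a PHQ-9 total of 10 or higher, else 0."""
    return 1 if total_score >= PHQ9_CUTOFF else 0


def split_80_20(rows, seed=0):
    """Shuffle a copy of rows and split it into 80% training and 20% testing.

    Assumption: a plain random split; the abstract does not specify stratification.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def classification_metrics(y_true, y_pred):
    """Compute the threshold-based metrics named in the Methods from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(a, b):
        return a / b if b else 0.0

    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)  # also called recall
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive scores above a random negative.

    Ties count as one half. Quadratic in sample size, which is fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice the predicted labels and scores would come from a fitted model such as XGBoost evaluated on the 20% test set; the helpers above only show how each reported metric is defined.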

Results: XGBoost was identified as the best-performing model, achieving the highest accuracy, sensitivity, specificity, precision, AUC, and F1 score. SHAP analysis highlighted the most significant predictors of depression: the ratio of family income to poverty (PIR), sex, hypertension, serum cotinine and hydroxycotinine, BMI, education level, glucose levels, age, marital status, and renal function (eGFR).

Conclusion: We developed ML models to predict depression and utilized SHAP for interpretation. This approach identifies key factors associated with depression, encompassing socioeconomic, demographic, and health-related aspects.

Keywords: Depression; Depressive disorder; Light Gradient Boosting Machine (LightGBM); Logistic regression; Naïve Bayes; Random forest; SHapley Additive exPlanations (SHAP); Supervised machine learning; Support Vector Machine (SVM); eXtreme Gradient Boost (XGBoost).

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Ethics approval for this study was granted by the National Centre for Health Statistics Research Ethics Review Board (Protocol # 2013-14). Since this study involves secondary data analysis, the original informed consent provided during primary data collection included permission for secondary use, eliminating the need for additional participant consent. Participants’ privacy was protected by anonymizing or de-identifying the data to prevent identification. Further details on NHANES ethics approval are available on the CDC’s official website: https://www.cdc.gov/nchs/nhanes/about/erb.html?CDC_AAref_Val=https://www.cdc.gov/nchs/nhanes/irba98.htm . Consent for publication: Not applicable. Relevant guidelines and regulations: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
The contribution levels of all variables to depression based on SHAP values. The SHAP bar plot (blue) illustrates the global importance of each feature in the model. It gives an overview of each feature’s impact on the model’s output by displaying the mean absolute SHAP value for that feature. Each bar represents one feature (variable), and the length of the bar indicates the extent of the feature’s contribution to the prediction of depression.
Fig. 2
The heat plot of SHAP values. The plot reveals the relationships between each feature (variable) and depression, showing how the value of a specific feature relates to its impact on the prediction. Each data point corresponds to a specific participant and that participant’s Shapley value for a specific feature. The position of a data point is determined by the Shapley value, shown on the x-axis, and the feature’s prominence, shown on the y-axis.
Fig. 3
The impact of categorical variables on depression
Fig. 4
The impact of numerical variables on depression
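
The global importance summarized in Fig. 1, the mean absolute SHAP value per feature, can be illustrated with a short sketch. The function name `mean_abs_shap` and the toy SHAP matrix are hypothetical and chosen only for illustration; they are not values from the study, and in a real analysis the per-participant SHAP values would come from an explainer applied to the fitted model.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value.

    shap_rows: one list of SHAP values per participant, aligned with feature_names.
    Returns (feature, mean |SHAP|) pairs sorted from most to least important.
    """
    if not shap_rows:
        raise ValueError("need at least one row of SHAP values")
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        if len(row) != len(feature_names):
            raise ValueError("each row must have one SHAP value per feature")
        for j, value in enumerate(row):
            totals[j] += abs(value)  # the sign is dropped: only magnitude matters here
    n = len(shap_rows)
    importance = [(name, totals[j] / n) for j, name in enumerate(feature_names)]
    return sorted(importance, key=lambda item: item[1], reverse=True)


# Toy, made-up SHAP values for three participants (not data from the study).
features = ["PIR", "sex", "BMI"]
toy_shap = [
    [-0.4, 0.1, 0.0],
    [0.2, -0.3, 0.1],
    [-0.6, 0.2, -0.1],
]
ranking = mean_abs_shap(toy_shap, features)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the absolute value is taken before averaging, a feature that pushes some predictions up and others down still ranks as important; the direction of each effect is what the per-participant plot in Fig. 2 conveys.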
