[Preprint]. 2025 Dec 21:2025.10.13.25337771.
doi: 10.1101/2025.10.13.25337771.

Machine learning-optimized perinatal depression screening: Maximum impact, minimal burden

Eric Hurwitz et al. medRxiv.

Abstract

Introduction: Perinatal depression affects up to 30% of pregnant and postpartum women, a prevalence that has increased since the COVID-19 pandemic, making rapid identification of affected women a high clinical priority. While screening tools like the Edinburgh Postnatal Depression Scale (EPDS) are widely used, brevity is important in busy clinical practice to reduce administration time and patient burden. Current methods for shortening assessments rely on traditional psychometric approaches rather than machine learning (ML) methods that could optimize predictive accuracy.

Methods: We developed an ML framework using National COVID Cohort Collaborative (N3C) data to predict full 10-item EPDS scores from shortened question subsets (n=22,924). We evaluated all 2- to 5-item combinations using linear regression, validating performance across multiple cohorts, including postpartum women (n=7,750) and an external non-N3C pregnancy population (n=1,217). For additional validation, we applied our approach to the PHQ-9 (n=398,606) to test generalizability. Binary classification models using the clinical threshold (≥13) determined EPDS screening accuracy. Decision curve analysis was performed to assess the clinical utility of our ML method.
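As a rough illustration of the subset-search idea described above (a sketch, not the authors' code), the snippet below enumerates all 2-item EPDS combinations, fits a linear regression for each to predict the remaining score (as in Figure 1), and ranks combinations by cross-validated R2. The DataFrame `epds` and its columns q1-q10 are hypothetical stand-ins for the item-level data.

```python
# Sketch of the 2-item subset search, assuming a pandas DataFrame `epds`
# with hypothetical columns q1..q10 holding item-level EPDS responses.
from itertools import combinations

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

items = [f"q{i}" for i in range(1, 11)]
epds["total"] = epds[items].sum(axis=1)      # Step 1: total score from item responses

results = []
for pair in combinations(items, 2):          # all 45 two-item subsets
    X = epds[list(pair)]
    y = epds["total"] - X.sum(axis=1)        # Step 2: remaining score as the target
    r2 = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=5).mean()
    results.append((pair, r2))

# Step 3: identify the two-item combination that best predicts the remaining score.
best_pair, best_r2 = max(results, key=lambda t: t[1])
```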

Results: The optimal 2-question EPDS combinations, Q4+Q8 (anxiety/sadness) and Q5+Q8 (scared/sadness), both achieved R2=0.70. Binary classification demonstrated strong performance (sensitivity=0.68-0.72, specificity=0.98-0.99). The framework generalized across postpartum subsets, external pregnancy cohorts, and PHQ-9 validation (R2=0.64-0.73). Adding covariates did not improve performance. Decision curve analysis showed our ML approach provided greater net benefit (0.01-0.03) than traditional additive scoring.
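The screening-accuracy figures above come from thresholding predicted totals at the clinical cutoff. A minimal sketch of how sensitivity and specificity at EPDS ≥13 could be computed from a fitted 2-item model is shown below; `model`, `X_test`, and `totals_test` are hypothetical held-out objects, not artifacts from the paper.

```python
# Sketch of binary screening evaluation at the EPDS cutoff of 13.
from sklearn.metrics import confusion_matrix

pred_total = model.predict(X_test) + X_test.sum(axis=1)  # predicted remaining + observed items
y_pred = (pred_total >= 13).astype(int)                   # predicted positive screen
y_true = (totals_test >= 13).astype(int)                  # observed positive screen

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```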

Conclusion/implications: Our ML framework suggests that a two-question EPDS reduces assessment burden while maintaining predictive accuracy comparable to the full 10-item EPDS. With ~3.6 million annual U.S. births, this approach could identify additional positive perinatal depression screens, enhancing screening implementation across clinical settings.

Figures

Figure 1: A schematic displaying our ML method with 2-item subsets.
Our three-step machine learning (ML) framework identifies the most predictive question subsets within mental health assessments (e.g., the EPDS). First, we gather item-level responses to a mental health assessment and calculate the total score by summing the individual responses (Step 1). Next, we develop ML models using each combination of two items as inputs to predict the remaining score (total score minus the two input items) (Step 2). Finally, we compare performance metrics across all models to identify which two-item combination most accurately predicts the remaining score, effectively determining the optimal brief assessment version (Step 3).
Figure 2: EPDS data availability in N3C over time.
The total number of EPDS assessments in N3C by month from January 2018 to October 2024. Data points demonstrate an upward trend over time and are colored according to the number of contributing data partner sites. The vertical line at March 2020 represents the beginning of the COVID-19 pandemic.
Figure 3: Increased individual question responses are associated with increased EPDS total scores.
The average EPDS total score based on responses to individual questions (Q1-Q10). The left panel shows overall scores by response to all questions, while the right panel breaks down scores by specific question. Four response categories are shown: “As much as I always could” (pink), “Not quite so much now” (blue), “Definitely not so much now” (green), and “Not at all” (purple). Data are expressed as mean and standard deviation. Statistical comparisons between each pair of response categories demonstrated significant differences in total EPDS scores across all comparisons (P<0.001). The horizontal dashed line represents a clinical cutoff score of 13 for positive depression screening.
Figure 4: Q4+Q8 and Q5+Q8 models displayed the highest performance for predicting the remaining EPDS total score.
Model performance metrics across the top five models (represented by colored bars) using 2-item EPDS question subsets to predict the remaining EPDS total score in the entire cohort of women who took the EPDS in N3C. The top panel displays R2 values (coefficient of determination), the middle panel shows RMSE, and the bottom panel shows MAE. The dot-and-line diagram at the bottom indicates which variables (Q1-Q10) were included in each model configuration, with blue dots representing included variables connected by vertical lines.
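For reference, the three metrics shown in these panels can be computed for any candidate model along the lines of the sketch below, where `y_true` and `y_pred` are hypothetical held-out remaining scores and model predictions (assumed names, not from the paper).

```python
# Sketch of the metrics reported in the figure panels.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

r2 = r2_score(y_true, y_pred)                        # coefficient of determination
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean squared error
mae = mean_absolute_error(y_true, y_pred)            # mean absolute error
```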
Figure 5: Q4:anxious + Q8:sad and Q5:scared + Q8:sad models displayed the highest performance for predicting the remaining EPDS total score in a postpartum cohort.
Model performance metrics across the top five models (represented by colored bars) using 2-item EPDS question subsets to predict the remaining EPDS total score among women in the postpartum period in N3C (left) and the Wash U cohort (right). The top panel displays R2 values (coefficient of determination), the middle panel shows RMSE, and the bottom panel shows MAE. The dot-and-line diagram at the bottom indicates which variables (Q1-Q10) were included in each model configuration, with blue dots representing included variables connected by vertical lines.
Figure 6: Adding more questions improved model performance in predicting the remaining total score of psychometric assessments.
* = P < 0.05, ** = P < 0.01, *** = P < 0.001 after Bonferroni correction.
Box plots showing model performance metrics for predicting the total scores using shortened versions of the EPDS (left) and PHQ-9 (right) questionnaires (represented by colored bars) from women who took the EPDS and all individuals who took the PHQ-9 in N3C. Each gray dot represents an individual ML model testing a different combination of 2, 3, 4, or 5 questions to predict the total score for the EPDS and PHQ-9. The top panel displays R2 values (coefficient of determination), the middle panel shows RMSE, and the bottom panel shows MAE. Overall, both EPDS and PHQ-9 assessments demonstrated improved model performance with additional questions, evidenced by higher R2 values and lower RMSE and MAE.
Figure 7: Adding covariates to ML models did not improve EPDS total score prediction performance.
ns = not significant.
Box plots showing model performance metrics (R2, RMSE, MAE) across all 2-question EPDS combinations in the N3C postpartum cohort. Models were evaluated with: no covariates (None), demographics only, pregnancy outcomes only, mental health history only, and all covariates combined (All 3). Each gray dot represents an individual 2-question model. Results demonstrate that predictive performance was statistically equivalent across all covariate combinations, with no significant pairwise differences observed (all P>0.05).
Figure 8: Our ML approach demonstrated superior clinical utility across question combinations for PPD screening.
A. Performance comparison of EPDS question combinations using different screening methods. ML-based approaches achieved higher F1 scores, kappa values, and precision compared to the other methods. B. Decision curve analysis evaluating clinical utility of different EPDS question combinations and screening approaches. Our ML method demonstrated superior net benefit over traditional approaches, with the greatest advantage observed at lower threshold probabilities. Decision curves for individual methods are also presented in Figure S5 for improved visibility.
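Decision curve analysis compares screening methods by their net benefit across a range of threshold probabilities. Below is a minimal sketch of the standard net-benefit calculation (not the authors' implementation); `y_true` and `screen_pos` are hypothetical binary arrays of observed and predicted positive screens for one method.

```python
# Sketch of the net-benefit calculation underlying a decision curve.
import numpy as np

def net_benefit(y_true, screen_pos, pt):
    """Net benefit = TP/N - FP/N * (pt / (1 - pt)) at threshold probability pt."""
    y_true = np.asarray(y_true, dtype=bool)
    screen_pos = np.asarray(screen_pos, dtype=bool)
    n = y_true.size
    tp = np.sum(screen_pos & y_true)    # true positives
    fp = np.sum(screen_pos & ~y_true)   # false positives
    return tp / n - fp / n * (pt / (1 - pt))

# One decision curve: net benefit evaluated over a grid of threshold probabilities.
curve = [net_benefit(y_true, screen_pos, pt) for pt in np.linspace(0.01, 0.50, 50)]
```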

References

    1. Kroenke K., Spitzer R. L. & Williams J. B. The PHQ-9: validity of a brief depression severity measure. J. Gen. Intern. Med. 16, 606–613 (2001). - PMC - PubMed
    1. Berwick D. M. et al. Performance of a five-item mental health screening test. Med. Care 29, 169–176 (1991). - PubMed
    1. Gonzalez O. Psychometric and machine learning approaches to reduce the length of scales. Multivariate Behav. Res. 56, 903–919 (2021). - PMC - PubMed
    1. Löwe B., Kroenke K. & Gräfe K. Detecting and monitoring depression with a two-item questionnaire (PHQ-2). J. Psychosom. Res. 58, 163–171 (2005). - PubMed
    1. Kroenke K., Spitzer R. L., Williams J. B. W., Monahan P. O. & Löwe B. Anxiety disorders in primary care: prevalence, impairment, comorbidity, and detection. Ann. Intern. Med. 146, 317–325 (2007). - PubMed

Publication types

LinkOut - more resources