This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2026 Jan 30:2026.01.28.26344858.

doi: 10.64898/2026.01.28.26344858.

Developing and externally validating machine learning models to forecast short-term risk of ventilator-associated pneumonia

Alec K Peltekian¹, Wan-Ting Liao², Vijeeth Guggilla³, Nikolay Markov², Karolina Senkow², Zewei Liao⁴, Marjorie Kang², Luke Rasmussen³, Elsa Tavernier⁵,⁶, Stephan Ehrmann⁷,⁸, Rebecca K Clepp², Thomas Stoeger²,⁹,¹⁰,¹¹, Theresa Walunas³, Alok Choudhary¹,¹², Alexander V Misharin²,¹⁰, Benjamin D Singer²,¹⁰, Gr Scott Budinger²,¹⁰, Richard G Wunderink²,¹⁰, Catherine A Gao², Ankit Agrawal¹,²; NU SCRIPT Study Investigators

Affiliations

¹ Department of Computer Science, Northwestern University McCormick School of Engineering and Applied Science, Chicago, IL, USA.
² Division of Pulmonary and Critical Care, Northwestern University Feinberg School of Medicine, Chicago, IL, USA.
³ Division of Biostatistics and Informatics, Northwestern University Feinberg School of Medicine, Chicago, IL, USA.
⁴ Division of Pulmonary and Critical Care, Department of Medicine, University of Chicago, Chicago, IL, USA.
⁵ CIC, INSERM 1415, CHRU Tours, Tours, France.
⁶ Methods in Patients-Centered Outcomes and Health Research SPHERE, INSERM 1246, Tours and Nantes, France.
⁷ Médecine Intensive Réanimation, INSERM CIC1415, CRICS-TriggerSEP F-CRIN research network, CHRU de Tours, Tours, France.
⁸ Centre d'étude des pathologies respiratoires, INSERM U1100, Université de Tours, Tours, France.
⁹ The Potocsnak Longevity Institute, Northwestern University, Chicago, IL, USA.
¹⁰ Simpson Querrey Lung Institute for Translational Science, Northwestern University Feinberg School of Medicine, Chicago, IL, USA.
¹¹ NSF-Simons National Institute for Theory and Mathematics in Biology, Chicago, IL, USA.
¹² Department of Electrical and Computer Engineering, Northwestern University McCormick School of Engineering and Applied Science, Chicago, IL, USA.

PMID: 41646725
PMCID: PMC12870606
DOI: 10.64898/2026.01.28.26344858

Alec K Peltekian et al. medRxiv. 2026.

Abstract

Purpose: Ventilator-associated pneumonia (VAP) remains one of the most serious hospital-acquired infections in the intensive care unit (ICU), with high morbidity and mortality. Early identification of patients at risk for developing VAP could enable timely diagnostics and intervention. However, current clinical tools are limited in their ability to detect early physiologic signals preceding VAP onset. We aimed to build supervised machine learning models to predict short-term onset of VAP.

Methods: We analyzed electronic health record data from a prospective observational cohort of ICU patients, where VAP was adjudicated using a standardized published protocol by a panel of critical care physicians. Clinical features (including vital signs, ventilator settings, laboratory values, and support devices) were extracted for each patient-ICU-day. We explored unsupervised clustering to characterize feature dynamics associated with VAP onset. We built multiple machine learning models across different prediction windows (3, 5, and 7 days before VAP). We examined model performance in two external cohorts: MIMIC-IV and a secondary analysis of the AMIKINHAL trial. Results were evaluated with discrimination metrics such as AUROC.
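As an illustration of how prediction windows can turn patient-ICU-days into training labels, the sketch below marks each day as positive when it falls within a given number of days before VAP onset. The exact labeling rule used in the study is not stated in this abstract, so the rule here (positive when the day is 1 to `window` days before diagnosis), the function name `label_patient_days`, and the toy ICU stay are all assumptions for illustration.

```python
from typing import List, Optional


def label_patient_days(icu_days: List[int], vap_day: Optional[int], window: int) -> List[int]:
    """Return 1 for ICU days within `window` days before VAP onset, else 0.

    Assumed rule (not taken from the paper): day d is positive when
    1 <= vap_day - d <= window. The diagnosis day itself and later days
    are labeled 0 in this sketch.
    """
    if vap_day is None:  # patient never developed VAP
        return [0 for _ in icu_days]
    return [1 if 1 <= vap_day - d <= window else 0 for d in icu_days]


# Hypothetical patient: ICU days 1..10, VAP diagnosed on day 9.
days = list(range(1, 11))
labels_by_window = {w: label_patient_days(days, vap_day=9, window=w) for w in (3, 5, 7)}

for w, labels in labels_by_window.items():
    print(w, labels)
```

Wider windows label more days as positive, which trades earlier warning against a less specific target.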

Results: The internal cohort included 507 patients with BAL-confirmed diagnoses: 261 developed VAP and 246 did not have VAP. Visualization using clustering identified distinct physiologic states enriched for VAP-labeled days. The best-performing model achieved an AUROC of 0.866 in predicting VAP up to seven days before clinical diagnosis. Temporal model probability trajectories showed rising model confidence in the days leading up to VAP. On external validation in MIMIC-IV, the best model achieved an AUROC of 0.817 for forecasting VAP within five days. There was low feature overlap with the AMIKINHAL trial data, leading to poor model performance. Feature analysis revealed that platelet count, positive end-expiratory pressure (PEEP), ventilator duration, and inflammatory markers were key drivers of model predictions.
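For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen positive patient-day receives a higher model score than a randomly chosen negative one, with ties counted as half. The following stdlib-only sketch computes it by direct pairwise comparison; the function name `auroc` and the toy labels and scores are illustrative assumptions, not study data or the authors' code.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: 2 negative and 2 positive patient-days.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

The pairwise loop is quadratic in the number of samples; it is meant to make the definition concrete, not to replace a library implementation.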

Conclusions: Machine learning models trained on routinely collected ICU data with careful labeling can anticipate VAP onset up to a week in advance with strong predictive performance. Performance generalized to data from an entirely different hospital system despite differences in practice and labeling patterns, but degraded when feature overlap was poor. Future work should focus on real-time prospective evaluation.

Keywords: machine learning; mechanical ventilation; ventilator-associated pneumonia.

Conflict of interest statement

B.D.S. holds United States Patent No. US 10,905,706 B2, Compositions and Methods to Accelerate Resolution of Acute Lung Inflammation, and serves on the Scientific Advisory Board of Zoe Biosciences. S.E. reports relationships with Aerogen Ltd., Fisher & Paykel, and JIB. All other authors declare no competing interests.

Figures

**Figure 1. Overview of the machine learning pipeline for ventilator-associated pneumonia (VAP) onset prediction.**
(A) Cohort derivation and labeling from BAL-confirmed VAP or non-VAP cases with 3-, 5-, and 7-day windows. (B) Data preprocessing with feature extraction, normalization, and missingness handling. (C) Model development using logistic regression, random forest, and XGBoost with Optuna tuning. (D) Temporal splitting and evaluation. Patients are split temporally into training and test sets (90/10; test = most recent patients). Within the training set, 10-fold patient-grouped cross-validation (GroupKFold) prevents patient leakage. Each fold optimizes hyperparameters (25 Optuna trials) and evaluates on that fold's validation data. The hyperparameter set achieving the highest mean validation AUC across all folds is selected, and a final model is trained on all training data using these hyperparameters and evaluated on the temporal test set.
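The splitting scheme in panel (D) can be sketched as follows. This is a stdlib-only illustration of the idea, not the authors' pipeline: the caption names GroupKFold, whereas this sketch swaps in a simple round-robin assignment of patients to folds, and the helper names `temporal_split` and `grouped_folds`, the patient identifiers, and the three-days-per-patient toy data are all hypothetical. Hyperparameter search and model fitting are omitted.

```python
from typing import List, Tuple


def temporal_split(admit_order: List[str], test_frac: float = 0.10) -> Tuple[List[str], List[str]]:
    """Split patient ids (sorted oldest to most recent) so the newest go to test."""
    n_test = max(1, round(len(admit_order) * test_frac))
    return admit_order[:-n_test], admit_order[-n_test:]


def grouped_folds(row_patients: List[str], n_folds: int) -> List[Tuple[List[int], List[int]]]:
    """Build (train_idx, val_idx) pairs where all rows of a patient share one fold."""
    patients = sorted(set(row_patients))
    # Round-robin assignment: a stand-in for a library group-aware splitter.
    fold_of = {p: i % n_folds for i, p in enumerate(patients)}
    folds = []
    for k in range(n_folds):
        val = [i for i, p in enumerate(row_patients) if fold_of[p] == k]
        train = [i for i, p in enumerate(row_patients) if fold_of[p] != k]
        folds.append((train, val))
    return folds


# Hypothetical cohort of 20 patients, ordered by admission time.
patients = [f"P{i:02d}" for i in range(20)]
train_pts, test_pts = temporal_split(patients)      # 90/10, newest patients held out

# Each training patient contributes three patient-day rows.
rows = [p for p in train_pts for _ in range(3)]
folds = grouped_folds(rows, n_folds=10)

for train_idx, val_idx in folds:
    train_ids = {rows[i] for i in train_idx}
    val_ids = {rows[i] for i in val_idx}
    assert not (train_ids & val_ids)                # no patient appears on both sides
```

Grouping by patient matters because consecutive days from the same patient are highly correlated; splitting by row would let a model see near-duplicates of its validation data during training.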

**Figure 2. Unsupervised clustering reveals physiologic states associated with impending VAP.**
(A) Similarity-based clustering of patient-days over a 5-day prediction window identifies distinct physiologic clusters, with VAP-labeled days enriched in clusters marked by heightened inflammation and increased ventilatory support. (B) Aggregated temporal trajectories of key clinical variables in the five days preceding BAL-confirmed VAP demonstrate coordinated shifts across inflammatory, hemodynamic, and metabolic features. (C) UMAP projection of patient-day feature vectors shows spatial concentration of VAP days within specific regions of the embedded space, despite pneumonia labels not being used during clustering.

**Figure 3. Model interpretability and temporal risk dynamics for VAP prediction.**
(A) SHAP summary plot for the XGBoost model predicting VAP within five days, illustrating the direction and magnitude of feature contributions across patient-days. Higher values of ventilatory support variables (e.g., days on ventilator, plateau pressure) and impaired neurologic status were associated with increased predicted VAP risk. (B) Feature importance rankings from a random forest model trained on the same prediction task, demonstrating similar prioritization. (C) Temporal evolution of predicted VAP risk probabilities from Day −5 to Day −1 prior to diagnosis.
