Multicenter comparative analysis of local and aggregated data training strategies in COVID-19 outcome prediction with Machine learning

Carine Savalli¹,², Roberta Moreira Wichmann³, Fabiano Barcellos Filho², Fernando Timoteo Fernandes⁴, Alexandre Dias Porto Chiavegatto Filho²; IACOV-BR Network

Affiliations

¹ Federal University of São Paulo, Department of Public Politics and Public Health, Santos, Brazil.
² School of Public Health, University of São Paulo, São Paulo, Brazil.
³ Brazilian Institute of Education, Development and Research-IDP, Economics Graduate Program, Brasilia, Brazil.
⁴ FIAP-Faculdade de Informática e Administração Paulista, São Paulo, Brazil.

PMID: 39723970
PMCID: PMC11670925
DOI: 10.1371/journal.pdig.0000699

Carine Savalli et al. PLOS Digit Health. 2024 Dec 26;3(12):e0000699. doi: 10.1371/journal.pdig.0000699. eCollection 2024 Dec.

Abstract

Machine learning (ML) is a promising tool for assisting clinical decision-making and improving diagnosis and prognosis, especially in developing regions. It is often used with large samples that aggregate data from different regions and hospitals. However, it is unclear how this aggregation affects predictions in local centers. This study compares strategies that aggregate data from several hospitals in Brazil with a local training strategy in each hospital to predict two COVID-19 outcomes: intensive care unit (ICU) admission and mechanical ventilation (MV) use. The study included 6,046 patients from 14 hospitals, with local sample sizes ranging from 47 to 1,500 patients. Machine learning models for structured data were trained using extreme gradient boosting, LightGBM, and CatBoost. Seven data aggregation strategies based on hospital geographic regions were compared with local training, and the best strategy was determined by analyzing the area under the ROC curve (AUROC). SHAP (Shapley Additive exPlanations) values were used to assess the contribution of variables to predictions. Additionally, a metafeatures analysis examined how hospital characteristics influence the selection of the best strategy. The local training strategy was the most effective approach for 11 of the 14 hospitals (79%) for ICU admission and for 10 of the 14 hospitals (71%) for MV use. Metafeatures analysis suggested that hospitals with smaller sample sizes generally performed better with an aggregated data strategy than with local training. Our study brings to light an important concern about the impact of grouping data from different hospitals in predictive machine learning models. These findings contribute to the ongoing debate about the trade-off between increasing sample size and bringing together heterogeneous scenarios.
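The strategy-selection step described in the abstract (pick, per hospital, the training strategy with the highest AUROC) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the strategy labels, and the toy labels and predicted risks are all hypothetical, and the AUROC is computed with the standard pairwise (Mann-Whitney) formulation rather than any particular library.

```python
from typing import Dict, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case receives a
    higher score than a random negative case (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def best_strategy(auroc_by_strategy: Dict[str, float]) -> str:
    """Return the name of the strategy with the highest AUROC for one hospital."""
    return max(auroc_by_strategy, key=auroc_by_strategy.get)


def share_pct(n_best: int, n_total: int) -> int:
    """Share of hospitals, rounded to a whole percentage."""
    return round(100 * n_best / n_total)


# Hypothetical test-set outcomes and predicted risks for ONE hospital,
# one score vector per training strategy (all values are made up).
y_true = [1, 0, 1, 0, 0, 1]
predicted = {
    "local": [0.9, 0.2, 0.7, 0.4, 0.1, 0.6],
    "aggregated_region": [0.8, 0.5, 0.4, 0.6, 0.2, 0.7],
}
scores = {name: auroc(y_true, p) for name, p in predicted.items()}
winner = best_strategy(scores)

# Counts reported in the abstract: local training was best for 11 of 14
# hospitals (ICU admission) and 10 of 14 hospitals (MV use).
icu_share = share_pct(11, 14)
mv_share = share_pct(10, 14)
```

Applied to the counts reported above, `share_pct` reproduces the stated 79% and 71%. In a real analysis the score vectors would come from models trained under each strategy and evaluated on the same local test set, which this sketch does not attempt.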

Copyright: © 2024 Savalli et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1**
Box-plots of the absolute Shapley value obtained for the 14 hospitals for (A) ICU admission and (B) mechanical ventilation use. Each graph shows the 10 variables which, on average (for the 14 hospitals), presented high contributions to predicting the outcome.
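The ranking behind a figure like this one (variables ordered by their absolute Shapley value averaged over hospitals) might be computed roughly as below. This is only a sketch under stated assumptions: the input layout (one mean absolute SHAP value per variable per hospital), the variable and hospital names, the numbers, and the choice to treat a variable missing from a hospital as contributing 0.0 are all hypothetical and not taken from the paper.

```python
from statistics import mean
from typing import Dict, List


def top_variables(abs_shap: Dict[str, Dict[str, float]], k: int = 10) -> List[str]:
    """Rank variables by mean absolute SHAP value averaged across hospitals.

    abs_shap maps hospital -> {variable -> mean |SHAP| in that hospital}.
    A variable absent from a hospital is treated as 0.0 (an assumption).
    """
    variables = set()
    for per_variable in abs_shap.values():
        variables.update(per_variable)
    averaged = {
        v: mean(h.get(v, 0.0) for h in abs_shap.values())
        for v in variables
    }
    # Sort by descending average; break ties alphabetically for determinism.
    return sorted(averaged, key=lambda v: (-averaged[v], v))[:k]


# Hypothetical per-hospital values; names and numbers are illustrative only.
example = {
    "hospital_A": {"age": 0.30, "crp": 0.10, "spo2": 0.25},
    "hospital_B": {"age": 0.20, "crp": 0.15, "spo2": 0.35},
}
top_two = top_variables(example, k=2)
```

With the default `k=10`, the function returns the ten variables with the largest average contribution, which matches the selection the caption describes.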
