PLoS One. 2023 Apr 13;18(4):e0284150.

doi: 10.1371/journal.pone.0284150. eCollection 2023.

Extracting relevant predictive variables for COVID-19 severity prognosis: An exhaustive comparison of feature selection techniques

Miren Hayet-Otero¹ ² ³, Fernando García-García¹, Dae-Jin Lee¹ ⁴, Joaquín Martínez-Minaya⁵, Pedro Pablo España Yandiola⁶, Isabel Urrutia Landa⁷, Mónica Nieves Ermecheo⁷ ⁸, José María Quintana⁸, Rosario Menéndez⁹, Antoni Torres¹⁰, Rafael Zalacain Jorge¹¹, Inmaculada Arostegui¹ ¹²; with the COVID-19 & Air Pollution Working Group

Affiliations

¹ Basque Center for Applied Mathematics (BCAM), Bilbao, Basque Country, Spain.
² Department of Electronic Technology, University of the Basque Country (UPV/EHU), Leioa, Basque Country, Spain.
³ Basque Research and Technology Alliance (BRTA), TECNALIA, Derio, Basque Country, Spain.
⁴ School of Science and Technology, IE University, Madrid, Madrid, Spain.
⁵ Department of Applied Statistics and Operational Research, and Quality, Universitat Politècnica de València (UPV), Valencia, Valencian Community, Spain.
⁶ Respiratory Service, Galdakao-Usansolo University Hospital, Galdakao, Basque Country, Spain.
⁷ BioCruces Bizkaia Health Research Institute, Barakaldo, Basque Country, Spain.
⁸ Research Unit, Galdakao-Usansolo University Hospital, Galdakao, Basque Country, Spain.
⁹ Pneumology Department, La Fe University and Polytechnic Hospital, Valencia, Valencian Community, Spain.
¹⁰ Pneumology Department, Hospital Clínic of Barcelona, Barcelona, Catalonia, Spain.
¹¹ Pneumology Service, Cruces University Hospital, Barakaldo, Basque Country, Spain.
¹² Department of Mathematics, University of the Basque Country (UPV/EHU), Leioa, Basque Country, Spain.

PMID: 37053151
PMCID: PMC10101453
DOI: 10.1371/journal.pone.0284150

Miren Hayet-Otero et al. PLoS One. 2023.

Abstract

With the COVID-19 pandemic having caused unprecedented numbers of infections and deaths, large research efforts have been undertaken to increase our understanding of the disease and of the factors that determine diverse clinical evolutions. Here we focused on a fully data-driven exploration of which factors (clinical or otherwise) were most informative for predicting SARS-CoV-2 pneumonia severity via machine learning (ML). In particular, feature selection (FS) techniques, designed to reduce the dimensionality of data, allowed us to characterize which of our variables were the most useful for ML prognosis. We conducted a multi-centre clinical study, enrolling n = 1548 patients hospitalized due to SARS-CoV-2 pneumonia, of whom 792, 238, and 518 experienced low, medium and high-severity evolutions, respectively. Up to 106 patient-specific clinical variables were collected at admission, although 14 of them had to be discarded for containing ⩾60% missing values. Alongside 7 socioeconomic attributes and 32 exposures to air pollution (chronic and acute), these became d = 148 features after variable encoding. We addressed this ordinal classification problem as both an ML classification task and a regression task. Two imputation techniques for missing data were explored, along with a total of 166 unique FS algorithm configurations: 46 filters, 100 wrappers and 20 embeddeds. Of these, 21 setups achieved satisfactory bootstrap stability (⩾0.70) with reasonable computation times: 16 filters, 2 wrappers, and 3 embeddeds. The subsets of features selected by each technique showed modest Jaccard similarities to one another. However, they consistently pointed to the importance of certain explanatory variables, namely: the patient's C-reactive protein (CRP), pneumonia severity index (PSI), respiratory rate (RR) and oxygen levels (saturation SpO2, and the quotients SpO2/RR and arterial SatO2/FiO2), the neutrophil-to-lymphocyte ratio (NLR) (and, to a certain extent, neutrophil and lymphocyte counts separately), lactate dehydrogenase (LDH), and procalcitonin (PCT) levels in blood. A remarkable agreement was found a posteriori between our strategy and independent clinical research investigating risk factors for COVID-19 severity. Hence, these findings stress the suitability of this type of fully data-driven approach for knowledge extraction, as a complement to clinical perspectives.
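The abstract reports a bootstrap stability threshold (⩾0.70) but does not name the stability measure used. As an illustration only, the sketch below computes one commonly used frequency-based estimator (the one proposed by Nogueira, Sechidis and Brown), which is an assumption here and not necessarily the authors' choice. The function name `selection_stability` and the toy feature names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Set


def selection_stability(subsets: Sequence[Set[str]], n_features: int) -> float:
    """Frequency-based stability of M selected subsets out of n_features.

    Returns 1.0 when every bootstrap run selects the same subset; values
    near 0 indicate selections no more consistent than chance.
    """
    m = len(subsets)
    if m < 2:
        raise ValueError("need at least two bootstrap subsets")
    # How many runs selected each feature (unselected features count as 0).
    counts = Counter(feature for subset in subsets for feature in subset)
    # Unbiased variance of each feature's selection indicator, averaged
    # over all n_features; never-selected features contribute zero.
    mean_variance = sum(
        (m / (m - 1)) * (c / m) * (1 - c / m) for c in counts.values()
    ) / n_features
    k_bar = sum(len(subset) for subset in subsets) / m  # average subset size
    denom = (k_bar / n_features) * (1 - k_bar / n_features)
    if denom == 0:
        raise ValueError("undefined when no features or all features are selected")
    return 1.0 - mean_variance / denom


# Toy usage with d = 148 features, as in the abstract.
runs = [{"CRP", "PSI"}, {"CRP", "PSI"}, {"CRP", "PSI"}]
print(selection_stability(runs, n_features=148))
```

Under this estimator, a configuration would pass the abstract's criterion when the returned value is at least 0.70.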

Copyright: © 2023 Hayet-Otero et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Flow chart for the included and excluded variables, and feature encoding.**

**Fig 2. Jaccard similarity index between feature subsets.**
For all pairs of stable algorithms, these grouped by n_FS specification. Results were averaged over M = 100 bootstrap samples. (a) n_FS = 5. (b) n_FS = 10. (c) n_FS = 20. (d) n_FS = 40. (e) n_FS not pre-fixed.
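The Jaccard index in this figure compares the feature subsets chosen by two algorithms. A minimal sketch follows, assuming that the two algorithms are run on the same bootstrap samples and that the index is averaged over those paired runs; the pairing scheme and the function names are assumptions, not details given in the caption.

```python
from typing import Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a ∩ b| / |a ∪ b|; taken as 1.0 for two empty sets."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def mean_jaccard(runs_a: Sequence[Set[str]], runs_b: Sequence[Set[str]]) -> float:
    """Average Jaccard index over paired bootstrap runs of two algorithms."""
    if len(runs_a) != len(runs_b) or not runs_a:
        raise ValueError("expected two non-empty sequences of equal length")
    return sum(jaccard(a, b) for a, b in zip(runs_a, runs_b)) / len(runs_a)


# Toy usage: two hypothetical algorithms over M = 2 bootstrap samples.
algo_1 = [{"CRP", "PSI"}, {"CRP"}]
algo_2 = [{"PSI", "LDH"}, {"CRP"}]
print(mean_jaccard(algo_1, algo_2))
```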

**Fig 3. Features selected in ⩾80% cases by the stable MI filters: n_FS = 20 or 40.**
(a) MI Classif—knn imputer: n_FS = 20. (b) MI Regress—knn imputer: n_FS = 20. (c) MI Classif—iterat imputer: n_FS = 20. (d) MI Regress—iterat imputer: n_FS = 20. (e) MI Classif—knn imputer: n_FS = 40. (f) MI Regress—knn imputer: n_FS = 40. (g) MI Classif—iterat imputer: n_FS = 40. (h) MI Regress—iterat imputer: n_FS = 40.
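The "selected in ⩾80% cases" criterion used in Figs 3 to 5 can be read as a selection-frequency threshold over the bootstrap runs. The sketch below illustrates that reading; the function name `frequently_selected` and the toy data are hypothetical, and the exact counting procedure used for the figures is not stated in the captions.

```python
from collections import Counter
from typing import List, Sequence, Set


def frequently_selected(
    subsets: Sequence[Set[str]], min_fraction: float = 0.8
) -> List[str]:
    """Features chosen in at least min_fraction of the runs, most frequent first."""
    if not subsets:
        raise ValueError("need at least one selected subset")
    m = len(subsets)
    counts = Counter(feature for subset in subsets for feature in subset)
    kept = [f for f, c in counts.items() if c / m >= min_fraction]
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(kept, key=lambda f: (-counts[f], f))


# Toy usage over 5 bootstrap runs: CRP in 5, PSI in 4, LDH in 2.
runs = [
    {"CRP", "PSI", "LDH"},
    {"CRP", "PSI"},
    {"CRP", "PSI", "LDH"},
    {"CRP", "PSI"},
    {"CRP"},
]
print(frequently_selected(runs))
```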

**Fig 4. Features selected in ⩾80% cases by the stable RBA filters: All of them without imputation.**
(a) ReliefF (k = 100): n_FS = 5. (b) MultiSURF: n_FS = 5. (c) ReliefF (k = 100): n_FS = 10. (d) MultiSURF: n_FS = 10. (e) ReliefF (k = 100): n_FS = 20. (f) MultiSURF: n_FS = 20. (g) ReliefF (k = 100): n_FS = 40. (h) MultiSURF: n_FS = 40.

**Fig 5. Features selected in ⩾80% cases by the stable RFE wrappers (*a,b*) and embeddeds (*c–e*): All of them with the knn imputer.**
(a) RFE: n_FS = 5. (b) RFE: n_FS = 20. (c) L¹-LR: C = 0.005. (d) Lasso: α = 0.050. (e) Lasso: α = 0.075.

Cited by

Obtaining patient phenotypes in SARS-CoV-2 pneumonia, and their association with clinical severity and mortality.
García-García F, Lee DJ, Nieves-Ermecheo M, Bronte O, España PP, Quintana JM, Menéndez R, Torres A, Ruiz Iturriaga LA, Urrutia I; COVID-19 & Air Pollution Working Group. Pneumonia (Nathan). 2024 Jun 25;16(1):12. doi: 10.1186/s41479-024-00132-0. PMID: 38915125.
Comparative analysis of feature selection techniques for COVID-19 dataset.
Mohtasham F, Pourhoseingholi M, Hashemi Nazari SS, Kavousi K, Zali MR. Sci Rep. 2024 Aug 11;14(1):18627. doi: 10.1038/s41598-024-69209-6. PMID: 39128991.
