Biomol Biomed. 2025 Jun 18;26(2):333-353.
doi: 10.17305/bb.2025.12378.

Unveiling etiology and mortality risks in community-acquired pneumonia: A machine learning approach

Alaa Ali et al. Biomol Biomed.

Abstract

Community-acquired pneumonia (CAP) is associated with high mortality, and accurate diagnosis and risk prediction are essential for improving patient outcomes. Traditional diagnostic methods have limitations, prompting the use of machine learning (ML) to enhance diagnostic precision and treatment strategies. This study aims to develop ML models to predict CAP etiology and mortality using clinical data to enable early intervention. A retrospective cohort study was conducted on 251 adult CAP patients admitted to two Jordanian hospitals between March 2021 and February 2024. Various clinical data were analyzed using ML techniques, including linear regression, random forest, SHapley Additive exPlanations (SHAP), lasso regression, mutual information analysis, logistic regression, and correlation analysis. Key predictors of CAP survival included zinc, vitamin C, enoxaparin, and insulin bolus. Mutual information analysis identified neutrophils, alanine transaminase, mean corpuscular volume, hemoglobin, and platelets as significant mortality predictors, while lasso regression highlighted meropenem, arterial blood gases, PCO₂, and platelet count. Logistic regression confirmed intensive care unit (ICU) stay, pH, pneumonia severity index, white blood cell (WBC) count, and bicarbonate levels as crucial variables. Interestingly, lymphocyte count emerged as the strongest predictor of bacterial CAP, conflicting with established knowledge that associates neutrophils with bacterial infections. However, findings related to HCO₃, blood urea nitrogen, and WBC levels were consistent with clinical expectations. SHAP analysis highlighted basophils and fever as key predictors. Further investigation is needed to resolve conflicting findings and optimize predictive models. ML offers promising applications for CAP prognosis but requires refinement to address discrepancies and improve reliability in clinical decision-making.

Conflict of interest statement

Conflicts of interest: Authors declare no conflicts of interest.

Figures

Figure 1.
Overview of the six-step machine learning workflow for predicting outcome probabilities.
Figure 2.
Heatmap of strongly correlated features (Pearson correlation coefficients). Warm colors represent higher correlations, while cool colors indicate negative correlations. Strong positive correlations are observed between inflammatory markers (e.g., C-reactive protein [CRP], ferritin) and between white blood cell (WBC) count and neutrophil count (r > 0.7). This pattern suggests potential collinearity among markers of systemic inflammation, which may affect model stability and motivate variable selection or regularization strategies.
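The collinearity check described in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the helper names `pearson_r` and `strongly_correlated_pairs` are hypothetical, the values in `toy` are made up, and only the 0.7 cutoff is taken from the caption.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strongly_correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every feature pair with |r| > threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs


# Made-up values for illustration only; not data from the study.
toy = {
    "wbc": [6.0, 9.5, 12.0, 15.5, 18.0],
    "neutrophils": [3.5, 6.0, 8.5, 12.0, 14.5],
    "hemoglobin": [13.0, 11.5, 14.0, 12.0, 13.5],
}
flagged = strongly_correlated_pairs(toy)
print(flagged)  # only the wbc/neutrophils pair exceeds the cutoff here
```

Pairs flagged this way are candidates for dropping one member or for a regularized model, which is the motivation the caption gives.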
Figure 3.
Heatmap of top correlated features with mortality outcomes. Pearson correlation coefficients: Heatmap of features and target outcome. Warm colors represent higher correlations, while cool colors indicate negative correlations. Top positive correlates: age (r ≈ 0.45), neutrophils (r ≈ 0.41), CRP (r ≈ 0.39), ferritin (r ≈ 0.36); top negative correlates: lymphocytes (r ≈ −0.42), oxygen saturation (r ≈ −0.38), hemoglobin (r ≈ −0.33). A focused panel also highlights high correlations for vitamin C, zinc, enoxaparin (CLEXAN), and insulin bolus. Abbreviation: LOS: Length of stay.
Figure 4.
Mutual information of features related to mortality outcomes. Mutual information scores quantify each feature's dependency on mortality, producing a ranked list of informative predictors. Top contributors include creatinine, WBC (including eosinophils), and neutrophil count; the number of previous hospitalizations also ranks highly. RDW and ALT are additional significant features. Overall, inflammatory markers (eosinophils, neutrophils, and basophils) show high informativeness, underscoring their value for model prioritization and clinical decision-making. Abbreviations: WBC: White blood cell count; RDW: Red blood cell distribution width; ALT: Alanine aminotransferase.
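For readers unfamiliar with the score, the sketch below computes mutual information between a binned feature and a binary outcome and then ranks features by it. The caption does not state which estimator was used, so the plug-in estimate for discrete variables is an assumption here, as are the function name `mutual_information` and the made-up toy data.

```python
from collections import Counter
from math import log


def mutual_information(feature, outcome):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, outcome))
    count_f = Counter(feature)
    count_o = Counter(outcome)
    mi = 0.0
    for (f, o), count in joint.items():
        p_joint = count / n
        p_indep = (count_f[f] / n) * (count_o[o] / n)
        mi += p_joint * log(p_joint / p_indep)
    return mi


# Made-up data for illustration only; not data from the study.
died = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "informative": ["low"] * 4 + ["high"] * 4,   # tracks the outcome exactly
    "uninformative": ["low", "high"] * 4,        # independent of the outcome
}
scores = {name: mutual_information(vals, died) for name, vals in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A score of zero means the feature carries no information about the outcome in the sample; unlike a correlation coefficient, the score has no sign, so it ranks relevance but does not indicate direction.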
Figure 5.
Top feature coefficients from Lasso regression. Coefficients indicate each variable's direction and magnitude of association with mortality. “Culture” shows the largest positive coefficient (β ≈ 0.15) but is not clinically meaningful (test-performed indicator). Meropenem has a strong positive coefficient; ABG variables (pH, Base Excess, PCO2) and platelet count also contribute positively, underscoring their relevance for risk prediction. Abbreviation: ICU: Intensive care unit.
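One reason a lasso plot shows only a short list of "top" coefficients is that the L1 penalty sets small coefficients exactly to zero. A minimal sketch of that mechanism follows, assuming the textbook special case of an orthonormal design, where each lasso coefficient is the soft-thresholded least-squares coefficient. The feature names, input values, and penalty `lam` are hypothetical and unrelated to the study's fitted model.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients; not values from the study.
least_squares = {"feature_a": 0.15, "feature_b": -0.30, "feature_c": 0.02}
lam = 0.05  # hypothetical penalty strength

lasso_like = {name: soft_threshold(beta, lam) for name, beta in least_squares.items()}
selected = [name for name, beta in lasso_like.items() if beta != 0.0]
print(selected)  # feature_c is dropped because |0.02| <= lam
```

Surviving coefficients are also shrunk toward zero, so their magnitudes are better read as a ranking than as unbiased effect sizes.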
Figure 6.
Feature importance from logistic regression. Coefficients indicate direction and strength. pH is the most impactful predictor with a strong negative coefficient (≈ −1.0). ICU_LOS and LOS show substantial positive effects (≈ 0.6 and ≈ 0.4), indicating longer stays are linked to higher mortality. Additional important contributors include PCO2, albumin, WBC, bicarbonate (HCO3), and ABG measures. Abbreviations: BUN: Blood urea nitrogen; WBC: White blood cell count; PSI: Pneumonia severity index.
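Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to interpret. The sketch below applies this to the approximate values quoted in the caption. Whether the features were standardized before fitting is not stated, so "one unit" may mean one raw unit or one standard deviation; that is an open assumption.

```python
from math import exp

# Approximate coefficients as quoted in the Figure 6 caption.
coefficients = {"pH": -1.0, "ICU_LOS": 0.6, "LOS": 0.4}

# exp(coefficient) is the multiplicative change in the odds of the outcome
# for a one-unit increase in the feature, with the other features held fixed.
odds_ratios = {name: exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio ~ {ratio:.2f}")
# pH: odds ratio ~ 0.37
# ICU_LOS: odds ratio ~ 1.82
# LOS: odds ratio ~ 1.49
```

Read this way, a coefficient of about −1.0 corresponds to the odds falling to roughly 37% of their previous value per unit increase, while 0.6 corresponds to the odds rising by roughly 82%.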
Figure 7.
SHAP summary plot: Analyzing the influence of medications and laboratory findings on model predictions. Features are ranked by mean absolute SHAP values. Antibiotic use shows the strongest negative impact (SHAP < −0.6), with basophils and initial fever also lowering predicted risk. Enoxaparin and ciprofloxacin/piperacillin–tazobactam susceptibility align with lower risk, whereas meropenem/imipenem/amikacin susceptibility and tocilizumab, prednisolone, and anticoagulant use have positive impacts, likely reflecting greater disease severity. Abbreviations: HDL: High-density lipoprotein; SHAP: Shapley additive explanations.
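The ranking rule mentioned in this caption (mean absolute SHAP value per feature) is simple to reproduce once per-patient SHAP values are available. The sketch below assumes those values have already been computed elsewhere; the function name `rank_by_mean_abs` is hypothetical and the numbers are invented for illustration.

```python
def rank_by_mean_abs(shap_values):
    """Order feature names by mean absolute SHAP value, largest first."""
    mean_abs = {
        name: sum(abs(v) for v in values) / len(values)
        for name, values in shap_values.items()
    }
    return sorted(mean_abs, key=mean_abs.get, reverse=True)


# Hypothetical per-patient SHAP values, one list per feature; not study output.
toy_shap = {
    "enoxaparin": [-0.1, 0.05, -0.05, 0.0],
    "antibiotics": [-0.7, -0.6, 0.1, -0.8],
    "basophils": [-0.3, 0.2, -0.4, -0.1],
}
ranking = rank_by_mean_abs(toy_shap)
print(ranking)  # ['antibiotics', 'basophils', 'enoxaparin']
```

Taking absolute values before averaging means a feature that pushes predictions strongly in both directions still ranks highly; the sign of the effect is read from the individual points in the summary plot, not from the ranking.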
Figure 8.
Heatmap depicting correlations between clinical variables and bacterial infections. Pearson correlation coefficients: Heatmap of features and target outcome. Warm colors represent higher correlations, while cool colors indicate negative correlations. The target was encoded as 1 for bacterial infection and 2 for no infection; thus higher coefficients indicate a greater likelihood of bacterial infection. This visualization supports ML-based etiology prediction in CAP. Abbreviations: SOB: Shortness of breath; CAP: Community-acquired pneumonia; ML: Machine learning.
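The sign of a Pearson coefficient against a two-class target depends on which class receives the larger numeric code, so the direction of each cell in a heatmap like this one should be read against the stated encoding. The sketch below is a sanity check on made-up data, not a reanalysis: it computes r under a 1/2 coding like the one described and again after recoding to a 0/1 indicator where 1 means bacterial. The helper `pearson_r` and all values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical feature that runs higher in bacterial cases; not study data.
feature = [9.0, 8.5, 7.0, 3.0, 2.5, 4.0]
target_1_2 = [1, 1, 1, 2, 2, 2]  # 1 = bacterial, 2 = no infection
is_bacterial = [1 if t == 1 else 0 for t in target_1_2]  # 1 = bacterial, 0 = not

r_original = pearson_r(feature, target_1_2)
r_recoded = pearson_r(feature, is_bacterial)
print(r_original < 0, r_recoded > 0)  # same magnitude, opposite sign
```

Recoding the target so that the class of interest is 1 and the other class is 0 leaves the magnitudes unchanged and makes a positive coefficient mean "higher in bacterial cases", which can make the direction easier to check.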
Figure 9.
Feature importance in logistic regression models. The horizontal bar plot illustrates the coefficient of each feature, reflecting its actual contribution to the model’s predictions. A positive coefficient indicates a positive association with the target variable, while a negative coefficient signifies a negative association. Abbreviation: PCR: Polymerase chain reaction.
Figure 10.
SHAP summary plot: Analyzing feature impact on model output. This figure presents the SHAP values that demonstrate the influence of various features on the model’s predictions. Positive SHAP values indicate a beneficial contribution to the outcome, whereas negative values reflect a detrimental impact. The color gradient, ranging from blue (indicating low feature values) to red (indicating high feature values), underscores the relationship between feature magnitude and model output. Notably influential features include Antibiotics, Basophils, and initial findings of fever, each exhibiting distinct effects based on their respective values. Abbreviations: SHAP: Shapley additive explanations; PCR: Polymerase chain reaction.
Figure 11.
Correlation of clinical and treatment features with bacterial infection as a primary outcome. Antibiotics show the strongest positive correlation, with enoxaparin and anticoagulant use also positively associated; initial fever and basophils exhibit moderate positive correlations. In contrast, mAb tocilizumab and amikacin susceptibility show weak negative correlations, while other susceptibilities (e.g., cefepime, ciprofloxacin) are minimal or near zero. Abbreviation: HDL: High-density lipoprotein.
