Enhancing Pulmonary Disease Prediction Using Large Language Models With Feature Summarization and Hybrid Retrieval-Augmented Generation: Multicenter Methodological Study Based on Radiology Report
- PMID: 40499132
- PMCID: PMC12176309
- DOI: 10.2196/72638
Abstract
Background: The rapid advancements in natural language processing, particularly the development of large language models (LLMs), have opened new avenues for managing complex clinical text data. However, the inherent complexity and specificity of medical texts present significant challenges for the practical application of prompt engineering in diagnostic tasks.
Objective: This paper explores LLMs with a new prompt engineering technique to enhance model interpretability and to improve pulmonary disease prediction performance relative to a traditional deep learning model.
Methods: A retrospective dataset of 2965 chest computed tomography (CT) radiology reports was constructed. The reports came from 4 cohorts: healthy individuals and patients with pulmonary tuberculosis, lung cancer, and pneumonia. A novel prompt engineering strategy was then proposed that integrates feature summarization (F-Sum), chain-of-thought (CoT) reasoning, and a hybrid retrieval-augmented generation (RAG) framework. The feature summarization approach, leveraging term frequency-inverse document frequency (TF-IDF) and K-means clustering, was used to extract and distill key radiological findings related to the 3 diseases. Simultaneously, the hybrid RAG framework combined dense and sparse vector representations to enhance the LLMs' comprehension of disease-related text. In total, 3 state-of-the-art LLMs, GLM-4-Plus, GLM-4-Air (Zhipu AI), and GPT-4o (OpenAI), were integrated with the prompt strategy to evaluate their performance in recognizing pneumonia, tuberculosis, and lung cancer. A traditional deep learning model, BERT (Bidirectional Encoder Representations from Transformers), was also compared to assess whether the LLMs offered an advantage. Finally, the proposed method was tested on an external validation dataset consisting of 343 chest CT reports from another hospital.
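The F-Sum step can be illustrated with a minimal sketch of TF-IDF weighting followed by K-means clustering, assuming reports are already grouped by disease cohort; the miniature corpus, cluster count, and number of extracted terms below are illustrative assumptions rather than the study's actual data or hyperparameters.

```python
# Minimal sketch of TF-IDF + K-means feature summarization (illustrative only).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def summarize_findings(reports, n_clusters=5, top_k_terms=10):
    """Cluster one cohort's reports and return the highest-weighted TF-IDF
    terms per cluster as candidate feature-summary phrases."""
    vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    X = vectorizer.fit_transform(reports)
    km = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit(X)
    terms = np.array(vectorizer.get_feature_names_out())
    summaries = []
    for centroid in km.cluster_centers_:
        # Rank vocabulary terms by their weight in the cluster centroid.
        top_idx = centroid.argsort()[::-1][:top_k_terms]
        summaries.append(terms[top_idx].tolist())
    return summaries

# Hypothetical miniature tuberculosis cohort, for illustration only.
tb_reports = [
    "Cavitary lesion in the right upper lobe with tree-in-bud nodules.",
    "Calcified granulomas and fibrotic bands in both upper lobes.",
]
print(summarize_findings(tb_reports, n_clusters=2, top_k_terms=5))
```

The distilled phrases would then serve as disease-specific feature summaries inserted into the prompt alongside the automatically generated CoT.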
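Similarly, the hybrid RAG component, described in the abstract as combining dense and sparse vector representations, can be sketched as rank fusion over two retrievers; the embed() callable, the use of TF-IDF cosine similarity for the sparse side, and the reciprocal-rank-fusion weighting are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of hybrid (dense + sparse) retrieval with reciprocal rank fusion.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_retrieve(query, corpus, embed, top_k=3, k_rrf=60):
    """Return the top_k passages ranked by fusing a sparse (TF-IDF) and a
    dense (embedding) similarity ranking; embed(text) is assumed to return
    a 1D numpy vector from whatever embedding model is in use."""
    # Sparse ranking: TF-IDF cosine similarity (BM25 is a common alternative).
    tfidf = TfidfVectorizer().fit(corpus + [query])
    sparse_scores = cosine_similarity(tfidf.transform([query]),
                                      tfidf.transform(corpus))[0]
    # Dense ranking: cosine similarity between embedding vectors.
    doc_vecs = np.vstack([embed(d) for d in corpus])
    dense_scores = cosine_similarity(embed(query).reshape(1, -1), doc_vecs)[0]
    # Reciprocal rank fusion combines the two rankings without score scaling.
    def ranks(scores):
        order = np.argsort(scores)[::-1]
        r = np.empty_like(order)
        r[order] = np.arange(1, len(scores) + 1)
        return r
    fused = 1.0 / (k_rrf + ranks(sparse_scores)) + 1.0 / (k_rrf + ranks(dense_scores))
    return [corpus[i] for i in np.argsort(fused)[::-1][:top_k]]
```

In the study's pipeline, the retrieved passages would correspond to disease-related knowledge injected into the LLM prompt; here, corpus and embed stand in for that knowledge base and embedding model.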
Results: Compared with the BERT-based prediction model and various other prompt engineering techniques, our method with GLM-4-Plus achieved the best performance on the test dataset, attaining an F1-score of 0.89 and an accuracy of 0.89. On the external validation dataset, the proposed method with GPT-4o achieved the highest F1-score (0.86) and accuracy (0.92). Compared with the popular strategy of manually selected typical samples (few-shot) and a CoT designed by doctors (F1-score=0.83 and accuracy=0.83), the proposed method, which summarized disease characteristics (F-Sum) with the LLM and automatically generated the CoT, performed better (F1-score=0.89 and accuracy=0.90). Although the BERT-based model achieved similar results on the test dataset (F1-score=0.85 and accuracy=0.88), its predictive performance decreased markedly on the external validation set (F1-score=0.48 and accuracy=0.78).
Conclusions: These findings highlight the potential of LLMs to revolutionize pulmonary disease prediction, particularly in resource-constrained settings, by surpassing traditional models in both accuracy and flexibility. The proposed prompt engineering strategy not only improves predictive performance but also enhances the adaptability of LLMs in complex medical contexts, offering a promising tool for advancing disease diagnosis and clinical decision-making.
Keywords: LLM; RAG; large language models; prompt engineering; pulmonary disease prediction; retrieval-augmented generation.
© Ronghao Li, Shuai Mao, Congmin Zhu, Yingliang Yang, Chunting Tan, Li Li, Xiangdong Mu, Honglei Liu, Yuqing Yang. Originally published in the Journal of Medical Internet Research (https://www.jmir.org).
Similar articles
- Predicting 30-Day Postoperative Mortality and American Society of Anesthesiologists Physical Status Using Retrieval-Augmented Large Language Models: Development and Validation Study. J Med Internet Res. 2025 Jun 3;27:e75052. doi: 10.2196/75052. PMID: 40460423. Free PMC article.
- Developing an ICD-10 Coding Assistant: Pilot Study Using RoBERTa and GPT-4 for Term Extraction and Description-Based Code Selection. JMIR Form Res. 2025 Feb 11;9:e60095. doi: 10.2196/60095. PMID: 39935026. Free PMC article.
- Improving Large Language Models' Summarization Accuracy by Adding Highlights to Discharge Notes: Comparative Evaluation. JMIR Med Inform. 2025 Jul 24;13:e66476. doi: 10.2196/66476. PMID: 40705416. Free PMC article.
- Examining the Role of Large Language Models in Orthopedics: Systematic Review. J Med Internet Res. 2024 Nov 15;26:e59607. doi: 10.2196/59607. PMID: 39546795. Free PMC article.
- Applications of Large Language Models in the Field of Suicide Prevention: Scoping Review. J Med Internet Res. 2025 Jan 23;27:e63126. doi: 10.2196/63126. PMID: 39847414. Free PMC article.