Evaluation of reliability, repeatability, and confidence of ChatGPT for screening, monitoring, and treatment of interstitial lung disease in patients with systemic autoimmune rheumatic diseases
- PMID: 41036434
- PMCID: PMC12480815
- DOI: 10.1177/20552076251384233
Abstract
Background: In recent years, potential applications of ChatGPT in medication-related practices have drawn great attention for its intuitive user interfaces, chatbot, and powerful analytical capabilities. However, whether ChatGPT can be broadly applied in clinical practice remains controversial. Early screening, monitoring, and timely treatment are crucial for improving outcomes of interstitial lung disease (ILD) in systemic autoimmune rheumatic diseases (SARDs) due to its high morbidity and mortality rate. This study aimed to evaluate the reliability, repeatability, and confidence of ChatGPT models (GPT-4, GPT-4o mini, and GPT-4o) in delivering guideline-based recommendations for the screening, monitoring, and treatment of ILD in SARD patients.
Methods: Questions derived from the ACR/CHEST guideline for ILD in patients with SARDs were used to benchmark three versions of ChatGPT (GPT-4, GPT-4o mini, and GPT-4o) across three separate attempts. The responses were recorded, and their reliability, repeatability, and confidence were analyzed against the recommendations of the guideline.
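For readers who want to script a comparable benchmark, the sketch below shows one way to pose each guideline-derived question to the three models across repeated attempts. This is an illustrative assumption, not the authors' protocol: the study queried the ChatGPT interface, whereas this sketch uses the OpenAI Python SDK, and the example question and the `ask` helper are hypothetical.

```python
# Illustrative sketch only (not the authors' setup): querying each model
# three times per question via the OpenAI Python SDK (v1.x).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-4", "gpt-4o-mini", "gpt-4o"]  # API names for the three versions
N_ATTEMPTS = 3

def ask(model: str, question: str) -> str:
    """Send one guideline-derived question and return the model's answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Hypothetical example item; the study's actual questions come from the
# ACR/CHEST guideline and are not reproduced here.
questions = ["Should HRCT of the chest be used to screen for ILD in patients with SARDs?"]

answers = {
    (model, attempt): [ask(model, q) for q in questions]
    for model in MODELS
    for attempt in range(1, N_ATTEMPTS + 1)
}
```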
Results: GPT-4 demonstrated significant variability in reliability across the three attempts (P = .007). In contrast, the other versions showed no significant differences. GPT-4 and GPT-4o mini exhibited substantial interrater agreement (Kendall's W = 0.747 and 0.765, respectively), whereas GPT-4o demonstrated almost perfect interrater agreement (Kendall's W = 0.816). All three versions showed statistically significant differences in high confidence ratings (confidence score of ≥8 on the 1-10 scale) across the three attempts (P < .01). Given the higher consistency of GPT-4o and GPT-4o mini, a further comparison was conducted between them on the third attempt. No significant difference was observed in accuracy percentages on the third attempt between GPT-4o and GPT-4o mini (P = .597). Similarly, interrater agreement across the three attempts did not differ significantly between GPT-4o and GPT-4o mini (P = .152). Furthermore, the overconfidence percentage (confidence score of ≥8 assigned to incorrect answers) was 100% (22 of 22) for GPT-4o and 22.7% (10 of 44) for GPT-4o mini (P < .01).
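As a companion to these results, the following sketch shows how Kendall's W (treating the three attempts as raters over the question set) and the overconfidence percentage (share of incorrect answers scored ≥8) can be computed. It is a minimal illustration with made-up numbers, assuming NumPy/SciPy; it omits the tie correction sometimes applied to Kendall's W and is not the authors' analysis code.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance for an (m raters x n items) array.

    Uses the basic formula W = 12*S / (m^2 * (n^3 - n)); ties get midranks,
    but no tie correction is applied.
    """
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)  # rank items within each rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m**2 * (n**3 - n))

def overconfidence_pct(confidence, correct, threshold=8):
    """Percentage of incorrect answers given a confidence score >= threshold."""
    wrong = [c for c, ok in zip(confidence, correct) if not ok]
    if not wrong:
        return 0.0
    return 100.0 * sum(c >= threshold for c in wrong) / len(wrong)

# Hypothetical data: three attempts (raters) scoring five questions,
# plus confidence scores and correctness flags for one model.
attempts = [[8, 6, 9, 7, 5],
            [7, 6, 9, 8, 5],
            [8, 7, 9, 7, 6]]
print(f"Kendall's W: {kendalls_w(attempts):.3f}")

confidence = [9, 8, 6, 10, 7]
correct    = [True, False, False, False, True]
print(f"Overconfidence: {overconfidence_pct(confidence, correct):.1f}%")
```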
Conclusions: GPT-4o mini and GPT-4o demonstrated stable reliability across all three attempts, whereas GPT-4 did not. The repeatability of GPT-4o tended to be better than that of GPT-4o mini, although the difference was not statistically significant. Additionally, GPT-4o exhibited a higher tendency toward overconfidence than GPT-4o mini. Overall, GPT-4o performed most effectively in managing SARD-ILD but may exhibit overconfidence in certain scenarios.
Keywords: ChatGPT; systemic autoimmune rheumatic diseases; interstitial lung disease.
© The Author(s) 2025.
Conflict of interest statement
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.