Potential of ChatGPT and GPT-4 for Data Mining of Free-Text CT Reports on Lung Cancer
- PMID: 37724963
- DOI: 10.1148/radiol.231362
Abstract
Background The latest large language models (LLMs) solve unseen problems via user-defined text prompts without the need for retraining, offering potentially more efficient information extraction from free-text medical records than manual annotation.
Purpose To compare the performance of the LLMs ChatGPT and GPT-4 in data mining and labeling oncologic phenotypes from free-text CT reports on lung cancer by using user-defined prompts.
Materials and Methods This retrospective study included patients who underwent lung cancer follow-up CT between September 2021 and March 2023. A subset of 25 reports was reserved for prompt engineering to instruct the LLMs in extracting lesion diameters, labeling metastatic disease, and assessing oncologic progression. This output was fed into a rule-based natural language processing pipeline to match ground truth annotations from four radiologists and derive performance metrics. The oncologic reasoning of LLMs was rated on a five-point Likert scale for factual correctness and accuracy. The occurrence of confabulations was recorded. Statistical analyses included Wilcoxon signed rank and McNemar tests.
Results On 424 CT reports from 424 patients (mean age, 65 years ± 11 [SD]; 265 male), GPT-4 outperformed ChatGPT in extracting lesion parameters (98.6% vs 84.0%, P < .001), resulting in 96% correctly mined reports (vs 67% for ChatGPT, P < .001). GPT-4 achieved higher accuracy in identification of metastatic disease (98.1% [95% CI: 97.7, 98.5] vs 90.3% [95% CI: 89.4, 91.0]) and higher performance in generating correct labels for oncologic progression (F1 score, 0.96 [95% CI: 0.94, 0.98] vs 0.91 [95% CI: 0.89, 0.94]) (both P < .001). In oncologic reasoning, GPT-4 had higher Likert scale scores for factual correctness (4.3 vs 3.9) and accuracy (4.4 vs 3.3), with a lower rate of confabulation (1.7% vs 13.7%) than ChatGPT (all P < .001).
Conclusion When using user-defined prompts, GPT-4 outperformed ChatGPT in extracting oncologic phenotypes from free-text CT reports on lung cancer and demonstrated better oncologic reasoning with fewer confabulations.
© RSNA, 2023
Supplemental material is available for this article.
See also the editorial by Hafezi-Nejad and Trivedi in this issue.
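The abstract names the McNemar test and the F1 score but does not describe how they were computed. The following is a minimal Python sketch of both, assuming each model is scored correct or incorrect on the same set of reports. The function names and the toy data are hypothetical illustrations, not the authors' pipeline or results, and the exact (binomial) form of the McNemar test is an assumption, since the abstract does not say which variant was used.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_pairs(correct_a: Sequence[bool],
                     correct_b: Sequence[bool]) -> Tuple[int, int]:
    """Count paired items where exactly one of the two models was correct.

    Returns (b, c): b = only model A correct, c = only model B correct.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if other and not a)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis the discordant pairs follow Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score (harmonic mean of precision and recall) from raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical per-report correctness flags for two models (toy data only).
    model_a = [True, True, True, True, True, True, True, True, False, True]
    model_b = [True, False, False, True, False, False, True, False, False, False]
    b, c = discordant_pairs(model_a, model_b)
    print("discordant pairs:", b, c)
    print("exact McNemar p-value:", mcnemar_exact(b, c))
    print("F1 for tp=8, fp=2, fn=2:", f1_score(8, 2, 2))
```

Only the discordant pairs enter the McNemar test, which is why it suits a comparison of two models on the same reports: reports that both models handle correctly, or both handle incorrectly, carry no information about which model is better.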
Comment in
- Foundation AI Models and Data Extraction from Unlabeled Radiology Reports: Navigating Uncharted Territory. Radiology. 2023 Sep;308(3):e232308. doi: 10.1148/radiol.232308. PMID: 37724971. Free PMC article.