Classifying Patient Complaints Using Artificial Intelligence-Powered Large Language Models: Cross-Sectional Study
- PMID: 40768757
- PMCID: PMC12327907
- DOI: 10.2196/74231
Abstract
Background: Patient complaints provide valuable insights into the performance of health care systems, highlighting potential risks not apparent to staff. Patient complaints can drive systemic changes that enhance patient safety. However, manual categorization and analysis pose a substantial logistical challenge, limiting the ability to harness the potential of these data.
Objective: This study aims to evaluate the accuracy of artificial intelligence (AI)-powered categorization of patient complaints in primary care based on the Healthcare Complaint Analysis Tool (HCAT) General Practice (GP) taxonomy and assess the importance of advanced large language models (LLMs) in complaint categorization.
Methods: This cross-sectional study analyzed 1816 anonymous patient complaints from 7 public primary care clinics in Singapore. Complaints were first coded by trained human coders using the HCAT (GP) taxonomy through a rigorous process involving independent assessment and consensus discussions. LLMs (GPT-3.5 turbo, GPT-4o mini, and Claude 3.5 Sonnet) were used to validate manual classification. Claude 3.5 Sonnet was further used to identify complaint themes. LLM classifications were assessed for accuracy and consistency with human coding using accuracy and F1-score. Cohen κ and McNemar test evaluated AI-human agreement and compared AI models' concordance, respectively.
Results: The majority of complaints fell under the HCAT (GP) domain of management (1079/1816, 59.4%), specifically relating to institutional processes (830/1816, 45.7%). Most complaints were of medium severity (994/1816, 54.7%), occurred within the practice (627/1816, 34.5%), and resulted in minimal harm (75.4%). LLMs achieved moderate to good accuracy (58.4%-95.5%) in HCAT (GP) field classifications, with GPT-4o mini generally outperforming GPT-3.5 turbo, except in severity classification. All 3 LLMs demonstrated moderate concordance rates (average 61.9%-68.8%) in complaint classification, with varying levels of agreement (κ=0.114-0.623). GPT-4o mini and Claude 3.5 Sonnet significantly outperformed GPT-3.5 turbo in several fields (P<.05), such as domain and stage of care classification. Thematic analysis using Claude 3.5 Sonnet identified long wait times (393/1816, 21.6%), staff attitudes (287/1816, 15.8%), and appointment booking issues (191/1816, 10.5%) as the top concerns, which together accounted for nearly half of all complaints.
Conclusions: Our study highlighted the potential of LLMs in classifying patient complaints in primary care using the HCAT (GP) taxonomy. While GPT-4o mini and Claude 3.5 Sonnet demonstrated promising results, further fine-tuning and model training are required to improve accuracy. Integrating AI into complaint analysis can facilitate proactive identification of systemic issues, ultimately enhancing quality improvement and patient safety. By leveraging LLMs, health care organizations can prioritize complaints and escalate high-risk issues more effectively. Theoretically, this could lead to improved patient care and experience; further research is needed to confirm this potential benefit.
Keywords: artificial intelligence; family medicine; health services; large language models; patient complaints; primary care.
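The agreement statistics named in the Methods (accuracy, F1-score, Cohen κ, and the McNemar test) can be sketched in plain Python. The label values shown are hypothetical illustrations, not the study's data; variable names such as `truth` and `pred` are assumptions for the sketch, and the study itself may have used standard statistical packages rather than hand-rolled implementations.

```python
from collections import Counter
from math import comb

def accuracy(truth, pred):
    """Fraction of complaints where the model label matches the human label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def macro_f1(truth, pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    scores = []
    for lab in set(truth) | set(pred):
        tp = sum(t == lab and p == lab for t, p in zip(truth, pred))
        fp = sum(t != lab and p == lab for t, p in zip(truth, pred))
        fn = sum(t == lab and p != lab for t, p in zip(truth, pred))
        if tp == 0:
            scores.append(0.0)
        else:
            prec, rec = tp / (tp + fp), tp / (tp + fn)
            scores.append(2 * prec * rec / (prec + rec))
    return sum(scores) / len(scores)

def cohen_kappa(truth, pred):
    """Chance-corrected agreement between human and model labels."""
    n = len(truth)
    po = accuracy(truth, pred)                            # observed agreement
    ct, cp = Counter(truth), Counter(pred)
    pe = sum(ct[k] * cp.get(k, 0) for k in ct) / n ** 2   # agreement expected by chance
    return (po - pe) / (1 - pe)

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p value from the discordant pairs:
    b = items model A classified correctly and model B did not, c = the reverse."""
    n, k = b + c, min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # binomial tail, p = 0.5
    return min(p, 1.0)

# Hypothetical HCAT (GP) domain labels for four complaints:
human = ["management", "clinical", "management", "relationship"]
model = ["management", "management", "management", "relationship"]
print(accuracy(human, model))      # 0.75
print(round(macro_f1(human, model), 3))
print(round(cohen_kappa(human, model), 3))
print(mcnemar_exact(2, 8))
```

The McNemar test operates only on the discordant pairs (complaints where exactly one of the two models agrees with the human label), which is why it takes the counts `b` and `c` rather than the full label lists.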
© Sky Wei Chee Koh, Eunice Rui Ning Wong, John Chong Min Tan, Stephanie C C van der Lubbe, Jun Cong Goh, Ethan Sheng Yong Ching, Ian Wen Yih Chia, Si Hui Low, Ping Young Ang, Queenie Quek, Mehul Motani, Jose M Valderas. Originally published in the Journal of Medical Internet Research (https://www.jmir.org).