JMIR AI. 2024 Aug 13:3:e54371.

doi: 10.2196/54371.

Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study

Aidin Spina¹, Saman Andalib¹, Daniel Flores^#¹, Rishi Vermani^#¹, Faris F Halaseh¹, Ariana M Nelson¹,²

Affiliations

¹ School of Medicine, University of California, Irvine, Irvine, CA, United States.
² Department of Anesthesiology and Perioperative Care, University of California, Irvine, Irvine, CA, United States.

^# Contributed equally.

PMID: 39137416
PMCID: PMC11350306
DOI: 10.2196/54371

Aidin Spina et al. JMIR AI. 2024.

Abstract

Background: Although uncertainties exist regarding implementation, artificial intelligence-driven generative language models (GLMs) have enormous potential in medicine. Deployment of GLMs could improve patient comprehension of clinical texts and improve low health literacy.

Objective: The goal of this study is to evaluate the potential of ChatGPT-3.5 and GPT-4 to tailor the complexity of medical information to a patient-specific input education level, which is crucial if these models are to serve as tools in addressing low health literacy.

Methods: Input templates related to 2 prevalent chronic diseases (type II diabetes and hypertension) were designed. Each clinical vignette was adjusted for hypothetical patient education levels to evaluate output personalization. To assess the success of each GLM (GPT-3.5 and GPT-4) in tailoring output writing, the readability of pre- and posttransformation outputs was quantified using the Flesch reading ease score (FKRE) and the Flesch-Kincaid grade level (FKGL).
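The two readability metrics named in the Methods can be illustrated with a minimal sketch. The formulas below are the standard published Flesch reading ease and Flesch-Kincaid grade level equations; the sentence splitter and the syllable counter are rough heuristics assumed here for illustration, and the abstract does not state which tool the authors used to compute their scores.

```python
import re


def count_syllables(word: str) -> int:
    """Heuristic syllable count (an assumption, not the authors' method):
    count vowel groups and drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch reading ease, Flesch-Kincaid grade level) for text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Standard Flesch reading ease: higher means easier to read.
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Standard Flesch-Kincaid grade level: approximate US school grade.
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level


if __name__ == "__main__":
    # Hypothetical sample sentences, not taken from the study outputs.
    plain = "The cat sat on the mat. It was a good day."
    ease, grade = readability(plain)
    print(f"reading ease = {ease:.2f}, grade level = {grade:.2f}")
```

Because both scores depend only on words per sentence and syllables per word, shorter sentences and shorter words raise the reading ease score and lower the grade level, which is the direction of change the study looks for when a lower education level is specified.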

Results: Responses (n=80) were generated using GPT-3.5 and GPT-4 across 2 clinical vignettes. For GPT-3.5, FKRE means were 57.75 (SD 4.75), 51.28 (SD 5.14), 32.28 (SD 4.52), and 28.31 (SD 5.22) for 6th grade, 8th grade, high school, and bachelor's, respectively; FKGL mean scores were 9.08 (SD 0.90), 10.27 (SD 1.06), 13.4 (SD 0.80), and 13.74 (SD 1.18). GPT-3.5 only aligned with the prespecified education levels at the bachelor's degree. Conversely, GPT-4's FKRE mean scores were 74.54 (SD 2.6), 71.25 (SD 4.96), 47.61 (SD 6.13), and 13.71 (SD 5.77), with FKGL mean scores of 6.3 (SD 0.73), 6.7 (SD 1.11), 11.09 (SD 1.26), and 17.03 (SD 1.11) for the same respective education levels. GPT-4 met the target readability for all groups except the 6th-grade FKRE average. Both GLMs produced outputs with statistically significant differences (FKRE: 6th grade P<.001; 8th grade P<.001; high school P<.001; bachelor's P=.003; FKGL: 6th grade P=.001; 8th grade P<.001; high school P<.001; bachelor's P<.001) between mean FKRE and FKGL across input education levels.

Conclusions: GLMs can change the structure and readability of medical text outputs according to input-specified education. However, GLMs categorize input education designation into 3 broad tiers of output readability: easy (6th and 8th grade), medium (high school), and difficult (bachelor's degree). This is the first result to suggest that there are broader boundaries in the success of GLMs in output text simplification. Future research must establish how GLMs can reliably personalize medical texts to prespecified education levels to enable a broader impact on health care literacy.

Keywords: AI; GLM; GLMs; LHL; NLP; artificial intelligence; comprehension; education; generative; generative language model; health information; health literacy; knowledge translation; language model; language models; low health literacy; medical information; medical text; medical texts; natural language processing; readability; reading level; reading levels; understandability; understandable.

©Aidin Spina, Saman Andalib, Daniel Flores, Rishi Vermani, Faris F Halaseh, Ariana M Nelson. Originally published in JMIR AI (https://ai.jmir.org), 13.08.2024.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

**Figure 1**
(A) GPT-4 diabetes FKRE, compared with single-factor ANOVA and Tukey post hoc test. (B) GPT-4 diabetes FKGL, compared with single-factor ANOVA and Tukey post hoc test. (C) GPT-3.5 diabetes FKRE, compared with single-factor ANOVA and Tukey post hoc test. (D) GPT-3.5 diabetes FKGL, compared with single-factor ANOVA and Tukey post hoc test. FKRE: Flesch reading ease score; FKGL: Flesch-Kincaid grade level.

**Figure 2**
(A) GPT-4 HTN FKRE, compared with single-factor ANOVA and Tukey post hoc test. (B) GPT-4 HTN FKGL, compared with single-factor ANOVA and Tukey post hoc test. (C) GPT-3.5 HTN FKRE, compared with single-factor ANOVA and Tukey post hoc test. (D) GPT-3.5 HTN FKGL, compared with single-factor ANOVA and Tukey post hoc test. FKRE: Flesch reading ease score; FKGL: Flesch-Kincaid grade level; HTN: hypertension.

**Figure 3**
(A) GPT-4 aggregated FKRE, compared with single-factor ANOVA and Tukey post hoc test. (B) GPT-4 aggregated FKGL, compared with single-factor ANOVA and Tukey post hoc test. (C) GPT-3.5 aggregated FKRE, compared with single-factor ANOVA and Tukey post hoc test. (D) GPT-3.5 aggregated FKGL, compared with single-factor ANOVA and Tukey post hoc test. FKRE: Flesch reading ease score; FKGL: Flesch-Kincaid grade level.

**Figure 4**
(A) Comparison of FKRE between GPT-4 and GPT-3.5 for diabetes outputs at the 6th-grade level, analyzed with an unpaired 2-tailed t test. (B) Comparison of FKGL between GPT-4 and GPT-3.5 for diabetes outputs at the 6th-grade level, analyzed with an unpaired 2-tailed t test. (C) Comparison of FKRE between GPT-4 and GPT-3.5 for diabetes outputs at the 8th-grade level, analyzed with an unpaired 2-tailed t test. (D) Comparison of FKGL between GPT-4 and GPT-3.5 for diabetes outputs at the 8th-grade level, analyzed with an unpaired 2-tailed t test. (E) Comparison of FKRE between GPT-4 and GPT-3.5 for diabetes outputs at the high school level, analyzed with an unpaired 2-tailed t test. (F) Comparison of FKGL between GPT-4 and GPT-3.5 for diabetes outputs at the high school level, analyzed with an unpaired 2-tailed t test. (G) Comparison of FKRE between GPT-4 and GPT-3.5 for diabetes outputs at the bachelor’s level, analyzed with an unpaired 2-tailed t test. (H) Comparison of FKGL between GPT-4 and GPT-3.5 for diabetes outputs at the bachelor’s level, analyzed with an unpaired 2-tailed t test. FKRE: Flesch reading ease score; FKGL: Flesch-Kincaid grade level.

**Figure 5**
(A) Comparison of FKRE between GPT-4 and GPT-3.5 for HTN outputs at the 6th-grade level, analyzed with an unpaired 2-tailed t test. (B) Comparison of FKGL between GPT-4 and GPT-3.5 for HTN outputs at the 6th-grade level, analyzed with an unpaired 2-tailed t test. (C) Comparison of FKRE between GPT-4 and GPT-3.5 for HTN outputs at the 8th-grade level, analyzed with an unpaired 2-tailed t test. (D) Comparison of FKGL between GPT-4 and GPT-3.5 for HTN outputs at the 8th-grade level, analyzed with an unpaired 2-tailed t test. (E) Comparison of FKRE between GPT-4 and GPT-3.5 for HTN outputs at the high school level, analyzed with an unpaired 2-tailed t test. (F) Comparison of FKGL between GPT-4 and GPT-3.5 for HTN outputs at the high school level, analyzed with an unpaired 2-tailed t test. (G) Comparison of FKRE between GPT-4 and GPT-3.5 for HTN outputs at the bachelor’s level, analyzed with an unpaired 2-tailed t test. (H) Comparison of FKGL between GPT-4 and GPT-3.5 for HTN outputs at the bachelor’s level, analyzed with an unpaired 2-tailed t test. FKRE: Flesch reading ease score; FKGL: Flesch-Kincaid grade level; HTN: hypertension.

**Figure 6**
(A) Comparison of FKRE between GPT-4 and GPT-3.5 for aggregated outputs at the 6th-grade level, analyzed with an unpaired 2-tailed t test. (B) Comparison of FKGL between GPT-4 and GPT-3.5 for aggregated outputs at the 6th-grade level, analyzed with an unpaired 2-tailed t test. (C) Comparison of FKRE between GPT-4 and GPT-3.5 for aggregated outputs at the 8th-grade level, analyzed with an unpaired 2-tailed t test. (D) Comparison of FKGL between GPT-4 and GPT-3.5 for aggregated outputs at the 8th-grade level, analyzed with an unpaired 2-tailed t test. (E) Comparison of FKRE between GPT-4 and GPT-3.5 for aggregated outputs at the high school level, analyzed with an unpaired 2-tailed t test. (F) Comparison of FKGL between GPT-4 and GPT-3.5 for aggregated outputs at the high school level, analyzed with an unpaired 2-tailed t test. (G) Comparison of FKRE between GPT-4 and GPT-3.5 for aggregated outputs at the bachelor’s level, analyzed with an unpaired 2-tailed t test. (H) Comparison of FKGL between GPT-4 and GPT-3.5 for aggregated outputs at the bachelor’s level, analyzed with an unpaired 2-tailed t test. FKRE: Flesch reading ease score; FKGL: Flesch-Kincaid grade level.
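The captions above name two tests: a single-factor ANOVA across education levels within one model, and an unpaired 2-tailed t test between models at one education level. The sketch below computes only the test statistics, using the Python standard library. The function names and the sample scores are hypothetical, the pooled-variance (Student) form of the t test is an assumption because the captions do not say whether equal variances were assumed, and the Tukey post hoc step is not shown. Converting these statistics to P values requires the F and t distributions, which a statistics package would normally supply.

```python
from math import sqrt
from statistics import mean, variance


def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more observations than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def pooled_t(a: list[float], b: list[float]) -> tuple[float, int]:
    """Return (t statistic, degrees of freedom) for an unpaired Student t test."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    t_stat = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t_stat, df


if __name__ == "__main__":
    # Made-up scores for illustration only; these are not the study data.
    levels = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
    print(one_way_anova_f(levels))
    print(pooled_t(levels[0], levels[2]))
```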
