Lancet Digit Health. 2025 Jan;7(1):e35-e43. doi: 10.1016/S2589-7500(24)00246-2.

The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study


Maria Clara Saad Menezes et al. Lancet Digit Health. 2025 Jan.

Abstract

Background: Patient notes contain substantial information but are difficult for computers to analyse due to their unstructured format. Large language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4), have changed our ability to process text, but we do not know how effectively they handle medical notes. We aimed to assess the ability of GPT-4 to answer predefined questions after reading medical notes in three different languages.

Methods: For this retrospective model-evaluation study, we included eight university hospitals from four countries (ie, the USA, Colombia, Singapore, and Italy). Each site submitted seven de-identified medical notes related to seven separate patients to the coordinating centre between June 1, 2023, and Feb 28, 2024. Medical notes were written between Feb 1, 2020, and June 1, 2023. One site provided medical notes in Spanish, one site provided notes in Italian, and the remaining six sites provided notes in English. We included admission notes, progress notes, and consultation notes. No discharge summaries were included in this study. We advised participating sites to choose medical notes that, at time of hospital admission, were for patients who were male or female, aged 18-65 years, had a diagnosis of obesity, had a diagnosis of COVID-19, and had submitted an admission note. Adherence to these criteria was optional and participating sites randomly chose which medical notes to submit. When entering information into GPT-4, we prepended each medical note with an instruction prompt and a list of 14 questions that had been chosen a priori. Each medical note was individually given to GPT-4 in its original language and in separate sessions; the questions were always given in English. At each site, two physicians independently validated responses by GPT-4 and responded to all 14 questions. Each pair of physicians evaluated responses from GPT-4 to the seven medical notes from their own site only. Physicians were not masked to responses from GPT-4 before providing their own answers, but were masked to responses from the other physician.
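The prompting setup described above (an instruction prompt prepended to each medical note, followed by the 14 English questions) can be sketched as follows. This is a minimal illustration, not the study's code: the instruction wording and the question list here are placeholder assumptions (the verbatim prompts appear in figure 1), and submission to GPT-4 is omitted.

```python
# Sketch of the prompt-assembly step from the Methods.
# The instruction text and questions below are placeholders, not the
# study's verbatim prompts (those are shown in figure 1 of the paper).

QUESTIONS = [
    "Is the patient male or female?",
    "Is the patient aged 18-65 years?",
    "Does the patient have a diagnosis of obesity?",
    "Does the patient have a diagnosis of COVID-19?",
    # ...the study used 14 predefined questions in total
]

def build_prompt(medical_note: str) -> str:
    """Prepend an instruction prompt and the English question list to a
    single medical note, which may be in any language."""
    instruction = (
        "Read the following medical note and answer each question. "
        "Answer in English.\n\n"
    )
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(QUESTIONS, 1))
    return f"{instruction}Questions:\n{numbered}\n\nMedical note:\n{medical_note}"

# Each assembled prompt was then given to GPT-4 in a separate session
# (API submission omitted here).
prompt = build_prompt("Paciente de 45 años con obesidad y COVID-19 ...")
```

Note that only the note itself changes language between sessions; the instruction and questions stay in English, matching the study design.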

Findings: We collected 56 medical notes, of which 42 (75%) were in English, seven (13%) were in Italian, and seven (13%) were in Spanish. For each medical note, GPT-4 responded to 14 questions, resulting in 784 responses. In 622 (79%, 95% CI 76-82) of 784 responses, both physicians agreed with GPT-4. In 82 (11%, 8-13) responses, only one physician agreed with GPT-4. In the remaining 80 (10%, 8-13) responses, neither physician agreed with GPT-4. Both physicians agreed with GPT-4 more often for medical notes written in Spanish (86 [88%, 95% CI 79-93] of 98 responses) and Italian (82 [84%, 75-90] of 98 responses) than in English (454 [77%, 74-80] of 588 responses).
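The 95% CIs reported above are consistent with Wilson score intervals for a binomial proportion; the abstract does not state the CI method, so Wilson is an assumption here. A quick check for the overall agreement (622 of 784 responses):

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom, (centre + margin) / denom

# Both physicians agreed with GPT-4 in 622 of 784 responses
lo, hi = wilson_ci(622, 784)
print(f"{622/784:.0%} (95% CI {lo:.0%}-{hi:.0%})")  # prints "79% (95% CI 76%-82%)"
```

The computed interval (76-82%) matches the one reported in the Findings, supporting the Wilson assumption.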

Interpretation: The results of our model-evaluation study suggest that GPT-4 is accurate when analysing medical notes in three different languages. In the future, research should explore how LLMs can be integrated into clinical workflows to maximise their use in health care.

Funding: None.


Conflict of interest statement

Declaration of interests DAH receives a grant from the US National Institutes of Health (NIH) National Center for Advancing Translational Sciences (UM1TR004404). DRM receives support from the National Center for Advancing Translational Sciences of the US NIH (UL1TR002366). EP and RB receive a grant from the EU Horizon 2020 Project PERISCOPE (101016233). GSO receives grants from the US NIH (U24CA271037 and P30ES017885). VLM receives financial support, paid to his institution, from Siemens Healthineers and the Melvyn Rubenfire Professorship in Preventive Cardiology and receives grants from the US NIH (R01AG059729, R01HL136685, and U01DK123013) and the American Heart Association Strategically Focused Research Network (20SFRN35120123). ZX receives grants from the National Institute of Neurological Disorders and Stroke of the US NIH (R01NS098023 and R01NS124882). All other authors declare no competing interests.

Figures

Figure 1: Study design
(A) Study processes. (B) Instruction prompts given to GPT-4 with 14 questions chosen by our study group before submission of medical notes. In the prompt, the term medical_note was replaced by the actual medical note. Please note that questions and prompts are verbatim questions and prompts asked to GPT-4, and have not been edited. GPT-4=Generative Pre-trained Transformer 4. USPSTF=US Preventive Services Task Force.
Figure 2: Clinical validation of responses from GPT-4
(A) Agreement of two separate physicians with GPT-4. (B) Agreement with responses by site. (C) Agreement with responses by language. (D) Agreement with results by question. The full list of questions is provided in figure 1. Both agreed refers to when both physicians answered Yes to the question “Do you agree with GPT-4’s answer?” after reading its response. One agreed refers to when one physician agreed with the response from GPT-4, but the other did not. Neither agreed refers to when both physicians did not agree with the response from GPT-4. GPT-4=Generative Pre-trained Transformer 4. *p<0·0001. †p=0·0073.
Figure 3: Agreement between physicians and GPT-4
Nodes do not sum to 100% due to rounding. Both agreed refers to when both physicians answered Yes to the question “Do you agree with GPT-4’s answer?” after reading its response. One agreed refers to when one physician agreed with the response from GPT-4, but the other did not. Neither agreed refers to when both physicians did not agree with the response from GPT-4. GPT-4=Generative Pre-trained Transformer 4.
Figure 4: Ability of GPT-4 to select patients for hypothetical study enrolment
(A) Inclusion criteria. We only included questions for which both physicians agreed or disagreed with GPT-4 in this analysis. Please note that questions are verbatim questions asked to GPT-4 and to physicians, and have not been edited. (B) Sensitivity and specificity, by inclusion criteria. Denominators for each criterion are numerators for that criterion in the column showing agreement between physicians in panel A. For example, we considered 51 notes in the analysis for age, 34 of which were within the age range of the study and were used to calculate sensitivity and 17 of which were not within the age range of the study and were used to calculate specificity. (C) Proportion of times GPT-4 was able to correctly provide all four inclusion criteria in a single patient. (D) Proportion of times GPT-4 was able to correctly provide all three inclusion criteria (excluding admission note) in a single patient. GPT-4=Generative Pre-trained Transformer 4. NA=not applicable.
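The sensitivity and specificity construction described for panel B can be made concrete with the age criterion: of the 51 notes where both physicians agreed with each other, 34 patients were within the age range (the sensitivity denominator) and 17 were not (the specificity denominator). The correct-classification counts in the sketch below are hypothetical, since the abstract does not report them.

```python
# Illustration of the sensitivity/specificity construction in figure 4B.
# Denominators are taken from the age criterion as described in the caption;
# the "correct" counts are hypothetical, for illustration only.

def sensitivity(true_pos: int, false_neg: int) -> float:
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    return true_neg / (true_neg + false_pos)

in_range, out_of_range = 34, 17          # denominators from figure 4B (age)
correct_in, correct_out = 33, 16         # hypothetical correct GPT-4 answers

sens = sensitivity(correct_in, in_range - correct_in)
spec = specificity(correct_out, out_of_range - correct_out)
```

The same construction applies to each of the other inclusion criteria, with the agreement counts from panel A supplying the denominators.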


