This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2025 Feb 28:2025.02.26.25322769.

doi: 10.1101/2025.02.26.25322769.

Consistent Performance of GPT-4o in Rare Disease Diagnosis Across Nine Languages and 4967 Cases

Leonardo Chimirri¹, J Harry Caufield², Yasemin Bridges³, Nicolas Matentzoglu⁴, Michael Gargano⁵, Mario Cazalla⁶, Shihan Chen⁷, Daniel Danis¹, Alexander JM Dingemans⁸, Petra Gehle⁹, Adam S L Graefe¹, Weihong Gu¹⁰, Markus S Ladewig¹¹, Pablo Lapunzina⁶, Julián Nevado⁶, Enock Niyonkuru²,¹², Soichi Ogishima¹³, Dominik Seelow¹, Jair A Tenorio Castaño⁶, Marek Turnovec¹⁴, Bert BA de Vries⁸, Kai Wang⁷, Kyran Wissink¹,¹⁵, Zafer Yüksel¹⁶, Gabriele Zucca¹⁷, Melissa A Haendel¹⁸, Christopher J Mungall², Justin Reese², Peter N Robinson¹,⁵

Affiliations

¹ Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
² Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
³ William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK.
⁴ Semanticly, Athens, Greece.
⁵ The Jackson Laboratory for Genomic Medicine.
⁶ INGEMM-Idipaz, Institute of Medical and Molecular Genetics, Hospital Universitario La Paz, Madrid, Spain.
⁷ Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA.
⁸ Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, the Netherlands.
⁹ Deutsches Herzzentrum der Charité, Berlin, Germany.
¹⁰ Chinese HPO Consortium, Beijing, China.
¹¹ Department of Ophthalmology, University Clinic Marburg - Campus Fulda, Fulda, Germany CIBERER-ISCIII, Madrid, Spain.
¹² Trinity College, Hartford, CT, USA.
¹³ INGEM/ToMMo, Tohoku University, Miyagi, Japan.
¹⁴ Department of Biology and Medical Genetics, 2nd Faculty of Medicine, Charles University in Prague and Motol University Hospital, Prague, Czech Republic.
¹⁵ Utrecht University, Utrecht, Netherlands.
¹⁶ Department of Human Genetics, Bioscientia Healthcare GmbH, Ingelheim, Germany.
¹⁷ Institute for Maternal and Child Health - IRCCS "Burlo Garofolo" - Trieste, Trieste 34137, Italy.
¹⁸ University of North Carolina at Chapel Hill.

PMID: 40061308
PMCID: PMC11888497
DOI: 10.1101/2025.02.26.25322769

Leonardo Chimirri et al. medRxiv. 2025.

Abstract

Background: Large language models (LLMs) are increasingly used in medicine for diverse applications, including differential diagnostic support. The training data for LLMs such as the Generative Pretrained Transformer (GPT) are estimated to consist predominantly of English-language text, but LLMs could support diagnostics across the globe if language barriers could be overcome. Initial pilot studies of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models across a variety of European and non-European languages, on a comprehensive corpus of challenging rare-disease cases, is lacking.

Methods: We created 4967 clinical vignettes from structured data captured with Human Phenotype Ontology (HPO) terms in the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These clinical vignettes span a total of 378 distinct genetic diseases with 2618 associated phenotypic features. We used translations of the HPO together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o, version gpt-4o-2024-08-06, to the task of delivering a ranked differential diagnosis using a zero-shot prompt. To automate evaluation of the LLM responses, we used an ontology-based approach with the Mondo disease ontology to map synonyms and to map disease subtypes to clinical diagnoses.
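
The templated prompt generation described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Case` class, the `build_prompt` function, the template wording, and the two-language label tables are all hypothetical, and the German labels are illustrative rather than taken from the official HPO translations.

```python
from dataclasses import dataclass

# Hypothetical language-specific templates (the wording is an assumption).
TEMPLATES = {
    "en": ("Age: {age} years\nSex: {sex}\nAge at onset: {onset} years\n"
           "Phenotypic features: {features}\n"
           "Please provide a ranked list of differential diagnoses."),
    "de": ("Alter: {age} Jahre\nGeschlecht: {sex}\nAlter bei Krankheitsbeginn: {onset} Jahre\n"
           "Phänotypische Merkmale: {features}\n"
           "Bitte erstellen Sie eine nach Wahrscheinlichkeit geordnete Liste von Differentialdiagnosen."),
}

# Illustrative label tables keyed by HPO term ID; a real run would load
# the complete translated ontology instead of this two-term excerpt.
HPO_LABELS = {
    "HP:0001250": {"en": "Seizure", "de": "Krampfanfall"},
    "HP:0001263": {"en": "Global developmental delay",
                   "de": "Globale Entwicklungsverzögerung"},
}
SEX_LABELS = {
    "FEMALE": {"en": "female", "de": "weiblich"},
    "MALE": {"en": "male", "de": "männlich"},
}


@dataclass
class Case:
    """Simplified stand-in for the structured data held in a phenopacket."""
    age_years: int
    sex: str
    onset_years: int
    hpo_ids: list[str]


def build_prompt(case: Case, lang: str) -> str:
    """Fill the template for `lang` with translated labels for one case."""
    features = "; ".join(HPO_LABELS[hpo_id][lang] for hpo_id in case.hpo_ids)
    return TEMPLATES[lang].format(
        age=case.age_years,
        sex=SEX_LABELS[case.sex][lang],
        onset=case.onset_years,
        features=features,
    )


example = Case(age_years=4, sex="FEMALE", onset_years=1,
               hpo_ids=["HP:0001250", "HP:0001263"])
prompt_en = build_prompt(example, "en")
prompt_de = build_prompt(example, "de")
```

Because the case is stored as term IDs rather than free text, the same structured record yields a prompt in every language for which a template and label table exist, which is what makes a like-for-like comparison across languages possible.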

Findings: For English, GPT-4o placed the correct diagnosis at rank 1 in 19·8% of cases and within the top 3 ranks in 27·0% of cases. For the eight non-English languages tested, the correct diagnosis was placed at rank 1 in between 16·9% and 20·5% of cases and within the top 3 ranks in between 25·3% and 27·7% of cases.
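
The rank-based metrics reported above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' evaluation code: the `CANONICAL` table is a toy stand-in for the Mondo-based mapping of synonyms and subtypes, its identifiers (`EX:1`, `EX:2`) are placeholders rather than real Mondo IDs, and the function names are hypothetical.

```python
from typing import Optional

# Toy stand-in for ontology-based grounding: synonyms and subtypes map to
# one canonical identifier. The "EX:" identifiers are placeholders only.
CANONICAL = {
    "marfan syndrome": "EX:1",
    "mfs": "EX:1",                      # synonym
    "loeys-dietz syndrome": "EX:2",
    "loeys-dietz syndrome 2": "EX:2",   # subtype mapped to its parent diagnosis
}


def rank_of_correct(ranked_names: list[str], truth_id: str) -> Optional[int]:
    """Return the 1-based rank of the first matching diagnosis, or None."""
    for rank, name in enumerate(ranked_names, start=1):
        if CANONICAL.get(name.strip().lower()) == truth_id:
            return rank
    return None


def top_k_rates(ranks: list[Optional[int]], ks=(1, 3, 10)) -> dict[int, float]:
    """Fraction of cases whose correct diagnosis appears within the top k."""
    n = len(ranks)
    return {k: sum(1 for r in ranks if r is not None and r <= k) / n for k in ks}


# Three hypothetical model responses, each paired with the true diagnosis.
responses = [
    (["Marfan syndrome", "Loeys-Dietz syndrome"], "EX:1"),   # correct at rank 1
    (["Ehlers-Danlos syndrome", "MFS"], "EX:1"),             # synonym at rank 2
    (["Noonan syndrome"], "EX:1"),                           # not found
]
ranks = [rank_of_correct(names, truth) for names, truth in responses]
rates = top_k_rates(ranks)
```

In this toy example the Top-1 rate is 1/3 and the Top-3 and Top-10 rates are both 2/3. Matching on canonical identifiers rather than raw strings is what allows a synonym or a subtype name to count as a correct answer.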

Interpretation: The differential diagnostic performance of GPT-4o across a comprehensive corpus of rare-disease cases was consistent across the nine languages tested. This suggests that LLMs such as GPT-4o may have utility in non-English clinical settings.

Funding: NHGRI 5U24HG011449 and 5RM1HG010860. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPRECISC-III, Fondos FEDER).

Conflict of interest statement

Declaration of interests: MH is a co-founder of Alamya Health.

Figures

**Figure 1. Templated system for generating prompts using translation of the HPO into 8 languages.**
An excerpt of one prompt is shown. Shading indicates Age, Sex, Onset, HPO Phenotypes.

**Figure 2. Differential diagnostic performance of GPT-4o in English, Chinese, Czech, Dutch, German, Italian, Japanese, Spanish, and Turkish.**
The percentage of cases in which GPT-4o placed the correct diagnosis at rank 1 (Top-1), within the top three ranks (Top-3), or within the top ten ranks (Top-10) is shown.
