Consistent performance of large language models in rare disease diagnosis across ten languages and 4917 cases
- PMID: 41092581
- PMCID: PMC12552141
- DOI: 10.1016/j.ebiom.2025.105957
Abstract
Background: Large language models (LLMs) are increasingly used in medicine for diverse applications, including differential diagnostic support. The training data used to create LLMs such as the Generative Pretrained Transformer (GPT) consist predominantly of English-language texts, but LLMs could be used across the globe to support diagnostics if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models across a variety of European and non-European languages on a comprehensive corpus of challenging rare-disease cases has been lacking.
Methods: We created 4917 clinical vignettes using structured data captured with Human Phenotype Ontology (HPO) terms with the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These clinical vignettes span a total of 360 distinct genetic diseases with 2525 associated phenotypic features. We used translations of the Human Phenotype Ontology together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, French, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o, version gpt-4o-2024-08-06, and the medically fine-tuned Meditron3-70B to the task of delivering a ranked differential diagnosis using a zero-shot prompt. An ontology-based approach with the Mondo disease ontology was used to map synonyms and to map disease subtypes to clinical diagnoses in order to automate evaluation of LLM responses.
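The prompt-generation step described above can be pictured with a short sketch. The following is a minimal illustration, not the authors' code: it renders a zero-shot prompt for one vignette from HPO term IDs using a language-specific template and translated HPO labels. The label table, template wording, and function names here are hypothetical placeholders; the study used the official HPO translations and its own templates.

```python
# Minimal sketch (hypothetical, not the study's pipeline): build a zero-shot
# diagnostic prompt from HPO term IDs using a language-specific template.

# Hypothetical translated HPO labels; real work would draw on the official
# HPO translations for each target language.
HPO_LABELS = {
    "en": {"HP:0001250": "Seizure",
           "HP:0001263": "Global developmental delay"},
    "de": {"HP:0001250": "Krampfanfall",                      # German
           "HP:0001263": "Globale Entwicklungsverzögerung"},  # German
}

# Hypothetical language-specific prompt templates.
TEMPLATES = {
    "en": ("The patient presents with the following features: {features}. "
           "Provide a ranked differential diagnosis."),
    "de": ("Der Patient zeigt folgende Merkmale: {features}. "
           "Geben Sie eine geordnete Differentialdiagnose an."),
}

def build_prompt(hpo_ids: list[str], lang: str) -> str:
    """Render a zero-shot prompt for one clinical vignette in the target language."""
    labels = [HPO_LABELS[lang][hpo_id] for hpo_id in hpo_ids]
    return TEMPLATES[lang].format(features=", ".join(labels))

print(build_prompt(["HP:0001250", "HP:0001263"], "de"))
```

The same vignette can thus be rendered in any supported language by swapping only the label table and template, which is what allows a like-for-like comparison of model performance across languages.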
Findings: For English, GPT-4o placed the correct diagnosis at the first rank in 19.9% of cases and within the top three ranks in 27.0% of cases. For the nine non-English languages tested here, the correct diagnosis was placed at rank 1 in 16.9% to 20.6% of cases and within the top three ranks in 25.4% to 28.6% of cases. The Meditron3 model placed the correct diagnosis within the first three ranks in 20.9% of cases in English and in 19.9% to 24.0% of cases for the other nine languages.
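These figures correspond to standard top-k accuracy over the ranked differentials. A minimal sketch of that metric follows; note that the study first normalised model answers via Mondo-based synonym and subtype mapping, which this sketch omits by assuming already-normalised disease names. All names below are illustrative.

```python
# Minimal sketch of top-k accuracy: the fraction of cases whose correct
# diagnosis appears within the first k ranks of the model's differential.

def top_k_accuracy(ranked_lists: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of cases whose true diagnosis is within the first k ranks."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_lists, truths))
    return hits / len(truths)

# Two illustrative cases with ranked model outputs and ground-truth diagnoses.
ranked = [["Marfan syndrome", "Loeys-Dietz syndrome", "Ehlers-Danlos syndrome"],
          ["Noonan syndrome", "Costello syndrome", "CFC syndrome"]]
truth = ["Marfan syndrome", "Costello syndrome"]

print(top_k_accuracy(ranked, truth, 1))  # rank-1 accuracy: 0.5
print(top_k_accuracy(ranked, truth, 3))  # top-3 accuracy: 1.0
```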
Interpretation: The differential diagnostic performance of LLMs across a comprehensive corpus of rare-disease cases was largely consistent across the ten languages tested. This suggests that the utility of LLMs for diagnostic support may extend to non-English clinical settings.
Funding: NHGRI 5U24HG011449, 5RM1HG010860, R01HD103805 and R24OD011883. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPREC-ISCIII, Fondos FEDER). C.M., J.R. and J.H.C. were supported in part by the Director, Office of Science, Office of Basic Energy Sciences, of the US Department of Energy (Contract No. DE-AC02-05CH11231).
Keywords: Artificial intelligence; Genomic diagnostics; Global Alliance for Genomics and Health; Human phenotype ontology; Large language model; Phenopacket schema.
Copyright © 2025 The Author(s). Published by Elsevier B.V. All rights reserved.
Conflict of interest statement
Declaration of interests: M.H. is a co-founder of Alamya Health. D.S. received a payment from Sanofi for a presentation in a continuing medical education course about AI and rare diseases.
Update of
- Consistent Performance of GPT-4o in Rare Disease Diagnosis Across Nine Languages and 4967 Cases. medRxiv [Preprint]. 2025 Feb 28:2025.02.26.25322769. doi: 10.1101/2025.02.26.25322769. PMID: 40061308.