EBioMedicine. 2025 Nov;121:105957.
doi: 10.1016/j.ebiom.2025.105957. Epub 2025 Oct 14.

Consistent performance of large language models in rare disease diagnosis across ten languages and 4917 cases

Leonardo Chimirri et al. EBioMedicine. 2025 Nov.

Abstract

Background: Large language models (LLMs) are increasingly used in medicine for diverse applications, including differential diagnostic support. The training data used to create LLMs such as the Generative Pretrained Transformer (GPT) consist predominantly of English-language texts, but LLMs could support diagnostics across the globe if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models in a variety of European and non-European languages on a comprehensive corpus of challenging rare-disease cases is lacking.

Methods: We created 4917 clinical vignettes using structured data captured with Human Phenotype Ontology (HPO) terms with the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These clinical vignettes span a total of 360 distinct genetic diseases with 2525 associated phenotypic features. We used translations of the Human Phenotype Ontology together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, French, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o, version gpt-4o-2024-08-06, and the medically fine-tuned Meditron3-70B to the task of delivering a ranked differential diagnosis using a zero-shot prompt. An ontology-based approach with the Mondo disease ontology was used to map synonyms and to map disease subtypes to clinical diagnoses in order to automate evaluation of LLM responses.
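The templated prompt generation described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the template wording, the `build_prompt` helper, and the small label tables are assumptions made for illustration, and the German labels are illustrative placeholders rather than entries taken from the official HPO translation files.

```python
from string import Template

# Illustrative label tables (assumption): HPO term ID -> label, per language.
# A real pipeline would load these from the HPO translation files.
HPO_LABELS = {
    "en": {"HP:0001250": "Seizure", "HP:0000252": "Microcephaly"},
    "de": {"HP:0001250": "Krampfanfall", "HP:0000252": "Mikrozephalie"},
}

# Illustrative wording for the sex field in each language (assumption).
SEX_LABELS = {
    "en": {"female": "female", "male": "male"},
    "de": {"female": "weiblich", "male": "männlich"},
}

# Hypothetical language-specific templates; the study's actual wording is
# not given in the abstract.
TEMPLATES = {
    "en": Template(
        "Age: $age years. Sex: $sex. Phenotypic features: $features. "
        "Please provide a ranked differential diagnosis."
    ),
    "de": Template(
        "Alter: $age Jahre. Geschlecht: $sex. Phänotypische Merkmale: $features. "
        "Bitte erstellen Sie eine nach Rang geordnete Differentialdiagnose."
    ),
}


def build_prompt(lang, age, sex, hpo_ids):
    """Fill the template for `lang` with translated labels for the given HPO IDs."""
    labels = [HPO_LABELS[lang][hpo_id] for hpo_id in hpo_ids]
    return TEMPLATES[lang].substitute(
        age=age, sex=SEX_LABELS[lang][sex], features=", ".join(labels)
    )


case = {"age": 4, "sex": "female", "hpo_ids": ["HP:0001250", "HP:0000252"]}
for lang in ("en", "de"):
    print(build_prompt(lang, **case))
```

Because the case is stored as language-independent HPO identifiers, the same structured record yields a prompt in any language for which labels and a template are available.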

Findings: For English, GPT-4o placed the correct diagnosis at rank 1 in 19.9% of cases and within the top 3 ranks in 27.0% of cases. For the nine non-English languages tested here, the correct diagnosis was placed at rank 1 in between 16.9% and 20.6% of cases and within the top 3 ranks in between 25.4% and 28.6% of cases. The Meditron3 model placed the correct diagnosis within the top 3 ranks in 20.9% of cases in English and in between 19.9% and 24.0% of cases for the other nine languages.
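How such Top-1 and Top-3 percentages can be derived from ranked model output is sketched below, under stated assumptions. The toy synonym table, the `PARENT` subtype map, the `EX:` identifiers, and all function names are hypothetical stand-ins for the Mondo-based mapping mentioned in the Methods; they are not the study's evaluation code and the example ranks are not study data.

```python
def normalise(name):
    """Lower-case and collapse whitespace so label lookups are robust."""
    return " ".join(name.lower().split())


# Toy stand-in for an ontology (assumption): labels and synonyms -> term ID.
SYNONYMS = {
    "marfan syndrome": "EX:1",
    "mfs": "EX:1",
    "ehlers-danlos syndrome": "EX:2",
    "ehlers-danlos syndrome, classic type": "EX:2a",
}

# Toy subtype -> clinical diagnosis links (assumption).
PARENT = {"EX:2a": "EX:2"}


def to_clinical_id(name):
    """Resolve a label to a term ID, then climb from subtype to clinical diagnosis."""
    term = SYNONYMS.get(normalise(name))
    while term in PARENT:
        term = PARENT[term]
    return term


def rank_of_correct(ranked_names, correct_name):
    """Return the 1-based rank of the correct diagnosis, or None if absent."""
    target = to_clinical_id(correct_name)
    if target is None:
        return None
    for rank, name in enumerate(ranked_names, start=1):
        if to_clinical_id(name) == target:
            return rank
    return None


def top_k_rate(ranks, k):
    """Percentage of cases whose correct diagnosis appears within the first k ranks."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return 100.0 * hits / len(ranks)


# Made-up ranks for four cases; None means the diagnosis was not listed.
example_ranks = [1, 3, None, 2]
print(top_k_rate(example_ranks, 1), top_k_rate(example_ranks, 3))
```

Mapping both the model's answers and the reference diagnosis to the same clinical-level term before comparing is what lets a synonym or a subtype count as a correct answer without manual review.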

Interpretation: The differential diagnostic performance of LLMs on a comprehensive corpus of rare-disease cases was largely consistent across the ten languages tested. This suggests that the clinical utility of LLMs may extend to non-English settings.

Funding: NHGRI 5U24HG011449, 5RM1HG010860, R01HD103805 and R24OD011883. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPREC-ISCIII, Fondos FEDER). C.M., J.R. and J.H.C. were supported in part by the Director, Office of Science, Office of Basic Energy Sciences, of the US Department of Energy (Contract No. DE-AC0205CH11231).

Keywords: Artificial intelligence; Genomic diagnostics; Global Alliance for Genomics and Health; Human phenotype ontology; Large language model; Phenopacket schema.

Conflict of interest statement

Declaration of interests: M.H. is a co-founder of Alamya Health. D.S. received a payment from Sanofi for a presentation in a continuing medical education course about AI and rare diseases.

Figures

Fig. 1
Templated system for generating prompts using translations of the HPO into nine languages. An excerpt of one prompt is shown. Words representing age, sex, onset, and phenotypes are colour-coded as indicated.
Fig. 2
Differential diagnostic performance of GPT-4o in English, Chinese, Czech, Dutch, French, German, Italian, Japanese, Spanish, and Turkish. The percentage of cases in which GPT-4o placed the correct diagnosis at rank 1 (Top-1), within the top three ranks (Top-3), or within the first ten ranks (Top-10) is shown.
Fig. 3
Differential diagnostic performance of Meditron3-70B in English, Chinese, Czech, Dutch, French, German, Italian, Japanese, Spanish, and Turkish. The percentage of cases in which Meditron3-70B placed the correct diagnosis at rank 1 (Top-1), within the top three ranks (Top-3), or within the first ten ranks (Top-10) is shown.
