Large Language Models for Psychiatric Phenotype Extraction from Electronic Health Records

Clara Frydman-Gani et al.

medRxiv [Preprint]. 2025 Aug 12:2025.08.07.25333172. doi: 10.1101/2025.08.07.25333172.

Abstract

The accurate detection of clinical phenotypes from electronic health records (EHRs) is pivotal for advancing large-scale genetic and longitudinal studies in psychiatry. Free-text clinical notes are an essential source of symptom-level information, particularly in psychiatry. However, the automated extraction of symptoms from clinical text remains challenging. Here, we tested 11 open-source generative large language models (LLMs) for their ability to detect 109 psychiatric phenotypes from clinical text, using annotated EHR notes from a psychiatric clinic in Colombia. The LLMs were evaluated both "out-of-the-box" and after fine-tuning, and compared against a traditional natural language processing (tNLP) method developed from the same data. We show that while base LLM performance was poor to moderate (0.2-0.6 macro-F1 for zero-shot; 0.2-0.74 macro-F1 for few-shot), it improved significantly after fine-tuning (0.75-0.86 macro-F1), with several fine-tuned LLMs outperforming the tNLP method. In total, 100 phenotypes could be reliably detected (F1>0.8) using either a fine-tuned LLM or tNLP. To generate a fine-tuned LLM that can be shared with the scientific and medical community, we created a fully synthetic dataset free of patient information but based on original annotations. We fine-tuned a top-performing LLM on this data, creating "Mistral-small-psych", an LLM that can detect psychiatric phenotypes from Spanish text with performance comparable to that of LLMs trained on real EHR data (macro-F1=0.79). Finally, the fine-tuned LLMs underwent external validation using data from a large psychiatric hospital in Colombia, the Hospital Mental de Antioquia, showing that most LLMs generalized well (0.02-0.16 point loss in macro-F1). Our study underscores the value of domain-specific adaptation of LLMs and introduces a new model for accurate psychiatric phenotyping in Spanish text, paving the way for global precision psychiatry.
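
The macro-F1 scores reported above average per-phenotype F1 over all evaluated phenotypes. As a rough illustration only (not the authors' evaluation code), a minimal Python sketch assuming per-document binary gold and predicted labels for each phenotype, using scikit-learn's f1_score:

    # Minimal sketch of per-phenotype macro-F1 (illustrative; not the authors' code).
    # Assumes gold/pred map each phenotype name to per-document 0/1 labels;
    # the phenotype names and label values below are placeholders.
    from sklearn.metrics import f1_score

    def macro_f1(gold: dict, pred: dict) -> float:
        per_phenotype = [f1_score(gold[p], pred[p], zero_division=0) for p in gold]
        return sum(per_phenotype) / len(per_phenotype)

    gold = {"delusions": [1, 0, 1, 1], "insomnia": [0, 0, 1, 0]}
    pred = {"delusions": [1, 0, 0, 1], "insomnia": [0, 1, 1, 0]}
    print(round(macro_f1(gold, pred), 2))  # mean of the per-phenotype F1 scores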

Conflict of interest statement

Competing Interests: All authors declare no financial or non-financial competing interests.

Figures

Figure 1: Overview of extracted phenotypes and study design.
(a) Examples of psychiatric phenotypes extracted from EHRs in this study (the full list can be found in Supplementary Table 1). (b) LLM application overview: LLMs were evaluated on the detection of 136 psychiatric phenotypes from EHR documents in zero- and few-shot settings (with the latter including up to five annotated examples of the phenotype of interest in the prompt); additionally, LLMs were fine-tuned using a training set of 1,477 annotated EHR documents.
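
To make the zero- and few-shot settings concrete, here is a hypothetical prompt-construction sketch in Python; the wording, answer format, and phenotype names are illustrative assumptions, not the study's actual prompts:

    # Hypothetical few-shot prompt builder (assumed illustration, not the study's prompt).
    # `examples` holds (sentence, label) pairs; an empty list gives the zero-shot setting.
    def build_prompt(phenotype, examples, note):
        parts = [f"Does this clinical note mention the phenotype '{phenotype}'? Answer yes or no."]
        for text, label in examples[:5]:   # few-shot: up to five annotated examples
            parts.append(f"Note: {text}\nAnswer: {label}")
        parts.append(f"Note: {note}\nAnswer:")
        return "\n\n".join(parts)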
Figure 2: Mean performance (macro-F1) for all models.
Mean LLM performance (macro-F1) on the CSJDM test set (n=358 documents), in zero-shot (light bars, left) and few-shot (medium bars, middle) settings, as well as after fine-tuning (dark bars, right), for all testable phenotypes (n=109 phenotypes). Error bars represent one standard deviation from the mean F1 score.
Figure 3: Per-phenotype performance of all fine-tuned LLMs against frequency in the training set.
Per-phenotype F1 for all LLMs fine-tuned on CSJDM data (colored lines); phenotypes are ordered by their frequency in the training data (bottom: gray bars showing the relative frequency of each phenotype in the CSJDM training set, n=1,477 documents). Highlighted phenotypes include examples of phenotype-specific low performance (“other addictions”, “poor response to psychotropic medications”, “adverse effects”), model-specific low performance (“poor adherence”), and high performance across models despite low frequency in the training data (“hypothymia”).
Figure 4: Overview of Mistral-small-psych’s creation and performance.
(a) Mistral-small-psych creation: Llama3-70B was prompted to generate synthetic clinical sentences containing published phenotype annotation examples extracted from real annotated EHR data. These synthetic sentences, which contained no patient information, were sampled and assembled into documents, labeled using the tNLP system, and used as training data to fine-tune Mistral-small, creating Mistral-small-psych. (b) Phenotype-level performance (F1) of Mistral-small-psych (red) vs. all 11 LLMs evaluated in zero-shot (blue) and few-shot (purple) settings, with phenotypes ordered by Mistral-small-psych F1 score. (c) Performance and generalizability of all fine-tuned LLMs: in-domain (CSJDM, solid bars) vs. out-of-domain (HOMO, striped bars) macro-F1 scores across all fine-tuned models. Mistral-small-psych (right, red) was fine-tuned on synthetic data. For comparison, results are presented for phenotypes with three or more instances in both the CSJDM and HOMO test sets (n=97 phenotypes). Error bars represent one standard deviation from the mean F1 score.
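
Panel (a) amounts to a short pipeline: generate synthetic sentences from published annotation examples, assemble them into documents, label the documents with the tNLP system, and fine-tune on the resulting pairs. A schematic Python sketch follows, with placeholder functions standing in for the Llama3-70B generation step and the tNLP labeller; the actual generation prompts, sampling scheme, and labeller are not shown in the source:

    # Schematic of the Figure 4a synthetic-data pipeline (assumptions throughout;
    # `generate` stands in for prompting Llama3-70B, `tnlp_label` for the tNLP system).
    import random

    def build_synthetic_training_set(annotation_examples, n_docs, sents_per_doc,
                                     generate, tnlp_label):
        sentences = [generate(ex) for ex in annotation_examples]      # synthetic sentences, no patient data
        docs = ["\n".join(random.sample(sentences, sents_per_doc))    # sample sentences into documents
                for _ in range(n_docs)]
        return [(doc, tnlp_label(doc)) for doc in docs]               # (text, labels) pairs for fine-tuning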

