Review
Br J Psychiatry. 2024 Dec;225(6):532-537.
doi: 10.1192/bjp.2024.134.

Detection of suicidality from medical text using privacy-preserving large language models

Isabella Catharina Wiest et al. Br J Psychiatry. 2024 Dec.

Abstract

Background: Attempts to use artificial intelligence (AI) in psychiatric disorders show moderate success, highlighting the potential of incorporating information from clinical assessments to improve the models. This study focuses on using large language models (LLMs) to detect suicide risk from medical text in psychiatric care.

Aims: To extract information about suicidality status from the admission notes in electronic health records (EHRs) using privacy-sensitive, locally hosted LLMs, specifically evaluating the efficacy of Llama-2 models.

Method: We compared the performance of several variants of the open-source LLM Llama-2 in extracting suicidality status from 100 psychiatric reports against a ground truth defined by human experts, assessing accuracy, sensitivity, specificity and F1 score across different prompting strategies.
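As an illustration of the four metrics named in the Method, the sketch below computes them from binary labels (1 = suicidality present, 0 = absent). It is a minimal example under stated assumptions, not the authors' evaluation code: the function name `binary_metrics` and the toy labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    f1: float


def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and F1 for binary labels.

    y_true: expert ground truth (1 = suicidality present, 0 = absent)
    y_pred: model predictions in the same encoding
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard each ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true-positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true-negative rate
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return BinaryMetrics(accuracy, sensitivity, specificity, f1)


# Toy labels for illustration only (not data from the study).
toy_truth = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
toy_metrics = binary_metrics(toy_truth, toy_pred)
```

Sensitivity and specificity are reported separately because a missed case of suicidality (false negative) and a false alarm (false positive) carry different clinical costs.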

Results: A German fine-tuned Llama-2 model showed the highest accuracy (87.5%), sensitivity (83.0%) and specificity (91.8%) in identifying suicidality, with significant improvements in sensitivity and specificity across various prompt designs.

Conclusions: The study demonstrates the capability of LLMs, particularly Llama-2, in accurately extracting information on suicidality from psychiatric records while preserving data privacy. This suggests their application in surveillance systems for psychiatric emergencies and in improving the clinical management of suicidality through systematic quality control and research.

Keywords: Large language models; electronic health records; natural language processing; psychiatric disorder detection; suicidality.

Conflict of interest statement

J.N.K. declares consulting services for Owkin, France, DoMore Diagnostics, Norway, Panakeia, UK, Scailyte, Switzerland, Cancilico, Germany, Mindpeak, Germany, MultiplexDx, Slovakia, and Histofy, UK; furthermore, he holds shares in StratifAI GmbH, Germany, has received a research grant from GSK, and has received honoraria from AstraZeneca, Bayer, Eisai, Janssen, MSD, BMS, Roche, Pfizer and Fresenius. I.C.W. received honoraria from AstraZeneca. U.L. participated in advisory boards and received honoraria from Janssen Cilag GmbH.

Figures

Fig. 1
Experimental setup. (a) The information extraction pipeline. The psychiatry reports (n = 100) were transferred to a CSV table. Our pipeline then iterates over all reports with the predefined prompt and outputs a JavaScript Object Notation (JSON) file with all large language model (LLM) outputs (PRED). The relevant classes (suicidality present: yes or no) were then extracted from the LLM output, which was more verbose in some cases. These outputs were then transferred to a pandas dataframe and automatically compared to the expert-based ground truth (GT). (b) The initial prompting strategy. One prompt and one report were given to the model at the same time. Every prompt contained a system prompt with general instructions and a specific question about the report (Instruction). (c) The chain-of-thought approach: the psychiatry report with our prompt was fed into the LLM, which generated a first output. With a second prompt and a predefined answering grammar, the model was then fed its own output and forced to generate a fixed, JSON-based output structure. This final output then underwent performance analysis. Icon source: Midjourney.
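The extraction step in panel (a), where a yes/no class is recovered from sometimes verbose LLM output and compared with the ground truth, might look roughly like the sketch below. It is only an illustration under assumptions: the JSON key `suicidality`, the accepted answer tokens (including German "ja"/"nein") and the function names are hypothetical and are not taken from the paper, and plain Python lists stand in for the pandas dataframe mentioned in the caption.

```python
import json
import re

# Hypothetical answer vocabularies; the study's actual output format is not given here.
YES_TOKENS = {"yes", "ja", "true"}
NO_TOKENS = {"no", "nein", "false"}


def extract_label(raw_output, key="suicidality"):
    """Return True/False for a recovered label, or None if nothing usable is found."""
    # Step 1: look for a JSON object embedded in the (possibly verbose) output.
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    if match:
        try:
            value = json.loads(match.group(0)).get(key)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, bool):
            return value
        if isinstance(value, str):
            token = value.strip().lower()
            if token in YES_TOKENS:
                return True
            if token in NO_TOKENS:
                return False
    # Step 2: fall back to the first yes/no-like word in the free text.
    for token in re.findall(r"[a-z]+", raw_output.lower()):
        if token in YES_TOKENS:
            return True
        if token in NO_TOKENS:
            return False
    return None


def compare_to_ground_truth(predictions, ground_truth):
    """Pair each extracted prediction (PRED) with the expert label (GT)."""
    rows = []
    for report_id, raw_output in predictions.items():
        pred = extract_label(raw_output)
        gt = ground_truth[report_id]
        rows.append({"report_id": report_id, "pred": pred, "gt": gt, "match": pred == gt})
    return rows
```

Returning `None` for unparseable outputs keeps them visible for manual review instead of silently counting them as either class. The free-text fallback is deliberately crude (a negated sentence containing "no" is read as absent), which is one reason a constrained answering grammar, as in panel (c), is attractive.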
Fig. 2
Performance of the German-language fine-tuned Llama-2 model. (a) Sensitivity and specificity for five different prompting strategies. With P0, the model was simply asked whether suicidality was present in the report. P1, P2 and P3 provided one, two or three examples to the model. P4 applied a chain-of-thought approach, where the model was asked twice, with the first model output as input for the second run. (b) Confusion matrix representing the performance of the large language model (LLM) in indicating the presence of suicidality based on the examined admission notes (n = 100), with a sensitivity of 83% and a specificity of 92% for P3, a prompt that included three examples. (c) Bar chart showing the balanced accuracies for all models and prompt engineering attempts. Error bars show the 95% confidence interval of the bootstrapped samples.
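The balanced accuracy and bootstrapped 95% confidence interval shown in panel (c) can be sketched as follows. This assumes a simple percentile bootstrap over reports; the caption does not state which bootstrap variant or how many resamples the authors used, so `n_boot`, `seed` and the function names are illustrative assumptions.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = present)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    sensitivity = sum(1 for p in pos if p == 1) / len(pos)
    specificity = sum(1 for p in neg if p == 0) / len(neg)
    return (sensitivity + specificity) / 2


def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for balanced accuracy."""
    if len(set(y_true)) < 2:
        raise ValueError("both classes must be present in y_true")
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    while len(scores) < n_boot:
        # Resample report indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_true = [y_true[i] for i in idx]
        if len(set(sample_true)) < 2:
            continue  # skip degenerate resamples that contain only one class
        sample_pred = [y_pred[i] for i in idx]
        scores.append(balanced_accuracy(sample_true, sample_pred))
    scores.sort()
    lower = scores[round((alpha / 2) * n_boot)]
    upper = scores[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

With only 100 reports, intervals from such a procedure are expected to be fairly wide, so error bars that overlap between prompting strategies call for caution when ranking them.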
