J Am Med Inform Assoc. 2024 Sep 1;31(9):1856-1864.

doi: 10.1093/jamia/ocae030.

Clinical risk prediction using language models: benefits and considerations

Angeela Acharya¹, Sulabh Shrestha¹, Anyi Chen², Joseph Conte², Sanja Avramovic¹, Siddhartha Sikdar¹, Antonios Anastasopoulos¹, Sanmay Das¹

Affiliations

¹ George Mason University, Fairfax, VA, United States.
² Staten Island Performing Provider System, Staten Island, NY, United States.

PMID: 38412328
PMCID: PMC11339498
DOI: 10.1093/jamia/ocae030

Angeela Acharya et al. J Am Med Inform Assoc. 2024.

Abstract

Objective: The use of electronic health records (EHRs) for clinical risk prediction is on the rise. However, in many practical settings, the limited availability of task-specific EHR data can restrict the application of standard machine learning pipelines. In this study, we investigate the potential of leveraging language models (LMs) as a means to incorporate supplementary domain knowledge for improving the performance of various EHR-based risk prediction tasks.

Methods: We propose two novel LM-based methods, namely "LLaMA2-EHR" and "Sent-e-Med." Our focus is on utilizing the textual descriptions within structured EHRs to make risk predictions about future diagnoses. We conduct a comprehensive comparison with previous approaches across various data types and sizes.

Results: Experiments across 6 different methods and 3 separate risk prediction tasks reveal that employing LMs to represent structured EHRs, such as diagnostic histories, results in significant performance improvements when evaluated using standard metrics such as area under the receiver operating characteristic (ROC) curve and precision-recall (PR) curve. Additionally, they offer benefits such as few-shot learning, the ability to handle previously unseen medical concepts, and adaptability to various medical vocabularies. However, it is noteworthy that outcomes may exhibit sensitivity to a specific prompt.
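The two metrics named in the Results can be computed directly from labels and predicted scores. The following is a minimal sketch using only the standard library, not the authors' evaluation code; the function names `roc_auc` and `average_precision` are hypothetical, and the area under the PR curve is approximated here by average precision, which is one common convention.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen case is scored above a randomly chosen control (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise summary of the PR curve: the mean of the precision values
    measured at the rank of each true case (tied scores keep input order)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one case")
    return total / hits


# Hypothetical labels (1 = case, 0 = control) and model scores.
labels = [0, 0, 1, 1]
preds = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, preds), average_precision(labels, preds))
```

Average precision is sensitive to class imbalance in a way that the ROC area is not, which is one reason risk prediction studies with rare outcomes tend to report both.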

Conclusion: LMs encompass extensive embedded knowledge, making them valuable for the analysis of EHRs in the context of risk prediction. Nevertheless, it is important to exercise caution in their application, as ongoing safety concerns related to LMs persist and require continuous consideration.

Keywords: electronic health records; large language models; opioid use disorder; risk prediction; substance use disorder.

Conflict of interest statement

The authors have no competing interests to declare.

Figures

**Figure 1.**
Representation of medical records for a single patient in a typical EHR: A visit may have a varying number of medical entities (ie, diagnosis, procedure, medications, etc.).

**Figure 2.**
Process of creating patient groups. Patients who had at least one OUD/SUD/Diabetes diagnosis are put into Case Group while those who did not have any OUD/SUD/Diabetes diagnosis are put into Control Group.
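The grouping rule in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' cohort definition: the function name `split_case_control`, the patient identifiers, and the use of code prefixes to identify a target condition are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def split_case_control(
    histories: Dict[str, Iterable[str]],
    target_prefixes: Tuple[str, ...],
) -> Tuple[Set[str], Set[str]]:
    """Assign each patient to the case group if any recorded diagnosis code
    starts with one of the target prefixes, otherwise to the control group."""
    cases: Set[str] = set()
    controls: Set[str] = set()
    for patient_id, codes in histories.items():
        if any(code.startswith(target_prefixes) for code in codes):
            cases.add(patient_id)
        else:
            controls.add(patient_id)
    return cases, controls


# Hypothetical diagnosis histories; the prefix "F11" is used only as an
# illustrative stand-in for a target condition.
histories = {
    "p1": ["F11.20", "I10"],
    "p2": ["I10"],
    "p3": [],
}
cases, controls = split_case_control(histories, ("F11",))
print(sorted(cases), sorted(controls))
```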

**Figure 3.**
High-level overview of the Sent-e-Med architecture: for each medical code, sentence embeddings and visit embeddings are extracted and subsequently combined before being fed into the transformer encoder as input.

**Figure 4.**
Illustration of the 2 distinct prompts used to fine-tune the LLaMA2-EHR model. Prompt 1 aggregates the frequency of diagnosis occurrences across multiple visits, while Prompt 2 evaluates diagnoses on a per-visit basis and incorporates information about the intervals between visits. Red highlights in the text indicate patient-specific variations in the information. Note: Inputs in the prompts and the responses are hypothetical examples.
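To make the difference between the two prompt styles concrete, here is a rough sketch of how each could be assembled from a visit history. The exact prompt wording used in the paper is not given in this record, so the templates, the function names `frequency_prompt` and `per_visit_prompt`, and the example patient are all assumptions.

```python
from collections import Counter
from datetime import date
from typing import List, Tuple

Visit = Tuple[date, List[str]]  # (visit date, diagnosis descriptions)


def frequency_prompt(visits: List[Visit], outcome: str) -> str:
    """Prompt 1 style: count how often each diagnosis appears across visits."""
    counts = Counter(dx for _, diagnoses in visits for dx in diagnoses)
    lines = [
        f"- {name} (recorded {n} time{'s' if n != 1 else ''})"
        for name, n in counts.most_common()
    ]
    question = f"Will this patient be diagnosed with {outcome}? Answer Yes or No."
    return "Diagnosis history:\n" + "\n".join(lines) + "\n" + question


def per_visit_prompt(visits: List[Visit], outcome: str) -> str:
    """Prompt 2 style: list diagnoses visit by visit with the gap in days."""
    lines = []
    previous = None
    for index, (day, diagnoses) in enumerate(sorted(visits, key=lambda v: v[0]), 1):
        gap = "" if previous is None else f" ({(day - previous).days} days after previous visit)"
        lines.append(f"Visit {index}{gap}: {', '.join(diagnoses)}")
        previous = day
    question = f"Will this patient be diagnosed with {outcome}? Answer Yes or No."
    return "\n".join(lines) + "\n" + question


# Hypothetical patient history, for illustration only.
visits = [
    (date(2020, 1, 1), ["Essential hypertension"]),
    (date(2020, 1, 31), ["Essential hypertension", "Obesity"]),
]
print(frequency_prompt(visits, "Diabetes"))
print(per_visit_prompt(visits, "Diabetes"))
```

The aggregated form is shorter and discards ordering, while the per-visit form preserves temporal structure at the cost of a longer input.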

**Figure 5.**
Variation in LLaMA2-EHR responses when predicting the probability of a Diabetes diagnosis from 2 distinct inputs: one representing a simple hypothetical patient’s medical history (Input 1) and another that adds diagnoses known to be risk factors for Diabetes (Input 2). The objective is to analyze how the likelihood of a “Yes” or “No” prediction changes within these specific scenarios.
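One common way to turn a "Yes"/"No" answer into a likelihood is to renormalize the model's scores for those two candidate tokens. The sketch below assumes the model exposes a raw score (logit) for each token; whether the paper derives its probabilities this way is not stated here, and the function name and logit values are hypothetical.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the scores of the "Yes" and "No" tokens.
    Subtracting the maximum first keeps the exponentials numerically stable."""
    top = max(logit_yes, logit_no)
    exp_yes = math.exp(logit_yes - top)
    exp_no = math.exp(logit_no - top)
    return exp_yes / (exp_yes + exp_no)


# Hypothetical logits for two inputs: a simple history, and the same history
# with added risk-factor diagnoses.
p_input_1 = yes_probability(logit_yes=-0.5, logit_no=1.0)
p_input_2 = yes_probability(logit_yes=1.5, logit_no=0.2)
print(p_input_1, p_input_2)
```

Comparing the two values shows how the predicted likelihood of "Yes" shifts between inputs, independent of whatever probability mass the model places on other tokens.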
