AMIA Annu Symp Proc. 2025 May 22;2024:1325-1334.
eCollection 2024.

Developing Large Language Model-based Pipeline for Identification of Disease Diagnosis: A Case Study on Identifying Newly Diagnosed Multiple Myeloma and its Precursor Disease in Veterans Health Administration Electronic Health Records

Mei Wang et al. AMIA Annu Symp Proc.

Abstract

Accurately identifying disease diagnoses from electronic health records (EHRs) is crucial for clinical and biomedical research. This is challenging when a diagnosis is complex and requires data from several sources, as with multiple myeloma (MM) and its precursor condition, monoclonal gammopathy of undetermined significance (MGUS). Leveraging the national Veterans Health Administration EHRs, we developed and validated a large language model (LLM)-based pipeline that uses only clinical notes from randomly selected patients identified via ICD codes for MGUS/MM. Among the evaluated LLMs and alternative approaches, the Llama-3-8B-based pipeline with prompt engineering achieved the best performance. Relying solely on clinical notes, this pipeline eliminated the preprocessing steps, shortened the overall processing time, outperformed rule-based and machine learning-based methods for identifying MGUS, and achieved comparable performance for MM. Our work demonstrates that the developed LLM-based pipeline can efficiently and effectively identify MGUS/MM diagnoses, offering a replacement for manual chart abstraction and for rule- or machine learning-based natural language processing methods.
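
The abstract does not give the prompts or the inference code, so the following is only a minimal sketch of what a notes-only, prompt-based classification step might look like. The label set, the prompt wording, the function names (`build_prompt`, `parse_label`, `classify_patient`), and the `fake_llm` stand-in are all hypothetical and are not taken from the paper. In practice `llm` would be a call to a model such as Llama-3-8B.

```python
from typing import Callable, List

# Hypothetical label set; the paper's actual output schema is not stated in the abstract.
LABELS = ("MM", "MGUS", "NEITHER")


def build_prompt(notes: List[str]) -> str:
    """Assemble one classification prompt from a patient's raw clinical notes."""
    joined = "\n---\n".join(note.strip() for note in notes)
    return (
        "You are reviewing clinical notes for one patient.\n"
        "Answer with exactly one label: MM, MGUS, or NEITHER.\n\n"
        f"Notes:\n{joined}\n\nLabel:"
    )


def parse_label(response: str) -> str:
    """Return the first recognized label in the model output, else 'UNPARSED'."""
    tokens = response.upper().replace(".", " ").replace(",", " ").split()
    for token in tokens:
        if token in LABELS:
            return token
    return "UNPARSED"


def classify_patient(notes: List[str], llm: Callable[[str], str]) -> str:
    """Run the prompt through any text-in, text-out model and parse its answer."""
    return parse_label(llm(build_prompt(notes)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make this sketch runnable."""
    return "MGUS." if "monoclonal gammopathy" in prompt.lower() else "Neither"


if __name__ == "__main__":
    notes = ["Assessment: monoclonal gammopathy of undetermined significance."]
    print(classify_patient(notes, fake_llm))
```

Passing the model in as a plain callable keeps the prompt construction and answer parsing testable without loading a model, and unrecognized outputs are flagged as `UNPARSED` rather than silently mapped to a diagnosis.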

Figures

Figure 1. Overview of (a) EHR selection, (b) LLM-based pipeline development, validation, and (c) application
Figure 2. Comparison of LLM performance in the testing dataset
