Developing Large Language Model-based Pipeline for Identification of Disease Diagnosis: A Case Study on Identifying Newly Diagnosed Multiple Myeloma and its Precursor Disease in Veterans Health Administration Electronic Health Records

Mei Wang^{1,2}, Yuan-Hung Kuan^{1,2}, Patrik R Alba^{3,4}, Qiwei Gan^{3,4}, Martin W Schoen^{1,5}, Theodore S Thomas^{1,2}, Jr-Shin Li², Su-Hsin Chang^{1,2}

Affiliations

¹ Research Service, St. Louis Veterans Affairs Medical Center, St. Louis, MO.
² Washington University in St. Louis, St. Louis, MO.
³ Veterans Affairs Salt Lake City Health Care System, Salt Lake City, UT.
⁴ University of Utah School of Medicine, Salt Lake City, UT.
⁵ Saint Louis University School of Medicine, St. Louis, MO.

PMID: 41726535
PMCID: PMC12919547

Mei Wang et al. AMIA Annu Symp Proc. 2025 May 22:2024:1325-1334. eCollection 2024.

Abstract

Accurately identifying disease diagnoses from electronic health records (EHRs) is crucial for clinical and biomedical research. This is challenging, however, when a diagnosis is complex and requires data from several sources, as with multiple myeloma (MM) and its precursor condition, monoclonal gammopathy of undetermined significance (MGUS). Leveraging the national Veterans Health Administration EHRs, we developed and validated a large language model (LLM)-based pipeline that uses only the clinical notes of randomly selected patients identified via ICD codes for MGUS/MM. Among the evaluated LLMs and alternative approaches, the Llama-3-8B-based pipeline with prompt engineering achieved the best performance. Relying solely on clinical notes, this pipeline eliminated preprocessing steps, shortened overall processing time, outperformed rule-based and machine learning-based methods for identifying MGUS, and achieved comparable performance for MM. Our work demonstrates that the developed LLM-based pipeline can efficiently and effectively identify MGUS/MM diagnoses and can replace manual chart abstraction and rule- or machine learning-based natural language processing methods.

Figures

**Figure 1.**
Overview of (a) EHR selection, (b) LLM-based pipeline development, validation, and (c) application

**Figure 2.**
Comparison of LLM performance in the testing dataset
