Developing Large Language Model-based Pipeline for Identification of Disease Diagnosis: A Case Study on Identifying Newly Diagnosed Multiple Myeloma and its Precursor Disease in Veterans Health Administration Electronic Health Records
- PMID: 41726535
- PMCID: PMC12919547
Abstract
Accurately identifying disease diagnoses from electronic health records (EHRs) is crucial for clinical and biomedical research. However, this is challenging when a diagnosis is complex and requires data from several sources, as with multiple myeloma (MM) and its precursor condition, monoclonal gammopathy of undetermined significance (MGUS). Leveraging the national Veterans Health Administration EHRs, we developed and validated a large language model (LLM)-based pipeline that uses only the clinical notes of randomly selected patients identified via ICD codes for MGUS/MM. Among the evaluated LLMs and alternative approaches, the Llama-3-8B-based pipeline with prompt engineering achieved the best performance. Relying solely on clinical notes, this pipeline eliminated preprocessing steps, shortened the overall processing time, outperformed rule-based and machine learning-based methods for identifying MGUS, and achieved comparable performance for MM. Our work demonstrates that the developed LLM-based pipeline can efficiently and effectively identify MGUS/MM diagnoses and can replace manual chart abstraction and rule- or machine learning-based natural language processing methods.
©2024 AMIA - All rights reserved.
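The abstract describes the pipeline only at a high level: clinical notes go to an LLM with an engineered prompt, and the output is a diagnosis label. The following is a minimal sketch of that general pattern, not the authors' implementation. The prompt wording, the three labels, the function names, and the patient-level rule (MM outranks MGUS, which outranks NEITHER) are all assumptions made for illustration, and `fake_generate` is a hypothetical keyword stub standing in for a real model call such as Llama-3-8B.

```python
import re
from typing import Callable, Iterable

# Hypothetical prompt; the study's actual prompt is not given in the abstract.
PROMPT_TEMPLATE = (
    "You are reviewing a clinical note.\n"
    "Decide whether the note documents a diagnosis of multiple myeloma (MM), "
    "monoclonal gammopathy of undetermined significance (MGUS), or neither.\n"
    "Answer with exactly one word: MM, MGUS, or NEITHER.\n\n"
    "### NOTE\n{note}\n### ANSWER\n"
)


def build_prompt(note_text: str) -> str:
    """Embed one clinical note in the (assumed) instruction template."""
    return PROMPT_TEMPLATE.format(note=note_text.strip())


def parse_label(response: str) -> str:
    """Return the first recognised label in the reply, else NEITHER."""
    match = re.search(r"\b(MGUS|MM|NEITHER)\b", response.upper())
    return match.group(1) if match else "NEITHER"


def classify_patient(notes: Iterable[str], generate: Callable[[str], str]) -> str:
    """Label each note, then aggregate to one patient-level label.

    The precedence rule here (MM > MGUS > NEITHER) is an illustrative
    assumption, not a rule taken from the paper.
    """
    labels = [parse_label(generate(build_prompt(note))) for note in notes]
    for label in ("MM", "MGUS"):
        if label in labels:
            return label
    return "NEITHER"


def fake_generate(prompt: str) -> str:
    """Keyword stub standing in for an LLM; looks only at the note text."""
    note = prompt.split("### NOTE\n", 1)[1].split("\n### ANSWER", 1)[0].lower()
    if "multiple myeloma" in note:
        return "MM"
    if "mgus" in note:
        return "Answer: MGUS"
    return "neither"


if __name__ == "__main__":
    example_notes = [
        "IgG kappa MGUS, stable on surveillance.",
        "Bone marrow biopsy confirms multiple myeloma.",
    ]
    print(classify_patient(example_notes, fake_generate))
```

In practice `generate` would wrap a call to the chosen model, and the parsing step would need to handle replies that do not follow the one-word format.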