Using Open-Source Large Language Models to Identify Access to Germline Genetic Testing in Veterans With Breast Cancer From Unstructured Text
- PMID: 40694781
- PMCID: PMC12303249
- DOI: 10.1200/CCI-24-00263
Abstract
Purpose: The ability of large language models (LLMs) to identify access to germline genetic testing from unstructured text remains unknown. The Department of Veterans Affairs (VA) assessed access in Veterans with breast cancer by implementing and evaluating the performance of open-source, locally deployable LLMs (Llama 3 70B, Llama 3 8B, and Llama 2 70B) in identifying access from clinical/consult notes.
Methods: We identified a cohort of 1,201 Veterans diagnosed with breast cancer between January 1, 2021, and December 31, 2022, who received cancer care within the nationwide VA system and had clinical and/or consult notes available. Notes from a subset of 200 randomly selected patients, reviewed by subject-matter experts to identify access to testing, were split into development and testing sets, and various hyperparameters and prompting approaches were applied. We evaluated LLM performance using accuracy, precision, recall, and F1, with expert consensus on the labeled subset serving as ground truth. We compared LLM-identified access distribution in the entire cohort with expert-identified access in the labeled subset using the chi-squared test.
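As a minimal sketch of the evaluation metrics named above, the following Python computes accuracy, precision, recall, and F1 for binary labels. It assumes access to testing is coded as 1 and no access as 0; the function name `binary_metrics` and the toy labels are hypothetical illustrations, not the study's code or data.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred):
    """Compare predicted labels with ground truth (1 = access, 0 = no access)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return BinaryMetrics(accuracy, precision, recall, f1)


# Toy labels for illustration only (not study data).
expert_labels = [1, 1, 1, 0, 1, 0, 1, 1]
llm_labels = [1, 1, 0, 0, 1, 1, 1, 1]
metrics = binary_metrics(expert_labels, llm_labels)
print(metrics)
```

In this toy example the expert consensus plays the role of ground truth and the second list stands in for LLM output on the same patients.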
Results: Llama 3 70B achieved an F1 score of 0.912 (95% CI, 0.853 to 0.971), besting Llama 3 8B (F1: 0.811; 95% CI, 0.720 to 0.901) and significantly outperforming Llama 2 70B (F1: 0.644; 95% CI, 0.514 to 0.773). The test set target variable prevalence was 0.72. We observed no significant difference between the performance of Llama 3 70B and that of the average individual expert reviewer, nor between LLM-identified access distribution across the entire cohort and expert-identified distribution in the labeled subset.
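The distribution comparison can be illustrated with a Pearson chi-squared test on a 2x2 table of access versus no access in two groups. This is a sketch under stated assumptions: the abstract does not specify the exact form of the test, so a test of independence without continuity correction is assumed here, and the function name `chi2_2x2` and all counts are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups being compared; columns are access / no access.
    Returns (statistic, p_value) with 1 degree of freedom and no
    continuity correction.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")

    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up counts for illustration only (not study data):
# group 1: 20 with access, 10 without; group 2: 10 with access, 20 without.
stat, p = chi2_2x2(20, 10, 10, 20)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```

A large p-value from such a test would be consistent with the two groups sharing the same access distribution, which is the pattern the abstract reports.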
Conclusion: An open-source, locally deployable LLM effectively and efficiently identified germline genetic testing access from clinical notes. LLMs may enhance care quality and efficiency, while safeguarding sensitive data.
Conflict of interest statement
The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript.
No other potential conflicts of interest were reported.