CDE-Mapper: Using retrieval-augmented language models for linking clinical data elements to controlled vocabularies
- PMID: 40743885
- DOI: 10.1016/j.compbiomed.2025.110745
Abstract
The standardization of clinical data elements (CDEs) aims to ensure consistent and comprehensive patient information across healthcare systems. Existing methods often falter when standardizing CDEs that vary in representation and have complex structure, which impedes data integration and interoperability in clinical research. This paper presents CDE-Mapper, a framework that combines a retrieval-augmented generation strategy with large language models to automate the alignment of CDEs with controlled vocabularies. Our modular approach uses query decomposition to manage varying levels of CDE complexity, integrates expert-defined rules into prompt engineering, and employs in-context learning alongside multiple retriever components to resolve terminological ambiguities. In addition, we propose a knowledge reservoir validated through a human-in-the-loop approach, which provides accurate concept linking for future applications while minimizing computational costs. Across four diverse datasets, CDE-Mapper improved accuracy by an average of 7.2% over baseline methods. This work highlights the potential of advanced language models to improve data harmonization and to advance capabilities in clinical decision support systems and research.
Keywords: Clinical data elements; Controlled vocabularies; Metadata standardization; Retrieval-Augmented Generation; Tabular data annotation.
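The pipeline outlined in the abstract (query decomposition, retrieval of candidate concepts, and a validated knowledge reservoir consulted before retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary, the concept codes, the splitting rule, the token-overlap retriever, and every function and class name below are hypothetical, and the step where a language model would choose among candidates is replaced by simply taking the top-ranked one.

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary; the codes are placeholders, not real identifiers.
VOCABULARY: Dict[str, str] = {
    "C001": "systolic blood pressure",
    "C002": "diastolic blood pressure",
    "C003": "body mass index",
    "C004": "heart rate",
}


def decompose(cde: str) -> List[str]:
    """Split a composite CDE label into simpler sub-queries.

    Assumed rule for illustration: split on commas and on the word 'and'.
    """
    parts = cde.lower().replace(",", " and ").split(" and ")
    return [p.strip() for p in parts if p.strip()]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, standing in for a real retriever score."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, k: int = 2) -> List[str]:
    """Return the codes of the k vocabulary entries most similar to the query."""
    ranked = sorted(
        VOCABULARY.items(), key=lambda kv: jaccard(query, kv[1]), reverse=True
    )
    return [code for code, _ in ranked[:k]]


class Reservoir:
    """Cache of human-approved mappings, checked before any retrieval."""

    def __init__(self) -> None:
        self.validated: Dict[str, str] = {}

    def link(self, query: str) -> Tuple[str, str]:
        """Return (code, source), where source is 'reservoir' or 'retriever'."""
        if query in self.validated:
            return self.validated[query], "reservoir"
        candidates = retrieve(query)
        # Placeholder for the language-model selection step: take the top hit.
        return candidates[0], "retriever"

    def approve(self, query: str, code: str) -> None:
        """Record a mapping after a human reviewer has confirmed it."""
        self.validated[query] = code


if __name__ == "__main__":
    reservoir = Reservoir()
    for sub_query in decompose("Heart rate and body mass index"):
        code, source = reservoir.link(sub_query)
        print(sub_query, code, source)
        reservoir.approve(sub_query, code)
```

In this sketch, a mapping that has been approved once is served from the reservoir on later requests, which is one way the reuse of validated links could avoid repeated retrieval and generation.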
Copyright © 2025 The Authors. Published by Elsevier Ltd. All rights reserved.
Conflict of interest statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.