Comput Biol Med. 2025 Sep;196(Pt B):110745.
doi: 10.1016/j.compbiomed.2025.110745. Epub 2025 Jul 30.

CDE-Mapper: Using retrieval-augmented language models for linking clinical data elements to controlled vocabularies

Free article

Komal Gilani et al. Comput Biol Med. 2025 Sep.

Abstract

The standardization of clinical data elements (CDEs) aims to ensure consistent and comprehensive patient information across healthcare systems. Existing methods often falter when standardizing CDEs with varying representations and complex structures, impeding data integration and interoperability in clinical research. This paper presents CDE-Mapper, a framework that combines a retrieval-augmented generation strategy with large language models to automate the alignment of CDEs with controlled vocabularies. Our modular approach features query decomposition to manage varying levels of CDE complexity, integrates expert-defined rules within prompt engineering, and employs in-context learning alongside multiple retriever components to resolve terminological ambiguities. In addition, we propose a knowledge reservoir validated by a human-in-the-loop approach, which supports accurate concept linking in future applications while minimizing computational costs. Across four diverse datasets, CDE-Mapper achieved an average accuracy improvement of 7.2% over baseline methods. This work highlights the potential of advanced language models to improve data harmonization and to advance clinical decision support systems and research.
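
The flow described in the abstract (decompose a data element, check a reservoir of validated mappings, otherwise retrieve candidate concepts) can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary, the concept IDs, the delimiter-based `decompose` rule, the token-overlap `retrieve` function, and the `Reservoir` class are all hypothetical stand-ins, and the language-model step that would choose among candidates is replaced here by simply taking the top-ranked one.

```python
import re

# Toy controlled vocabulary. The concept IDs are invented for illustration.
VOCAB = {
    "C001": "systolic blood pressure",
    "C002": "diastolic blood pressure",
    "C003": "body mass index",
    "C004": "heart rate",
}


def tokens(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def decompose(cde):
    """Split a compound data element into sub-queries (naive delimiter rule)."""
    parts = re.split(r"\s*(?:,|;|/|\band\b)\s*", cde.lower())
    return [p.strip() for p in parts if p.strip()]


def retrieve(query, k=2):
    """Rank vocabulary concepts by Jaccard token overlap with the query."""
    q = tokens(query)
    scored = []
    for cid, label in VOCAB.items():
        t = tokens(label)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        scored.append((score, cid))
    # Highest score first; ties broken by concept ID for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(cid, score) for score, cid in scored[:k] if score > 0]


class Reservoir:
    """Stores mappings that a human reviewer has already confirmed."""

    def __init__(self):
        self._validated = {}

    def lookup(self, query):
        return self._validated.get(query)

    def store(self, query, concept_id):
        self._validated[query] = concept_id


def map_cde(cde, reservoir):
    """Map each sub-query to a concept ID, preferring validated mappings."""
    result = {}
    for part in decompose(cde):
        cached = reservoir.lookup(part)
        if cached is not None:
            result[part] = cached  # reuse a validated mapping, skip retrieval
            continue
        candidates = retrieve(part)
        # Stand-in for the language-model selection step: take the top hit.
        result[part] = candidates[0][0] if candidates else None
    return result
```

In this sketch an abbreviation such as "bp sys" has no token overlap with the vocabulary and maps to `None` until a reviewer confirms a mapping and it is stored in the reservoir, after which later lookups bypass retrieval entirely.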

Keywords: Clinical data elements; Controlled vocabularies; Metadata standardization; Retrieval-Augmented Generation; Tabular data annotation.

Conflict of interest statement

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
