CDEMapper: enhancing National Institutes of Health common data element use with large language models
- PMID: 40332956
- PMCID: PMC12202029
- DOI: 10.1093/jamia/ocaf064
Abstract
Objective: Common Data Elements (CDEs) standardize data collection and sharing across studies, enhancing data interoperability and improving research reproducibility. However, implementing CDEs presents challenges due to the broad range and variety of data elements. This study aims to develop a CDE mapping tool to bridge the gap between local data elements and National Institutes of Health (NIH) CDEs.
Methods: We propose CDEMapper, a large language model (LLM)-powered tool that assists in mapping local data elements to NIH CDEs. CDEMapper has 3 core modules: (1) CDE indexing and embeddings: NIH CDEs were indexed and embedded to support semantic search; (2) CDE recommendations: the tool combines Elasticsearch (BM25) with GPT services to recommend candidate CDEs and their permissible values; and (3) human review: users review and select the best match for their data elements and value sets. We evaluated the tool's recommendation accuracy against manual annotations and its usability through user testing.
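To make the recommendation module more concrete, the following is a minimal sketch of how a lexical BM25 score and an embedding similarity might be blended to rank candidate CDEs. It is not the authors' implementation: it does not call Elasticsearch or any GPT service, the BM25 scorer is a plain in-memory version of the standard Okapi formula, the 3-dimensional vectors are made-up stand-ins for real embeddings, and the function names, the candidate CDE names, and the `alpha` blending weight are all hypothetical. The GPT ranker and the human review step are not modeled.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real analyzers do more)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def recommend(query, query_vec, cdes, top_k=3, alpha=0.5):
    """Rank (name, vector) candidates by a blend of BM25 and cosine similarity.

    alpha is a hypothetical weight: 1.0 means lexical only, 0.0 means
    embedding only. BM25 scores are scaled by the maximum so both parts
    are on a comparable range.
    """
    names = [name for name, _ in cdes]
    lexical = bm25_scores(query, names)
    top = max(lexical) or 1.0  # avoid dividing by zero when nothing matches
    ranked = []
    for (name, vec), lex in zip(cdes, lexical):
        blended = alpha * (lex / top) + (1 - alpha) * cosine(query_vec, vec)
        ranked.append((name, blended))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical candidates with made-up vectors standing in for embeddings.
cdes = [
    ("patient birth date", [0.9, 0.1, 0.0]),
    ("systolic blood pressure", [0.0, 0.9, 0.2]),
    ("diastolic blood pressure", [0.0, 0.8, 0.3]),
]
results = recommend("date of birth", [0.8, 0.2, 0.0], cdes)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

In this toy run the local element "date of birth" shares two tokens with "patient birth date" and is also closest to it in the made-up vector space, so that candidate ranks first. In a setting like the one described above, the ranked list would then be shown to a user, who makes the final selection.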
Results: CDEMapper offers a publicly available, LLM-powered, and intuitive user interface that consolidates essential and advanced mapping services into a streamlined pipeline. In the evaluation, BM25 augmented with GPT embeddings and a GPT ranker achieved the best overall performance. The usability test also highlighted the effectiveness and efficiency of the tool.
Discussions and conclusions: This work opens up the potential of using LLMs to assist with CDE mapping when aligning local data elements with NIH CDEs. Additionally, this effort helps researchers better understand the gaps between their data elements and NIH CDEs while promoting CDE reusability.
Keywords: common data element; data collection; data sharing; interoperability; large language model.
© The Author(s) 2025. Published by Oxford University Press on behalf of the American Medical Informatics Association.
Conflict of interest statement
The authors have no competing interests to declare.
