CDEMapper: enhancing National Institutes of Health common data element use with large language models
- PMID: 40332956
- PMCID: PMC12202029
- DOI: 10.1093/jamia/ocaf064
Abstract
Objective: Common Data Elements (CDEs) standardize data collection and sharing across studies, enhancing data interoperability and improving research reproducibility. However, implementing CDEs presents challenges due to the broad range and variety of data elements. This study aims to develop a CDE mapping tool to bridge the gap between local data elements and National Institutes of Health (NIH) CDEs.
Methods: We propose CDEMapper, a large language model (LLM)-powered tool designed to assist in mapping local data elements to NIH CDEs. CDEMapper has 3 core modules: (1) CDE indexing and embeddings: NIH CDEs were indexed and embedded to support semantic search; (2) CDE recommendations: the tool combines Elasticsearch (BM25 methods) with GPT services to recommend candidate CDEs and their permissible values; and (3) human review: users review and select the best match for their data elements and value sets. We evaluate the tool's recommendation accuracy against manual annotations and its usability through testing.
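The recommendation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three-entry catalogue, the identifiers (`CDES`, `recommend`, `alpha`), and the weighting scheme are hypothetical, the BM25 scorer is a plain-Python stand-in for Elasticsearch, and `embed` uses bag-of-words counts as a placeholder for GPT embeddings. The GPT ranker and the human review step are omitted.

```python
import math
from collections import Counter

# Hypothetical mini-catalogue standing in for the indexed NIH CDEs.
CDES = {
    "cde_age": "age of participant in years",
    "cde_sex": "sex of participant assigned at birth",
    "cde_sbp": "systolic blood pressure measurement in mmHg",
}

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

def embed(text):
    # Placeholder only: term counts, NOT a GPT embedding.
    return Counter(tokenize(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(query, docs, top_k=2, alpha=0.5):
    """Rank candidates by a weighted mix of lexical and vector similarity."""
    lexical = bm25_scores(query, docs)
    max_lex = max(lexical.values()) or 1.0  # avoid dividing by zero
    q_vec = embed(query)
    combined = {
        doc_id: alpha * lexical[doc_id] / max_lex
        + (1 - alpha) * cosine(q_vec, embed(text))
        for doc_id, text in docs.items()
    }
    return sorted(combined, key=combined.get, reverse=True)[:top_k]

candidates = recommend("participant blood pressure", CDES)
```

In a pipeline like the one the abstract describes, the shortlist in `candidates` would then be passed to a ranker and finally shown to a user, who picks the best match.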
Results: CDEMapper offers a publicly available, LLM-powered, and intuitive user interface that consolidates essential and advanced mapping services into a streamlined pipeline. In the evaluation, BM25 augmented with GPT embeddings and a GPT ranker achieved the best overall performance. The usability test also indicated that the tool is effective and efficient.
Discussions and conclusions: This work opens up the potential of using LLMs to assist with CDE mapping when aligning local data elements with NIH CDEs. Additionally, this effort helps researchers better understand the gaps between their data elements and NIH CDEs while promoting CDE reusability.
Keywords: common data element; data collection; data sharing; interoperability; large language model.
© The Author(s) 2025. Published by Oxford University Press on behalf of the American Medical Informatics Association.
Conflict of interest statement
Authors have no competing interests to declare.