J Am Med Inform Assoc. 2025 Jul 1;32(7):1130-1139.
doi: 10.1093/jamia/ocaf064.

CDEMapper: enhancing National Institutes of Health common data element use with large language models


Yan Wang et al. J Am Med Inform Assoc.

Abstract

Objective: Common Data Elements (CDEs) standardize data collection and sharing across studies, enhancing data interoperability and improving research reproducibility. However, implementing CDEs presents challenges due to the broad range and variety of data elements. This study aims to develop a CDE mapping tool to bridge the gap between local data elements and National Institutes of Health (NIH) CDEs.

Methods: We propose CDEMapper, a large language model (LLM)-powered tool that assists in mapping local data elements to NIH CDEs. CDEMapper has 3 core modules: (1) CDE indexing and embedding, in which NIH CDEs are indexed and embedded to support semantic search; (2) CDE recommendation, in which the tool combines Elasticsearch (BM25 methods) with GPT services to recommend candidate CDEs and their permissible values; and (3) human review, in which users review the candidates and select the best match for their data elements and value sets. We evaluated the tool's recommendation accuracy against manual annotations and its usability through user testing.
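
The recommendation module described above pairs lexical BM25 retrieval with embedding-based similarity. The following is a minimal sketch of that general idea, not the CDEMapper implementation: it scores BM25 in plain Python instead of Elasticsearch, it substitutes a toy character-trigram vector for GPT embeddings, and it omits the GPT re-ranking step. The function names, the `alpha` fusion weight, and the example CDE names are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the standard BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_embed(text):
    """ASSUMPTION: character-trigram counts stand in for a real GPT embedding."""
    s = " " + " ".join(tokenize(text)) + " "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend(query, cdes, top_k=10, alpha=0.5):
    """Rank CDE names by a weighted sum of normalized BM25 and cosine similarity.

    ASSUMPTION: linear score fusion with weight alpha is an illustrative choice.
    """
    lexical = bm25_scores(query, cdes)
    top = max(lexical) or 1.0  # avoid dividing by zero when nothing matches
    q_vec = toy_embed(query)
    scored = []
    for cde, lex in zip(cdes, lexical):
        sem = cosine(q_vec, toy_embed(cde))
        scored.append((alpha * lex / top + (1 - alpha) * sem, cde))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cde for _, cde in scored[:top_k]]


if __name__ == "__main__":
    # Made-up CDE names for illustration only.
    candidates = [
        "Date of birth",
        "Body mass index value",
        "Systolic blood pressure measurement",
        "Sex assigned at birth",
    ]
    print(recommend("patient birth date", candidates, top_k=2))
```

In a real pipeline, a ranked list like this one would be what the human reviewer sees in the third module before confirming or rejecting a match.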

Results: CDEMapper offers a publicly available, LLM-powered, and intuitive user interface that consolidates essential and advanced mapping services into a streamlined pipeline. The evaluation showed that BM25 augmented with GPT embeddings and a GPT ranker achieved the best overall performance. The usability test also highlighted the effectiveness and efficiency of the tool.
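
One common way to score ranked recommendations against manual annotations is top-k accuracy: a source element counts as a hit when any manually accepted CDE appears among its first k candidates. The sketch below illustrates that metric under the assumption that it resembles the accuracy reported here; the abstract does not define the exact metric, and the function name, the data layout, and the identifiers are hypothetical.

```python
def top_k_accuracy(recommendations, gold, k=10):
    """Fraction of source elements with an accepted CDE in the top k.

    recommendations: dict mapping a source element to its ranked candidate list.
    gold: dict mapping a source element to the set of manually accepted CDEs
          (a set, so that one source element may have several valid targets).
    """
    if not gold:
        return 0.0
    hits = 0
    for source, accepted in gold.items():
        ranked = recommendations.get(source, [])[:k]
        if any(candidate in accepted for candidate in ranked):
            hits += 1
    return hits / len(gold)


if __name__ == "__main__":
    # Made-up identifiers for illustration only.
    recs = {"dob": ["A", "B", "C"], "sex": ["D", "E"], "bmi": ["F"]}
    annotations = {"dob": {"B"}, "sex": {"Z"}, "bmi": {"F"}}
    print(top_k_accuracy(recs, annotations, k=1))
    print(top_k_accuracy(recs, annotations, k=3))
```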

Discussions and conclusions: This work shows the potential of LLMs to assist with CDE mapping by aligning local data elements with NIH CDEs. It also helps researchers better understand the gaps between their data elements and NIH CDEs while promoting CDE reusability.

Keywords: common data element; data collection; data sharing; interoperability; large language model.


Conflict of interest statement

The authors have no competing interests to declare.

Figures

Figure 1. Overview of CDEMapper architecture.
Figure 2. Prompts designed for query expansion, CDE re-ranking, and value mapping.
Figure 3. The user-centered mapping workflow.
Figure 4. The user interface of CDEMapper, including (A) a ribbon menu for mapping actions, (B) a panel displaying the source data elements awaiting mapping, (C) a panel displaying the Top 10 candidate target CDEs, and (D) an area below the selected CDE for mapping source values to target values. CDEMapper is publicly available. A detailed user manual is accessible within the tool and on GitHub.
Figure 5. Accuracy of the BM25 baseline and BM25 with GPT embedding (B&E) methods under different mapping settings: (A) the 1 vs 1 mapping setting; (B) the M vs 1 mapping setting; (C) the 1 vs M mapping setting; and (D) the overall mapping setting.
