Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Apr 11;17(4):e0266911.
doi: 10.1371/journal.pone.0266911. eCollection 2022.

Extract, transform, load framework for the conversion of health databases to OMOP

Affiliations

Extract, transform, load framework for the conversion of health databases to OMOP

Juan C Quiroz et al. PLoS One. .

Abstract

Common data models standardize the structures and semantics of health datasets, enabling reproducibility and large-scale studies that leverage the data from multiple locations and settings. The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) is one of the leading common data models. While there is a strong incentive to convert datasets to OMOP, the conversion is time and resource-intensive, leaving the research community in need of tools for mapping data to OMOP. We propose an extract, transform, load (ETL) framework that is metadata-driven and generic across source datasets. The ETL framework uses a new data manipulation language (DML) that organizes SQL snippets in YAML. Our framework includes a compiler that converts YAML files with mapping logic into an ETL script. Access to the ETL framework is available via a web application, allowing users to upload and edit YAML files via web editor and obtain an ETL SQL script for use in development environments. The structure of the DML maximizes readability, refactoring, and maintainability, while minimizing technical debt and standardizing the writing of ETL operations for mapping to OMOP. Our framework also supports transparency of the mapping process and reuse by different institutions.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Architecture of the extract, transform, load (ETL) framework.
A compiler converts mapping logic, written by organizing SQL snippets in YAML key-value pairs, to an ETL SQL script, which contains all the executable operations to map from the source database to OMOP.
Fig 2
Fig 2. Source to OMOP YAML file structure.
Rules for mapping from the CERNER PERSON table to the OMOP PERSON table, with rules defined for two columns of the OMOP PERSON table: year_of_birth and death_datetime. For each target table, the mapping rules are defined on a column-by-column basis using SQL snippets organized into YAML fields. Mapping a table requires three sections: (1) name of the OMOP table being mapped, (2) definition of primary keys used by the ETL framework to manage the load (insert) operations, and (3) mapping rules for columns in the OMOP table.
Fig 3
Fig 3. Definition of primary keys in YAML.
Definition of the primary key for the OMOP CONDITION_OCCURRENCE table, which is populated—in the illustrated example—with the data from two source tables: DIAGNOSIS and PROBLEM. For the framework to handle the load operations, the primary keys of each source table must be defined under the sources YAML field.
Fig 4
Fig 4. Use of PostgreSQL syntax to define mapping logic.
An example of mapping logic to populate the gender_concept_id and the gender_source_value columns in the OMOP PERSON table. The transformation to populate gender_concept_id uses a CASE statement (if-then-else). The transformation to populate gender_source_value uses the data from two source tables, with the constraint indicating how these two source tables are joined.
Fig 5
Fig 5. Example of complex mapping rules.
Two rules are used to map diagnosis records and problem records from multiple source tables to the condition_start_date field in the OMOP CONDITION_OCCURRENCE table. Multiple rules can be written to map source data to a single OMOP column, dividing complex logic into queries that are easier to read and debug.
Fig 6
Fig 6. Use of YAML anchors and aliases.
Rules for mapping from the CERNER PERSON table to the OMOP PERSON table, using YAML anchors to define a set of default values to be used by the column fields.

References

    1. Harron K, Dibben C, Boyd J, Hjern A, Azimaee M, Barreto ML, et al.. Challenges in administrative data linkage for research. Big Data Soc. 2017;4: 2053951717745678. doi: 10.1177/2053951717745678 - DOI - PMC - PubMed
    1. Casey JA, Schwartz BS, Stewart WF, Adler NE. Using Electronic Health Records for Population Health Research: A Review of Methods and Applications. Annu Rev Public Health. 2016;37: 61–81. doi: 10.1146/annurev-publhealth-032315-021353 - DOI - PMC - PubMed
    1. Hernán MA, Robins JM. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. Am J Epidemiol. 2016;183: 758–764. doi: 10.1093/aje/kwv254 - DOI - PMC - PubMed
    1. Abul-Husn NS, Kenny EE. Personalized Medicine and the Power of Electronic Health Records. Cell. 2019;177: 58–69. doi: 10.1016/j.cell.2019.02.039 - DOI - PMC - PubMed
    1. FitzHenry F, Resnic FS, Robbins SL, Denton J, Nookala L, Meeker D, et al.. Creating a Common Data Model for Comparative Effectiveness with the Observational Medical Outcomes Partnership. Appl Clin Inform. 2015;06: 536–547. doi: 10.4338/ACI-2014-12-CR-0121 - DOI - PMC - PubMed

Publication types