2019 Apr 23;7(1):17.
doi: 10.5334/egems.289.

A Data Element-Function Conceptual Model for Data Quality Checks

James R Rogers et al. EGEMS (Wash DC).

Abstract

Introduction: Existing data quality (DQ) checks are represented in heterogeneous formats, making it difficult to compare, categorize, and index them. This study contributes a data element-function conceptual model to facilitate the categorization and indexing of DQ checks and explores the feasibility of leveraging natural language processing (NLP) for scalable acquisition of knowledge of common data elements and functions from DQ check narratives.

Methods: The model defines a "data element", the primary focus of the check, and a "function", the qualitative or quantitative measure over a data element. We applied NLP techniques to extract both from 172 checks for Observational Health Data Sciences and Informatics (OHDSI) and 3,434 checks for Kaiser Permanente's Center for Effectiveness and Safety Research (CESR).
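As a minimal sketch of the model described above, each check can be represented as a pairing of one data element with one function, and pairings can then be tallied by frequency. The class name `DQCheck`, the helper `pairing_counts`, and the three example checks are hypothetical and invented for illustration; they are not taken from the OHDSI or CESR check sets, and this is not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DQCheck:
    """One data quality check under the data element-function model."""
    description: str   # free-text narrative of the check
    data_element: str  # primary focus of the check
    function: str      # qualitative or quantitative measure over the element


def pairing_counts(checks):
    """Count how often each (data element, function) pairing occurs."""
    return Counter((c.data_element, c.function) for c in checks)


# Hypothetical checks, invented for illustration only.
checks = [
    DQCheck("Number of persons", "Person", "Count"),
    DQCheck("Number of persons by gender", "Person", "Count"),
    DQCheck("Medication start date is not null", "Medication", "Missing"),
]

# The two most frequent pairings, most frequent first.
top_pairings = pairing_counts(checks).most_common(2)
```

Representing a pairing as a plain tuple keeps it hashable, so the same structure can serve as an index key when comparing checks across networks.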

Results: The model was able to classify all checks. A total of 751 unique data elements and 24 unique functions were extracted. The five most frequent data element-function pairings for OHDSI were Person-Count (55 checks), Insurance-Distribution (17), Medication-Count (16), Condition-Count (14), and Observations-Count (13); for CESR, they were Medication-Variable Type (175), Medication-Missing (172), Medication-Existence (152), Medication-Count (127), and Socioeconomic Factors-Variable Type (114).

Conclusions: This study shows the efficacy of the data element-function conceptual model for classifying DQ checks, demonstrates early promise of NLP-assisted knowledge acquisition, and reveals great heterogeneity in the focus of DQ checks, confirming variation in intrinsic checks and use-case-specific "fitness-for-use" checks.

Keywords: clinical data research networks; data quality; electronic healthcare records; knowledge acquisition; natural language processing.

Conflict of interest statement

The authors have no competing interests to declare.

Figures

Figure 1
High-level overview of workflow with example DQ checks and their corresponding constructs, terms, and suggested domains. A data element is a focus of a DQ check (annotation is represented by “[[ data element ]]” for parsing); a function is the qualitative or quantitative evaluation over the data element (annotation is represented by “{{ function }}” for parsing). Each DQ check is essentially a function of a data element.
Figure 2
Horizontal bar charts of the frequency of DQ check domains specific to OHDSI, overlaid with DQ harmonization categories. Brief descriptions of the DQ harmonization categories: Completeness, Atemporal is the data’s presence in a particular context at an individual time point; Conformance, Calculation is the data’s compliance with constraints relating to computationally derived values from existing data; Conformance, Relational is the data’s compliance with structural constraints as they relate to physical database structure specifications (e.g., primary key and foreign key relationships); Plausibility, Atemporal is the data’s feasibility at an individual time point; Plausibility, Temporal is the data’s feasibility across a series of time points in a defined time period.
Figure 3
Horizontal bar charts of the frequency of DQ check domains specific to CESR, overlaid with DQ harmonization categories. Brief descriptions of the DQ harmonization categories: Completeness, Atemporal is the data’s presence in a particular context at an individual time point; Conformance, Calculation is the data’s compliance with constraints relating to computationally derived values from existing data; Conformance, Relational is the data’s compliance with structural constraints as they relate to physical database structure specifications (e.g., primary key and foreign key relationships); Conformance, Value is the data’s compliance with structural constraints as they relate to prespecified formatting constraints (e.g., data element is numeric); Plausibility, Atemporal is the data’s feasibility at an individual time point; Plausibility, Temporal is the data’s feasibility across a series of time points in a defined time period; Plausibility, Uniqueness is the data’s feasibility regarding duplication.
Figure 4
Heat maps of DQ check domains. Domains represented in both networks are indicated with an “*”.
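
The Figure 1 caption describes an annotation convention in which a data element is marked as “[[ data element ]]” and a function as “{{ function }}” for parsing. A minimal sketch of how such annotated text could be parsed with regular expressions follows. The function name `parse_annotated_check` and the example string are hypothetical, invented for illustration, and this is not the authors' pipeline.

```python
import re

# Patterns for the two annotation markers named in the Figure 1 caption.
# Surrounding whitespace inside the markers is stripped from the capture.
DATA_ELEMENT = re.compile(r"\[\[\s*(.+?)\s*\]\]")
FUNCTION = re.compile(r"\{\{\s*(.+?)\s*\}\}")


def parse_annotated_check(text):
    """Return the annotated data elements and functions found in a check."""
    return {
        "data_elements": DATA_ELEMENT.findall(text),
        "functions": FUNCTION.findall(text),
    }


# Hypothetical annotated check narrative, invented for illustration only.
example = "{{ Number }} of [[ persons ]] by [[ year of birth ]]"
parsed = parse_annotated_check(example)
```

The lazy quantifier `.+?` stops each match at the first closing marker, so several annotations in one narrative are captured separately.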
