J Am Med Inform Assoc. 2022 Mar 15;29(4):609-618.
doi: 10.1093/jamia/ocab217.

Synergies between centralized and federated approaches to data quality: a report from the national COVID cohort collaborative

Emily R Pfaff, Andrew T Girvin, Davera L Gabriel, Kristin Kostka, Michele Morris, Matvey B Palchuk, Harold P Lehmann, Benjamin Amor, Mark Bissell, Katie R Bradwell, Sigfried Gold, Stephanie S Hong, Johanna Loomba, Amin Manna, Julie A McMurry, Emily Niehaus, Nabeel Qureshi, Anita Walden, Xiaohan Tanner Zhang, Richard L Zhu, Richard A Moffitt, Melissa A Haendel, Christopher G Chute; N3C Consortium
N3C Consortium: William G Adams, Shaymaa Al-Shukri, Alfred Anzalone, Ahmad Baghal, Tellen D Bennett, Elmer V Bernstam, Mark M Bissell, Brian Bush, Thomas R Campion, Victor Castro, Jack Chang, Deepa D Chaudhari, Wenjin Chen, San Chu, James J Cimino, Keith A Crandall, Mark Crooks, Sara J Deakyne Davies, John DiPalazzo, David Dorr, Dan Eckrich, Sarah E Eltinge, Daniel G Fort, George Golovko, Snehil Gupta, Melissa A Haendel, Janos G Hajagos, David A Hanauer, Brett M Harnett, Ronald Horswell, Nancy Huang, Steven G Johnson, Michael Kahn, Kamil Khanipov, Curtis Kieler, Katherine Ruiz De Luzuriaga, Sarah Maidlow, Ashley Martinez, Jomol Mathew, James C McClay, Gabriel McMahan, Brian Melancon, Stephane Meystre, Lucio Miele, Hiroki Morizono, Ray Pablo, Lav Patel, Jimmy Phuong, Daniel J Popham, Claudia Pulgarin, Carlos Santos, Indra Neil Sarkar, Nancy Sazo, Soko Setoguchi, Selvin Soby, Sirisha Surampalli, Christine Suver, Uma Maheswara Reddy Vangala, Shyam Visweswaran, James von Oehsen, Kellie M Walters, Laura Wiley, David A Williams, Adrian Zai

Emily R Pfaff et al. J Am Med Inform Assoc.

Abstract

Objective: In response to COVID-19, the informatics community united to aggregate as much clinical data as possible to characterize this new disease and reduce its impact through collaborative analytics. The National COVID Cohort Collaborative (N3C) is now the largest publicly available HIPAA limited dataset in US history with over 6.4 million patients and is a testament to a partnership of over 100 organizations.

Materials and methods: We developed a pipeline for ingesting, harmonizing, and centralizing data from 56 contributing data partners using 4 federated Common Data Models. N3C data quality (DQ) review involves both automated and manual procedures. In the process, several DQ heuristics were discovered in our centralized context, both within the pipeline and during downstream project-based analysis. Feedback to the sites led to many local and centralized DQ improvements.

Results: Beyond well-recognized DQ findings, we discovered 15 heuristics relating to source Common Data Model conformance, demographics, COVID tests, conditions, encounters, measurements, observations, coding completeness, and fitness for use. Of 56 sites, 37 sites (66%) demonstrated issues through these heuristics. These 37 sites demonstrated improvement after receiving feedback.

Discussion: We encountered site-to-site differences in DQ which would have been challenging to discover using federated checks alone. We have demonstrated that centralized DQ benchmarking reveals unique opportunities for DQ improvement that will support improved research analytics locally and in aggregate.

Conclusion: By combining rapid, continual assessment of DQ with a large volume of multisite data, it is possible to support more nuanced scientific questions with the scale and rigor that they require.

Keywords: COVID-19; data accuracy; electronic health records.

Figures

Figure 1.
The N3C data ingestion and harmonization pipeline. Participating sites regularly submit data in their native CDM format to an ingest server. A parsing step validates whether the data are formatted properly and checks the contents of the payload against its package description, or “manifest.” The pipeline then transforms the submitted data to the OMOP model; data provenance is automatically maintained such that transformed data can be traced back to source at any time. The transformed data are then reviewed for DQ by a team of subject matter experts using a suite of data characterization and visualization tools. Every week, the latest data from all sites passing DQ checks are published as a versioned “release” for use by investigators. CDM: Common Data Model; DQ: data quality; N3C: National COVID Cohort Collaborative; OMOP: Observational Medical Outcomes Partnership.
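The manifest check in the parsing step can be illustrated with a minimal sketch. The manifest format used here (a mapping from table name to declared row count) and the function name `check_payload` are assumptions made for illustration; the source does not describe the actual N3C manifest layout.

```python
def check_payload(manifest, payload):
    """Compare a submitted payload against its manifest.

    manifest: dict mapping table name -> declared row count
              (hypothetical format, assumed for this sketch)
    payload:  dict mapping table name -> list of row dicts
    Returns a list of problems; an empty list means the check passed.
    """
    problems = []
    # Every table declared in the manifest must be present with the declared size.
    for table, declared in manifest.items():
        if table not in payload:
            problems.append(f"missing table: {table}")
        elif len(payload[table]) != declared:
            problems.append(
                f"row count mismatch in {table}: "
                f"declared {declared}, found {len(payload[table])}"
            )
    # Tables that arrive without being declared are also reported.
    for table in payload:
        if table not in manifest:
            problems.append(f"undeclared table: {table}")
    return problems
```

A payload that returns an empty list would move on to the OMOP transformation step; anything else would be sent back to the site with the listed problems.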
Figure 2.
Vital sign coverage visualization, N3C OMOP sites. This heatmap is representative of those that we sent to sites to provide them with benchmarked lab and vital coverage information. The rows represent concept sets for vital signs and the columns are individual sites. The cell colors reflect the z-score of the percentage of COVID inpatients at each site that have at least 1 lab or vital of that type recorded during their hospitalization. The bluer the color, the higher the percentage of COVID inpatients that have that vital sign at that site; redder shades mean a lower percentage of patients with that vital. Rows and columns are hierarchically clustered, bringing similar sites closer together, and similar vitals closer together. This visualization enables sites to compare their data coverage with other sites using the same data model. (Site numbers are anonymized and have been changed from the site numbers used inside the N3C Enclave.) N3C: National COVID Cohort Collaborative; OMOP: Observational Medical Outcomes Partnership.
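As a rough sketch of the quantity behind each heatmap cell, the code below converts per-site coverage percentages into z-scores. It assumes the z-score is computed within each concept set (row) across sites and uses the population standard deviation; the caption does not specify either choice, and the function name and input layout are hypothetical.

```python
from statistics import mean, pstdev

def coverage_z_scores(coverage):
    """coverage: dict concept_set -> dict site -> percentage of COVID
    inpatients with at least 1 record of that type.
    Returns the same shape, with z-scores computed within each concept
    set across sites (an assumption; the source does not say how the
    z-score is scoped)."""
    z = {}
    for concept, by_site in coverage.items():
        values = list(by_site.values())
        mu = mean(values)
        sigma = pstdev(values)
        z[concept] = {
            # If all sites have identical coverage, every z-score is 0.
            site: 0.0 if sigma == 0 else (pct - mu) / sigma
            for site, pct in by_site.items()
        }
    return z
```

A site with a strongly negative z-score for a concept set records that vital sign for a smaller share of its COVID inpatients than its peers, which is the signal the heatmap is meant to surface.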
Figure 3.
Improved percentages of valid COVID-19 test results across 11 N3C sites. The 11 sites shown here each had initial N3C submissions with high numbers of invalid (null, nonstandard) COVID test results. As time moves forward (left to right on the x-axis), drastic improvements are made following feedback from N3C. The blue line and shaded area represent the mean and standard deviation across all sites. N3C: National COVID Cohort Collaborative.
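The metric tracked in this figure can be sketched as a simple percentage. The set of accepted result values below is an assumption for illustration only; the source says only that null and nonstandard results counted as invalid.

```python
# Assumed set of standard result values; not taken from the source.
VALID_RESULTS = {"positive", "negative"}

def percent_valid(results):
    """results: list of raw test result strings, or None for missing values.
    Returns the percentage of results that are non-null and standard,
    or None when there are no results to evaluate."""
    if not results:
        return None
    valid = sum(
        1 for r in results
        if r is not None and r.strip().lower() in VALID_RESULTS
    )
    return 100.0 * valid / len(results)
```

Computing this value per site and per submission date gives one line per site of the kind plotted along the x-axis.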
Figure 4.
In A, one site’s initial N3C submission had a proportion of visits of type inpatient far above that of similar sites; in B, 4 sites’ initial submissions had no (or nearly no) inpatient visits. Our feedback encouraged the sites to re-examine and revise their source-to-CDM visit type mappings. In these cases, proportions improved. The shaded area reflects the mean and standard deviation of all sites. CDM: Common Data Model; N3C: National COVID Cohort Collaborative.
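One way a centralized check could surface such sites is sketched below: each site's inpatient share is compared with the mean and standard deviation of the other sites. This leave-one-out rule, the threshold `k`, and the function name are assumptions for illustration, not the procedure described in the source.

```python
from statistics import mean, pstdev

def flag_outlier_sites(inpatient_share, k=2.0):
    """inpatient_share: dict site -> fraction of visits typed as inpatient.
    Flags a site when its share is more than k standard deviations from
    the mean of all other sites (leave-one-out, an assumed heuristic)."""
    flagged = []
    for site, share in inpatient_share.items():
        others = [v for s, v in inpatient_share.items() if s != site]
        if len(others) < 2:
            continue  # too few peers to benchmark against
        mu, sigma = mean(others), pstdev(others)
        if abs(share - mu) > k * sigma:
            flagged.append(site)
    return sorted(flagged)
```

When several sites are off in the same direction, as in panel B, they inflate each other's benchmark and can mask one another, so a rule like this would complement rather than replace visual review.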
Figure 5.
Comparing sites within centralized data. One of the starkest differences we have observed among sites is the way that a “visit” (or encounter) is defined. Indeed, inpatient visits at several N3C sites are made up of a number (at times hundreds) of “microvisits”: consults with different specialists, imaging, infusions, et cetera. Because sites define inpatient visits so differently, they are difficult to harmonize. Centralized data make it easier to compare how sites define visits and develop derivative variables to enable harmonization. N3C: National COVID Cohort Collaborative.
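A derivative variable of this kind could, for example, collapse overlapping microvisits into a single longer visit per patient. The sketch below uses plain interval merging on visit dates; it is an illustrative assumption, not the harmonization algorithm used by N3C, and the tuple layout and function name are hypothetical.

```python
from datetime import date

def merge_microvisits(visits):
    """visits: list of (patient_id, start_date, end_date) tuples.
    Returns a list in which visits for the same patient that overlap or
    share a boundary date are merged into one (patient_id, start, end)."""
    merged = []
    # Sorting orders by patient, then start date, then end date.
    for pid, start, end in sorted(visits):
        if merged and merged[-1][0] == pid and start <= merged[-1][2]:
            # Extend the current merged visit if this one overlaps it.
            last = merged[-1]
            merged[-1] = (pid, last[1], max(last[2], end))
        else:
            merged.append((pid, start, end))
    return merged
```

Applied to a site that records each consult or imaging event as its own visit, this would yield visit counts and lengths of stay that are more comparable with sites that record one visit per hospitalization.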

