Microb Genom. 2023 Jan;9(1):mgen000908.
doi: 10.1099/mgen.0.000908.

The DataHarmonizer: a tool for faster data harmonization, validation, aggregation and analysis of pathogen genomics contextual information

Ivan S Gill et al. Microb Genom. 2023 Jan.

Abstract

Pathogen genomics is a critical tool for public health surveillance, infection control, outbreak investigations and research. To be useful, pathogen genomics data must be interpreted using contextual data (metadata). Contextual data include sample metadata, laboratory methods, patient demographics, clinical outcomes and epidemiological information. However, variability in how contextual information is captured by different authorities and encoded in different databases poses challenges for data interpretation, integration and use/re-use. The DataHarmonizer is a template-driven spreadsheet application for harmonizing, validating and transforming genomics contextual data into submission-ready formats for public or private repositories. The tool's web browser-based JavaScript environment enables validation, and its offline functionality and local installation increase data security. The DataHarmonizer was developed to address the data sharing needs that arose during the COVID-19 pandemic, and was used by members of the Canadian COVID Genomics Network (CanCOGeN) to harmonize SARS-CoV-2 contextual data for national surveillance and for public repository submission. To support coordination of international surveillance efforts, we have partnered with the Public Health Alliance for Genomic Epidemiology to also provide a template conforming to its SARS-CoV-2 contextual data specification for use worldwide. Templates are also being developed for One Health and foodborne pathogens. Overall, the DataHarmonizer tool improves the effectiveness and fidelity of contextual data capture as well as its subsequent usability. Harmonization of contextual information across authorities, platforms and systems globally improves interoperability and reusability of data for concerted public health and research initiatives to fight the current pandemic and future public health emergencies. Although the tool was initially developed for the COVID-19 pandemic, its expansion to other data management applications and pathogens is already underway.

Keywords: contextual data; data management; genomic surveillance; harmonization; metadata.

Conflict of interest statement

The authors declare that there are no conflicts of interest.

Figures

Fig. 1.
The DataHarmonizer interface. (a) Users can access instructional videos, the reference guide and protocols for curation. Data providers can also access field-level guidance by double-clicking on field headers. (b) Fields are colour-coded to indicate those that are required for Canadian SARS-CoV-2 surveillance (yellow), recommended (purple) and optional (grey). Users access different features, such as toggling between fields and automated filling of columns, via the control panel. (c) Validation highlights missing required information and other errors in red. The ‘Next Error’ button enables users to scroll through and resolve errors systematically. (d) Users can export their data in different submission-ready formats.
Fig. 2.
Excerpt of a JSON file used to dynamically generate the DataHarmonizer application interface and functionality.
Fig. 3.
Validation of data. Missing or incorrect values are highlighted to better direct curation efforts. Curators can systematically address errors using the ‘Next Error’ button, which disappears when all issues have been addressed.
Fig. 4.
Customized exports automate data transformation for submission to a variety of third-party databases. (a) For Canadian national surveillance, contextual data can be exported to Canadian-specific portals and databases (e.g. VirusSeq Portal, CNPHI LaSER and NML LIMS) as well as international repositories (e.g. GISAID, NCBI). (b) Export from the PHA4GE template enables formatting for international databases.
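
The template-driven validation described in the abstract and in Fig. 3 can be illustrated with a minimal sketch. It is not the DataHarmonizer's own code (the tool itself runs in a browser-based JavaScript environment); the field names, requirement levels, picklist values and the `validate` function below are all hypothetical and chosen only to show the general idea of checking rows against a template.

```python
from typing import Dict, List, Tuple

# Hypothetical template: field names, requirement levels and picklists are
# illustrative assumptions, not taken from any DataHarmonizer template.
TEMPLATE: Dict[str, dict] = {
    "specimen_id": {"requirement": "required"},
    "collection_date": {"requirement": "required"},
    "host": {"requirement": "recommended", "allowed": ["Human", "Animal"]},
    "notes": {"requirement": "optional"},
}


def validate(rows: List[Dict[str, str]],
             template: Dict[str, dict] = TEMPLATE) -> List[Tuple[int, str, str]]:
    """Return (row index, field, message) for every problem found."""
    errors: List[Tuple[int, str, str]] = []
    for i, row in enumerate(rows):
        for field, spec in template.items():
            value = (row.get(field) or "").strip()
            if not value:
                # Only required fields are reported when empty.
                if spec["requirement"] == "required":
                    errors.append((i, field, "missing required value"))
                continue
            allowed = spec.get("allowed")
            if allowed is not None and value not in allowed:
                errors.append((i, field, "value not in picklist"))
    return errors


rows = [
    {"specimen_id": "S1", "collection_date": "2021-01-05", "host": "Human"},
    {"specimen_id": "", "collection_date": "2021-01-06", "host": "Bat"},
]
errors = validate(rows)
for row_index, field, message in errors:
    print(f"row {row_index}, {field}: {message}")
```

Returning the errors as an ordered list of cell positions is one way a user interface could step through problems one at a time, in the spirit of the ‘Next Error’ button; how the tool actually tracks errors is not stated here.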
