2025 Aug 20:11:e72677.
doi: 10.2196/72677.

A Cloud-Based Platform for Harmonized COVID-19 Data: Design and Implementation of the Rapid Acceleration of Diagnostics (RADx) Data Hub

Marcos Martínez-Romero et al. JMIR Public Health Surveill.

Abstract

Background: The COVID-19 pandemic exposed significant limitations in existing data infrastructure, particularly the lack of systems for rapidly collecting, integrating, and analyzing data to support timely and evidence-based public health responses. These shortcomings hampered efforts to conduct comprehensive analyses and make rapid, data-driven decisions in response to emerging threats. To overcome these challenges, the US National Institutes of Health launched the Rapid Acceleration of Diagnostics (RADx) initiative. A key component of this initiative is the RADx Data Hub, a centralized, cloud-based platform designed to support data sharing, harmonization, and reuse across multiple COVID-19 research programs and data sources.

Objective: We aim to present the design, implementation, and capabilities of the RADx Data Hub, a cloud-based platform developed to support findable, accessible, interoperable, and reusable (FAIR) data practices and to enable secondary analyses of the COVID-19-related data contributed by a nationwide network of researchers.

Methods: The RADx Data Hub was developed on a scalable cloud infrastructure, grounded in the FAIR data principles. The platform integrates heterogeneous data types (including clinical data, diagnostic test results, behavioral data, and social determinants of health) submitted by over 100 research organizations across 46 US states and territories. The data pipeline includes automated and manual processes for deidentification, quality validation, expert curation, and harmonization. Metadata standards are enforced using tools such as the Center for Expanded Data Annotation and Retrieval (CEDAR) Workbench and BioPortal. Data files are structured using a unified specification to support consistent representation and machine-actionable metadata.

Results: As of May 2025, the RADx Data Hub hosts 187 studies and over 1700 data files, spanning 4 RADx programs: RADx Underserved Populations (RADx-UP), RADx Radical (RADx-rad), RADx Tech, and RADx Digital Health Technologies (RADx DHT). The Study Explorer and Analytics Workbench components enable researchers to discover relevant studies, inspect rich metadata, and conduct analyses within a secure cloud-based environment. Harmonized data conforming to a core set of common data elements facilitate cross-study integration and support secondary use. The platform provides persistent identifiers (digital object identifiers) for each study and supports access to structured metadata that adhere to the CEDAR specification, available in both JSON and YAML formats for seamless integration into computational workflows.

Conclusions: The RADx Data Hub successfully addresses key data integration challenges by providing a centralized, FAIR-compliant platform for public health research. Its adaptable architecture and data management practices are designed to support secondary analyses and can be repurposed for other scientific disciplines, strengthening data infrastructure and enhancing preparedness for future health crises.

Keywords: COVID-19 surveillance; FAIR data sharing; cloud-based data platform; data harmonization and integration; digital health research; health disparities; metadata standards; pandemic response informatics; public health data infrastructure; secondary data analysis.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Components of Data Hub studies. This figure illustrates the organization and structure of study data within the Data Hub. On the left, the schematic shows how study data are organized into file bundles, each containing a data file (CSV), file metadata (JSON), and a data dictionary (CSV) that defines the structure and semantics of each variable. Each study is also accompanied by structured study metadata (in JSON format) and supporting study documentation, including protocols and project information. On the right, an example data file is shown, demonstrating a tabular format in which each row corresponds to a study participant and each column represents a standardized variable (eg, race, ethnicity, age, sex, education) harmonized using the Data Hub’s common data elements.
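
The file bundle layout described in this caption can be sketched as a small data structure. This is a minimal illustration, not the Data Hub's actual implementation: the `_DATA.csv`, `_META.json`, and `_DICT.csv` suffixes, the `FileBundle` class, and the example file names are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# Hypothetical file-name suffixes; the real naming scheme is not given in the caption.
SUFFIX_TO_ROLE = {
    "_DATA.csv": "data_file",        # tabular data, one row per participant
    "_META.json": "metadata_file",   # file-level metadata
    "_DICT.csv": "dictionary_file",  # data dictionary describing each variable
}


@dataclass
class FileBundle:
    """One bundle: a data file plus its metadata and data dictionary."""
    data_file: Optional[str] = None
    metadata_file: Optional[str] = None
    dictionary_file: Optional[str] = None

    def is_complete(self) -> bool:
        # A bundle is complete only when all three parts are present.
        return None not in (self.data_file, self.metadata_file, self.dictionary_file)


def group_into_bundles(filenames: Iterable[str]) -> Dict[str, FileBundle]:
    """Group file names into bundles keyed by their shared stem."""
    bundles: Dict[str, FileBundle] = {}
    for name in filenames:
        for suffix, role in SUFFIX_TO_ROLE.items():
            if name.endswith(suffix):
                stem = name[: -len(suffix)]
                setattr(bundles.setdefault(stem, FileBundle()), role, name)
                break
    return bundles


# Hypothetical study contents: one complete bundle and one missing two parts.
bundles = group_into_bundles(
    ["survey_DATA.csv", "survey_META.json", "survey_DICT.csv", "labs_DATA.csv"]
)
incomplete = sorted(stem for stem, b in bundles.items() if not b.is_complete())
print(incomplete)
```

A completeness check of this kind is one plausible form of automated validation before a study is accepted; the caption itself does not say how bundles are verified.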
Figure 2
Overview of the Data Hub’s end-to-end data pipeline. Individual investigators and research groups collect and transmit data to the appropriate (C)DCC (coordination and data collection center). At the (C)DCCs, staff deidentify, harmonize, and submit the study data to the Data Hub. Any identified issues are communicated to data contributors, creating a feedback loop for iterative improvements and alignment with Data Hub standards. Once validated, the data are approved, curated, stored in the Data Hub, and then made accessible to users. The arrows reflect the flow of data through the pipeline and are not intended to represent the full range of user interactions with the system.
Figure 3
Study Explorer. The figure shows a screenshot of the Data Hub Study Explorer, with the “study population focus” filter applied to retrieve 88 studies on underserved or vulnerable populations. A search box at the top, equipped with an autocomplete feature, allows users to quickly find studies or variables by entering relevant keywords. The interface includes Studies and Variables tabs, enabling users to explore not only study-level metadata but also variable-level details across studies. The filtered results are displayed in a table that includes information such as study names, Database of Genotypes and Phenotypes accessions, sample sizes, study domains, study designs, and data collection methods.
Figure 4
Study Overview page. This figure provides an overview of the information available for a selected study within the Data Hub. The panel at left summarizes key metadata, such as the Database of Genotypes and Phenotypes (dbGaP) study accession, program, study domain, and data collection methods. The panel at right is divided into Study Documents and Data Files. The Study Documents section lists supporting files, such as study protocols and README files. The Data Files section contains the data files and their corresponding metadata, along with a button to request access to the data through the study’s dbGaP page.
Figure 5
Metadata Viewer. The left side of the figure displays metadata for a specific data file, with terms constrained to ontology terms, linked to BioPortal for further exploration. The right side demonstrates a link from the “subject identifier” field to the corresponding term in the Medical Subject Headings (MeSH). Metadata are organized into expandable and collapsible sections for easy navigation, with help text providing guidance and context. The Metadata Viewer renders a Center for Expanded Data Annotation and Retrieval (CEDAR) template designed for Data Hub files, ensuring consistency and accessibility across metadata records.
Figure 6
Public data and Analytics Workbench. This figure demonstrates 2 key components of the Data Hub that support data accessibility and exploratory analysis. The left panel shows the Public Data page, where users can browse and download openly available, harmonized synthetic datasets or transfer them to the Analytics Workbench environment. Example files include harmonized data tables, codebooks, and metadata files. The right panel displays the Analytics Workbench in use, where a Jupyter Notebook is used to analyze education-level distributions using bar chart visualizations. This interface supports interactive, code-based data exploration using Python, R, and other tools, enabling secure, scalable, and reproducible analysis directly within the cloud platform.
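
As a rough idea of the kind of notebook analysis this caption describes, the sketch below tallies an education-level variable and prints a text bar chart. It uses only the Python standard library; the sample codes are invented for illustration and do not come from any Data Hub file, and the function names are hypothetical.

```python
from collections import Counter


def distribution(values):
    """Return {code: (count, share)} for a list of categorical codes."""
    counts = Counter(values)
    total = sum(counts.values())
    return {code: (n, n / total) for code, n in sorted(counts.items())}


def text_bar_chart(dist, width=20):
    """Render one line per code, with bar length proportional to its share."""
    lines = []
    for code, (n, share) in dist.items():
        bar = "#" * round(share * width)
        lines.append(f"{code:>3} | {bar} {n}")
    return lines


# Invented sample of harmonized education codes (not real study data).
sample = ["4", "4", "3", "5", "4", "3"]
dist = distribution(sample)
lines = text_bar_chart(dist)
print("\n".join(lines))
```

In an actual notebook session a plotting library would normally replace the text rendering; the counting step stays the same.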
Figure 7
Overview of the Data Hub’s software architecture. This diagram presents the architecture of the Data Hub, highlighting its integration of backend services, web features, and data access pathways. On the left, data producers—including individual investigators and (C)DCCs (coordination and data collection centers)—submit study data via secure web or secure file transfer protocol interfaces. The core platform provides system features such as the Study Explorer, Variable Catalog, Metadata Viewer, and Analytics Workbench, with user authentication managed via the National Institutes of Health’s Researcher Auth Service (RAS) and data access controlled by the Database of Genotypes and Phenotypes (dbGaP). Backend services support ontology-based metadata management (Center for Expanded Data Annotation and Retrieval [CEDAR], BioPortal), metadata-driven search (OpenSearch), and scalable analytics in a cloud-hosted workspace (Amazon SageMaker). On the right, data consumers—including researchers, National Institutes of Health staff, and system administrators—access harmonized study data through web interfaces or application programming interfaces (APIs) for exploration, analysis, and reuse. The modular architecture supports secure, interoperable, and reproducible digital health research across a wide range of public health applications.
Figure 8
Harmonization workflow for data and metadata. This figure outlines the two-step process used to harmonize incoming data files and metadata within the Data Hub. The Data Hub receives original data files, accompanied by their respective metadata files and data dictionaries (left). In step 1, the original data dictionaries are mapped to the Data Hub Global Codebook for common data elements, standardizing variable definitions and allowable values across studies. In step 2, the original data and metadata are transformed into harmonized formats: data files are transformed using mappings from step 1; metadata files are aligned to a Center for Expanded Data Annotation and Retrieval (CEDAR) metadata template; and data dictionaries are converted to the Data Hub Data Dictionary Specification (center). This process ensures the resulting harmonized data files, metadata, and dictionaries are consistent and interoperable across the Data Hub (right).
Figure 9
Data harmonization example. This figure illustrates the transformation of an original variable, "edu_years_of_school," from a RADx-UP study into a harmonized common data element (CDE) named nih_education using the Data Hub’s Global Codebook. The left panel displays original data values and their definitions as captured in the study-specific data dictionary. In the center, these original values are mapped to standardized values in the Global Codebook, aligning semantically equivalent categories across studies. For instance, “6: Bachelor’s degree” in the original dictionary maps to “4: Bachelor’s degree” in the CDE. The right panel shows the resulting harmonized data, where the variable name and values have been transformed to conform with the CDE definition, ensuring consistency and interoperability for cross-study analyses.
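
The mapping in this caption can be sketched in a few lines of Python. The variable names `edu_years_of_school` and `nih_education`, and the mapping of code 6 to code 4, come from the caption. Everything else is an assumption for illustration: the other code pairs, the `participant_id` column, the sample rows, and the choice to leave unmapped values blank and report them.

```python
import csv
import io

ORIGINAL_VARIABLE = "edu_years_of_school"  # study-specific name (from the caption)
HARMONIZED_VARIABLE = "nih_education"      # common data element (from the caption)

VALUE_MAP = {
    "6": "4",  # "6: Bachelor's degree" -> "4: Bachelor's degree" (from the caption)
    "5": "3",  # hypothetical pair, for illustration only
    "7": "5",  # hypothetical pair, for illustration only
}


def harmonize_rows(rows, value_map):
    """Rename the variable and recode its values; collect codes with no mapping."""
    harmonized = []
    unmapped = set()
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != ORIGINAL_VARIABLE}
        original = row[ORIGINAL_VARIABLE]
        if original in value_map:
            new_row[HARMONIZED_VARIABLE] = value_map[original]
        else:
            # Assumption: unmapped codes are left blank and flagged for review.
            new_row[HARMONIZED_VARIABLE] = ""
            unmapped.add(original)
        harmonized.append(new_row)
    return harmonized, unmapped


# Invented sample data; "99" deliberately has no mapping.
original_csv = "participant_id,edu_years_of_school\nP1,6\nP2,5\nP3,99\n"
rows = list(csv.DictReader(io.StringIO(original_csv)))
harmonized, unmapped = harmonize_rows(rows, VALUE_MAP)
print(harmonized)
print(sorted(unmapped))
```

Reporting unmapped codes rather than silently dropping them mirrors the feedback loop to data contributors described for the pipeline, although the caption does not specify how such cases are handled.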
Figure 10
Distribution of Data Hub studies by study domain. This histogram illustrates the distribution of studies across various research domains, highlighting the diverse focus areas of RADx studies. RADx-UP accounts for the largest share of studies, particularly in domains such as community outreach, social determinants of health, and vaccination. The notable representation of RADx-rad in areas like rapid diagnostics and biosensing underscores the emphasis on novel and experimental approaches to COVID-19 detection and monitoring. This distribution highlights the cross-programmatic diversity and strategic focus areas of the RADx initiative. Note: Only the top 20 study domains are shown; additional study domains were excluded for clarity.
Figure 11
Distribution of Data Hub studies by population focus. This chart presents the breakdown of studies based on population focus. The figure underscores the RADx-UP program’s emphasis on underserved and vulnerable populations, with substantial representation of studies involving older adults, racial and ethnic minorities, low socioeconomic status groups, rural communities, and incarcerated or institutionalized individuals. While many studies target specific at-risk groups—including children, essential workers, and people living with HIV/AIDS—a significant number also focus on general adult populations. This distribution highlights the wide demographic coverage of the Data Hub, supporting diverse and comparative analyses on health disparities, equity, and access in the context of the COVID-19 pandemic. Note: Only the top 20 population focus categories are shown; additional categories were excluded for clarity.
Figure 12
Distribution of Data Hub studies by data collection method. This chart categorizes studies based on their data collection approaches, stratified by RADx program. Survey-based and interview or focus group methods are predominant, particularly among RADx-UP studies, reflecting an emphasis on community engagement and participant-reported outcomes. RADx Tech and RADx DHT programs contribute more heavily to studies utilizing device-based methods, including molecular and antigen testing, wearables, smartphones, and contact tracing tools. Less common but innovative modalities—such as wastewater sampling, breath and chemosensory testing, and electrochemical diagnostics—demonstrate methodological breadth across the RADx portfolio. The heterogeneity of data sources supports integration and cross-study analyses for richer insights into pandemic-related health outcomes.
Figure 13
Distribution of Data Hub studies by study design. This chart illustrates the frequency of study designs used across the Data Hub, categorized by program. Longitudinal cohort designs are by far the most common, particularly within RADx-UP, reflecting a focus on long-term, community-based public health research. Interventional or clinical trial designs also feature prominently, especially in RADx Tech and RADx DHT, which emphasize testing and digital health innovation. Additional designs—including cross-sectional, case-control, device validation and verification studies, and mixed methods—demonstrate methodological diversity across the platform. This variety supports a wide range of analyses, from exploratory to evaluative, across observational and experimental frameworks.
