Gigascience. 2022 Nov 21:11:giac105.
doi: 10.1093/gigascience/giac105.

Making Common Fund data more findable: catalyzing a data ecosystem

Amanda L Charbonneau et al. Gigascience. 2022.

Abstract

The Common Fund Data Ecosystem (CFDE) has created a flexible system of data federation that enables researchers to discover datasets from across the US National Institutes of Health Common Fund without requiring that data owners move, reformat, or rehost those data. This system is centered on a catalog that integrates detailed descriptions of biomedical datasets from individual Common Fund Programs' Data Coordination Centers (DCCs) into a uniform metadata model that can then be indexed and searched from a centralized portal. This Crosscut Metadata Model (C2M2) supports the wide variety of data types and metadata terms used by individual DCCs and can readily describe nearly all forms of biomedical research data. We detail its use to ingest and index data from 11 DCCs.
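The federation idea in the abstract can be illustrated with a minimal sketch: each DCC keeps its own files and field names, and only a uniform description is added to a searchable catalog. Everything below is an assumption made for illustration, including the `FileRecord` fields, the two DCC names, the adapter functions, and the example URLs; none of it is taken from the C2M2 specification or the CFDE portal code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Uniform description of one file (illustrative fields, not the C2M2 spec)."""
    dcc: str
    local_id: str
    data_type: str
    url: str  # points at the DCC's own hosting; the data are not copied


def from_dcc_a(row: dict) -> FileRecord:
    # Hypothetical DCC "A" uses the keys file_id / type / location.
    return FileRecord("DCC_A", row["file_id"], row["type"].lower(), row["location"])


def from_dcc_b(row: dict) -> FileRecord:
    # Hypothetical DCC "B" uses the keys accession / datatype / href.
    return FileRecord("DCC_B", row["accession"], row["datatype"].lower(), row["href"])


def build_index(records):
    """Group uniform records by data type so one query spans every DCC."""
    index = {}
    for rec in records:
        index.setdefault(rec.data_type, []).append(rec)
    return index


catalog = [
    from_dcc_a({"file_id": "A-001", "type": "RNA-seq",
                "location": "https://dcc-a.example/files/A-001"}),
    from_dcc_b({"accession": "B42", "datatype": "rna-seq",
                "href": "https://dcc-b.example/data/B42"}),
    from_dcc_b({"accession": "B43", "datatype": "Proteomics",
                "href": "https://dcc-b.example/data/B43"}),
]
index = build_index(catalog)
hits = index.get("rna-seq", [])
for rec in hits:
    print(rec.dcc, rec.local_id, rec.url)
```

The search returns records from both hypothetical DCCs, while each `url` still points to the original host, which mirrors the claim that data owners need not move, reformat, or rehost their data.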

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1:
A simplified entity relationship diagram for the C2M2 where most association and controlled vocabulary tables are collapsed into the main entity tables. Each entity is shown as a large table; fields outlined in black are columns of the tables in which they appear; fields labeled with blue diamonds abbreviate connections to container entities and their related associations. Fields marked with green circles indicate controlled vocabulary terms, and fields outlined in gray indicate association-table connections to entities named therein. Gray-bordered fields with green circles indicate connections between bare CV terms (included in the entity tables) and master controlled vocabulary term tables, which track global term usage (term tables are described in detail in the Results section). Lines are drawn connecting fields that participate in interentity relationships. Boxes on these paths name association tables that instantiate these connections but do not explicitly list those tables’ fields.
Figure 2:
CFDE submission process. A DCC initiates the submission process by creating a new set of TSVs that meet the C2M2 requirements, running a CFDE tool to build term tables, and submitting the entire datapackage. The cfde-submit CLI then performs a lightweight validation of the submission data, starts the data upload to CFDE's servers (step 1), and initiates processing in the cloud (step 2). Cloud processing is managed by Globus Flows, a Globus software-as-a-service (SaaS) offering running in the AWS cloud. CFDE's submission process is one of many “flows” that the Flows service manages, and the final action of cfde-submit is to start a run of the CFDE submission flow. The CFDE submission flow moves the submitted data to a permanent location (step 3), sets access permissions (not shown), and executes code on a CFDE server (step 4) that ingests the submitted data into the CFDE portal's database service, Deriva. While processing is happening in the cloud (steps 2–3), the submission's status can be checked using cfde-submit, but the submission does not appear in the CFDE portal until step 4. At that point, the DCC uses the CFDE portal to review and approve (or reject) the datapackage (step 5). Deriva then merges the new datapackage into a test catalog before finally publishing it to the public catalog (step 6), making it searchable by anyone at the CFDE portal.
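The numbered steps in the Figure 2 caption can be read as a simple ordered pipeline. The sketch below is only a model of that ordering, not CFDE or Globus code: the `Step` names and the helper functions are hypothetical, and the rule that a rejected datapackage stops before publication is an assumption inferred from the caption.

```python
from enum import IntEnum


class Step(IntEnum):
    """Step numbers follow the Figure 2 caption; the names are illustrative."""
    UPLOAD = 1             # cfde-submit uploads data to CFDE's servers
    CLOUD_PROCESSING = 2   # the submission flow starts in the cloud
    MOVE_TO_PERMANENT = 3  # data moved to a permanent location
    INGEST = 4             # data ingested into the portal's database
    DCC_REVIEW = 5         # DCC approves or rejects in the portal
    PUBLISH = 6            # merged into the public catalog


def visible_in_portal(step: Step) -> bool:
    # Per the caption, a submission appears in the portal from step 4 onward.
    return step >= Step.INGEST


def status_via_cli_only(step: Step) -> bool:
    # During steps 2-3, status is available from the CLI but not the portal.
    return step in (Step.CLOUD_PROCESSING, Step.MOVE_TO_PERMANENT)


def run_submission(approved: bool) -> list:
    """Return the steps completed; assumes a rejection stops before publishing."""
    completed = []
    for step in Step:
        if step is Step.PUBLISH and not approved:
            break
        completed.append(step)
    return completed


for step in run_submission(approved=True):
    print(step.value, step.name, "portal" if visible_in_portal(step) else "not in portal")
```

Modeling the steps as an ordered enumeration makes the visibility rule a single comparison, which is convenient when reasoning about where a submission can be inspected at each stage.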
Figure 3:
Summary page of a submitted data package with interactive chart and summary statistics.
Figure 4:
Core data available for search at the CFDE portal over time. The sharp decrease in biosamples in October 2021 is due to replicate cell line data being more appropriately modeled as coming from a single biosample. Note that the y-axis is logarithmic, so the increases are larger than they appear: the January 2022 release, for example, contains nearly half a million (430,405) more files than the October 2021 release.

