Sci Data. 2024 May 22;11(1):524.
doi: 10.1038/s41597-024-03349-2.

From Planning Stage Towards FAIR Data: A Practical Metadatasheet For Biomedical Scientists

Lea Seep et al. Sci Data.

Abstract

Datasets consist of measurement data and metadata. Metadata provides context, essential for understanding and (re-)using data. Various metadata standards exist for different methods, systems and contexts. However, relevant information resides at differing stages across the data-lifecycle. Often, this information is defined and standardized only at publication stage, which can lead to data loss and workload increase. In this study, we developed Metadatasheet, a metadata standard based on interviews with members of two biomedical consortia and systematic screening of data repositories. It aligns with the data-lifecycle allowing synchronous metadata recording within Microsoft Excel, a widespread data recording software. Additionally, we provide an implementation, the Metadata Workbook, that offers user-friendly features like automation, dynamic adaption, metadata integrity checks, and export options for various metadata standards. By design and due to its extensive documentation, the proposed metadata standard simplifies recording and structuring of metadata for biomedical scientists, promoting practicality and convenience in data management. This framework can accelerate scientific progress by enhancing collaboration and knowledge transfer throughout the intermediate steps of data creation.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Alignment of Metadata Lifecycle with the Research Data-Lifecycle. (A) Metadata is created alongside the research data but is often gathered only at the point of publication, when it is requested by, e.g., repositories; this marks a clear decisive point before the produced data become openly accessible. (B) The structure of the proposed Metadatasheet is defined by its sections, which in turn encompass segments. Within each segment, user input is required, which can take different forms, e.g., values for keys or table entries. (C) Once records are complete, the Metadata Workbook can export either to a plain xlsx file or to the requested NCBI GEO metadata format. Deposited data can be accessed by a plethora of tools (examples given). Outside the workbook, a single xlsx file can be converted to a SummarizedExperiment object for data analysis, and multiple Metadatasheets can be transformed to xml files using the provided ontology to build the input for a topic-centred database.
Fig. 2
Example of an instance of the Planning section. (A) Overview of the Planning section. (B) The General segment contains contact information and general project information in the form of key:value pairs; on its second level, linked Metadatasheets can be specified. (C) The Experimental System segment requests keys depending on the value given to the key ‘Experimental System’. For tissue type, the controlled vocabulary encompasses ontology terms taken from the BRENDA Tissue Ontology (BTO). (D) Comparison group segment; here the only comparison group is ‘diet’ (other comparison group options, such as treatment, are not shown). As six groups are requested by the user, a table with six columns is present (only two shown). Information per specified group is expected column-wise. Note that the full Metadatasheet of this example can be found in the Supplementary Material.
Fig. 3
Example of an instance of the Conduction section. (A) Overview of the Conduction section. (B) The ‘total_groups’ segment expects all possible combinations of the comparison groups defined in the Planning section. The number of replicates belongs underneath each group. In the Metadatasheet implementation, ‘final_groups’ are generated; pink colour marks an expected table. (C) The covariates/constants segment requests the respective specification, including units. For constants, the value is expected in place, whereas covariate values are expected within the Measurement-Matching table. (D) The Time-Dependence-timeline segment collapses completely if not required. (E) The Preparation segment expects the procedure that is required before the actual measurement. Here, a reference to either a fixed protocol chosen from the controlled vocabulary or a filename is expected. The specified file is expected to be on the same level as the Metadatasheet in the filesystem. (F) The Measurement segment requests keys depending on the value given to the key ‘measurement type’. (G) The DataFiles-Linkage segment specifies how to identify the correct measurement file given the personal ID that is specified subsequently in the Measurement-Matching section. If there is no clear pattern, one can choose the keyword ‘CHANGES’ to move filename specification to the Measurement-Matching section. Note that the full Metadatasheet of this example can be found in the Supplementary Material.
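The combination step described in panel (B) can be sketched in Python. This is a minimal illustration under stated assumptions, not the Metadata Workbook's implementation: the function name `total_groups`, the second comparison group `treatment`, all level names, and the dot-joined label format are hypothetical.

```python
from itertools import product


def total_groups(comparison_groups):
    """Return every combination of levels across the comparison groups.

    comparison_groups maps a group name (e.g. "diet") to its list of levels.
    The dot-joined label format is an assumption made for this sketch.
    """
    names = list(comparison_groups)
    combos = product(*(comparison_groups[name] for name in names))
    return [".".join(combo) for combo in combos]


# Hypothetical comparison groups and levels, for illustration only
groups = {"diet": ["chow", "high_fat"], "treatment": ["vehicle", "drug"]}
labels = total_groups(groups)
```

With a single comparison group, as in the ‘diet’ example, the combinations reduce to the levels of that group.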
Fig. 4
Advanced example of segments within the Conduction section. (A) Within the Time-Dependence-timeline segment, given comparison groups can be enriched with time-dependent information on the second hierarchy level. One specifies which of the comparison groups is to be enriched with timeline information, along with the unit of time. Then, time-steps can be specified. Pink colour marks the table that needs to be filled. (B) Within the Preparation segment, one can supply up to two divisions of the original experimental system sample. Here, two cell types are isolated from the liver of mice. The liver isolation follows the same protocol, while the cell type isolations follow differing protocols. The respective files are expected to be on the same level as the Metadatasheet in the filesystem.
Fig. 5
Example of an instance of the Measurement-Matching section. (A) Overview of the Measurement-Matching section. (B) An ID-specific metadata table example with the minimal number of required rows. The yellow-marked cells hold the measurement IDs (‘personal_ID’) required to match each metadata column with the respective measured data. ‘NA’ indicates non-available information (‘Diet’ is the only comparison group specified). The last two rows indicate that neither subsamples nor subsubsamples are needed in this instance. The table is cropped column-wise; based on the previous final groups and the given replicates, a total of 30 columns is expected in the full table. Note that the full Metadatasheet of this example can be found in the Supplementary Material.
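The expected column count follows from the final groups and their replicates: one column per replicate of each group. The sketch below illustrates that arithmetic only. The caption does not state the replicate numbers, so five replicates for each of six diet groups is an assumption chosen because it is consistent with 30 columns; the function name `expected_columns` and the group labels are likewise hypothetical.

```python
def expected_columns(replicates_per_group):
    """List one (group, replicate index) pair per expected table column."""
    columns = []
    for group, n_replicates in replicates_per_group.items():
        for replicate in range(1, n_replicates + 1):
            columns.append((group, replicate))
    return columns


# Assumed values: six diet groups with five replicates each (not from the source)
replicates = {f"diet_{i}": 5 for i in range(1, 7)}
columns = expected_columns(replicates)
```

Under these assumed values, `len(columns)` gives the number of columns the full table would hold.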
