Review

The role of metadata in reproducible computational research

Jeremy Leipzig et al. Patterns (N Y). 2021 Sep 10;2(9):100322. doi: 10.1016/j.patter.2021.100322.

Abstract

Reproducible computational research (RCR) is the keystone of the scientific method for in silico analyses, packaging the transformation of raw data to published results. In addition to its role in research integrity, improving the reproducibility of scientific studies can accelerate evaluation and reuse. This potential and wide support for the FAIR principles have motivated interest in metadata standards supporting reproducibility. Metadata provide context and provenance to raw data and methods and are essential to both discovery and validation. Despite this shared connection with scientific data, few studies have explicitly described how metadata enable reproducible computational research. This review employs a functional content analysis to identify metadata standards that support reproducibility across an analytic stack consisting of input data, tools, notebooks, pipelines, and publications. Our review provides background context, explores gaps, and discovers component trends of embeddedness and methodology weight from which we derive recommendations for future work.

Keywords: FAIR; RCR; containers; metadata; notebooks; ontologies; pipelines; provenance; replicability; reproducibility; reproducible computational research; reproducible research; semantic; software dependencies; workflows.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Whitaker's matrix of reproducibility; made available under the Creative Commons Attribution license (CC-BY 4.0).

Figure 2
Case studies in reproducible research. The term "case studies" is used in a general sense to describe any study of reproducibility. A reproduction is an attempt to arrive at comparable results with identical data using the computational methods described in a paper. A refactor reworks existing code into frameworks and reproducible best practices while preserving the original data. A replication involves generating new data and applying existing methods to achieve comparable results. A test of robustness applies various protocols, workflows, statistical models, or parameters to a given dataset to study their effect on results. A census is a high-level tabulation conducted by a third party. A survey is a questionnaire sent to practitioners. A case narrative is an in-depth first-person account. An independent discussion uses a secondary independent author to interpret the results of a study as a means to improve inferential reproducibility.

Figure 3
Reproducibility. Censuses like this one by Obels et al. measure data and code availability and reproducibility, in this case over a corpus of 118 studies, 62 of which were psychology studies that had preregistered a Registered Report (RR).

Figure 4
Ecological Metadata Language. Geographic and temporal EML metadata and the associated display on the Knowledge Network for Biocomplexity (KNB), from Halpern et al.

Figure 5
STATO. Concepts describing a linear mixed model, as used by STATO.

Figure 6
Common Workflow Language. Snippets of a COVID-19 variant detection CWL workflow, and the workflow as viewed through the cwl-viewer. Note the EDAM file format definitions.
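To make the caption's point concrete, here is a minimal CWL sketch, not the workflow from the figure: the tool wrapper (`call_variants.cwl`) and input names are hypothetical, but the `format` fields show how EDAM IRIs annotate file inputs so that a workflow's data types are machine-readable.

```yaml
# Hypothetical minimal CWL workflow; illustrates EDAM format annotations only.
cwlVersion: v1.2
class: Workflow
inputs:
  reads:
    type: File
    format: http://edamontology.org/format_1930   # FASTQ
  reference:
    type: File
    format: http://edamontology.org/format_1929   # FASTA
outputs:
  variants:
    type: File
    outputSource: call_variants/vcf
steps:
  call_variants:
    run: call_variants.cwl   # hypothetical tool wrapper
    in:
      reads: reads
      reference: reference
    out: [vcf]
```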
Figure 7
Singapore Framework application profile model.

Figure 8
tximeta. A high-level schematic of tximeta.

