Review

The role of metadata in reproducible computational research

Jeremy Leipzig et al. Patterns (N Y). 2021 Sep 10;2(9):100322. doi: 10.1016/j.patter.2021.100322.

Abstract

Reproducible computational research (RCR) is the keystone of the scientific method for in silico analyses, packaging the transformation of raw data to published results. In addition to its role in research integrity, improving the reproducibility of scientific studies can accelerate evaluation and reuse. This potential and wide support for the FAIR principles have motivated interest in metadata standards supporting reproducibility. Metadata provide context and provenance to raw data and methods and are essential to both discovery and validation. Despite this shared connection with scientific data, few studies have explicitly described how metadata enable reproducible computational research. This review employs a functional content analysis to identify metadata standards that support reproducibility across an analytic stack consisting of input data, tools, notebooks, pipelines, and publications. Our review provides background context, explores gaps, and discovers component trends of embeddedness and methodology weight from which we derive recommendations for future work.

Keywords: FAIR; RCR; containers; metadata; notebooks; ontologies; pipelines; provenance; replicability; reproducibility; reproducible computational research; reproducible research; semantic; software dependencies; workflows.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Whitaker's matrix of reproducibility; made available under the Creative Commons Attribution license (CC-BY 4.0).

Figure 2
Case studies in reproducible research. The term "case studies" is used in a general sense to describe any study of reproducibility. A reproduction is an attempt to arrive at comparable results with identical data using the computational methods described in a paper. A refactor reworks existing code into frameworks and reproducible best practices while preserving the original data. A replication involves generating new data and applying existing methods to achieve comparable results. A test of robustness applies various protocols, workflows, statistical models, or parameters to a given dataset to study their effect on results. A census is a high-level tabulation conducted by a third party. A survey is a questionnaire sent to practitioners. A case narrative is an in-depth first-person account. An independent discussion uses a secondary independent author to interpret the results of a study as a means to improve inferential reproducibility.

Figure 3
Reproducibility. Censuses like this one by Obels et al. measure data and code availability and reproducibility, in this case over a corpus of 118 studies, 62 of which were psychology studies that had preregistered a Registered Report (RR).

Figure 4
Ecological Metadata Language. Geographic and temporal EML metadata and the associated display on the Knowledge Network for Biocomplexity (KNB), from Halpern et al.

Figure 5
STATO. Concepts describing a linear mixed model, as used by STATO.

Figure 6
Common Workflow Language. Snippets of a COVID-19 variant detection CWL workflow, and the workflow as viewed through the cwl-viewer. Note the EDAM file format definitions.
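To make the caption's point concrete, here is a minimal CWL sketch, not the workflow from the figure: the tool wrapper (`call_variants.cwl`) and input names are hypothetical, but the `format` fields show how EDAM IRIs annotate file inputs so that a workflow's data types are machine-readable.

```yaml
# Hypothetical minimal CWL workflow; illustrates EDAM format annotations only.
cwlVersion: v1.2
class: Workflow
inputs:
  reads:
    type: File
    format: http://edamontology.org/format_1930   # FASTQ
  reference:
    type: File
    format: http://edamontology.org/format_1929   # FASTA
outputs:
  variants:
    type: File
    outputSource: call_variants/vcf
steps:
  call_variants:
    run: call_variants.cwl   # hypothetical tool wrapper
    in:
      reads: reads
      reference: reference
    out: [vcf]
```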
Figure 7
Singapore Framework application profile model.

Figure 8
tximeta. A high-level schematic of tximeta.

