PLoS One. 2019 Apr 11;14(4):e0213013.
doi: 10.1371/journal.pone.0213013. eCollection 2019.

Reproducible big data science: A case study in continuous FAIRness

Ravi Madduri et al. PLoS One.

Erratum in

Abstract

Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. A high-level view of the TFBS identification workflow, showing the six principal datasets, labeled D1–D6, and the five computational phases.
Fig 2. Network topology showing the distributed environment that was used to generate the six principal datasets, labeled D1–D6, and the locations of the five computational phases.
Fig 3. An example BDBag, with contents in the data folder, description in the metadata folder, and other elements providing data required to fetch remote elements (fetch.txt) and validate its components.
Fig 4. A minid landing page for a BDBag generated by the encode2bag tool, showing the associated metadata, including locations (in this case, just one).
Fig 5. The encode2bag portal.
The user has entered an ENCODE query for urinary bladder DNase-seq data and clicked “Create BDBag.” The portal generates a Minid for the BDBag and a Globus link for reliable, high-speed access.
Fig 6. Our DNase-seq ensemble footprinting workflow, used to implement two of the computational phases of Fig 1.
The master workflow A takes a BDBag produced by an earlier phase as input. It executes from top to bottom, using subworkflows B and C to implement one phase and then subworkflow D to implement the next. It produces as output BDBags containing aligned DNase-seq data and footprints, with the latter serving as input to a later phase.
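
The Fig 3 caption describes a bag whose local contents sit in a data folder, whose remote elements are listed in fetch.txt, and whose components can be validated. The following is a minimal sketch of that idea using only the Python standard library. It is not the BDBag tooling itself: the helper names (`write_bag`, `validate_bag`), the example URL, and the placeholder checksum are hypothetical, and the file names (`bagit.txt`, `manifest-sha256.txt`, `fetch.txt`) are assumed to follow the general BagIt convention rather than taken from the paper.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_bag(root: Path, local_files: dict[str, bytes],
              remote_files: list[tuple[str, int, str, str]]) -> None:
    """Write a minimal bag-like layout (hypothetical helper).

    remote_files holds (url, length, name, sha256) tuples for payload
    files that are referenced but not stored in the bag.
    """
    (root / "data").mkdir(parents=True)
    (root / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    manifest, fetch = [], []
    for name, content in local_files.items():
        target = root / "data" / name
        target.write_bytes(content)
        manifest.append(f"{sha256_of(target)}  data/{name}")
    for url, length, name, checksum in remote_files:
        # Remote files appear in the manifest and in fetch.txt.
        manifest.append(f"{checksum}  data/{name}")
        fetch.append(f"{url} {length} data/{name}")
    (root / "manifest-sha256.txt").write_text("\n".join(manifest) + "\n")
    (root / "fetch.txt").write_text("\n".join(fetch) + "\n")


def validate_bag(root: Path) -> dict[str, str]:
    """Classify each manifest entry as ok, corrupt, remote, or missing."""
    remote = set()
    fetch_file = root / "fetch.txt"
    if fetch_file.exists():
        for line in fetch_file.read_text().splitlines():
            if line.strip():
                _url, _length, path = line.split(maxsplit=2)
                remote.add(path)
    status = {}
    for line in (root / "manifest-sha256.txt").read_text().splitlines():
        if not line.strip():
            continue
        expected, path = line.split(maxsplit=1)
        target = root / path
        if target.exists():
            status[path] = "ok" if sha256_of(target) == expected else "corrupt"
        elif path in remote:
            status[path] = "remote"  # not yet fetched
        else:
            status[path] = "missing"
    return status


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example_bag"
        write_bag(bag, {"notes.txt": b"local payload"},
                  [("https://example.org/reads.bam", 1024, "reads.bam", "0" * 64)])
        print(validate_bag(bag))
```

Listing remote files in the manifest with their checksums is what lets a small bag stand in for a large dataset: the bag can be exchanged cheaply, and each remote file can still be verified once it has been fetched.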
