e!DAL--a framework to store, share and publish research data

Daniel Arend¹, Matthias Lange, Jinbo Chen, Christian Colmsee, Steffen Flemming, Denny Hecht, Uwe Scholz

PMID: 24958009
PMCID: PMC4080583
DOI: 10.1186/1471-2105-15-214

Daniel Arend et al. BMC Bioinformatics. 2014 Jun 24;15:214. doi: 10.1186/1471-2105-15-214.

Affiliation

¹ Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), OT Gatersleben, Corrensstr. 3, 06466 Stadt Seeland, Germany. arendd@ipk-gatersleben.de

Abstract

Background: The life-science community faces a major challenge in handling "big data", highlighting the need for high-quality infrastructures capable of sharing and publishing research data. Data preservation, analysis, and publication are the three pillars of the "big data life cycle". The infrastructures currently available for managing and publishing data are often designed to meet domain-specific or project-specific requirements, resulting in the repeated development of proprietary solutions and lower-quality data publication and preservation overall.

Results: e!DAL is a lightweight software framework for publishing and sharing research data. Its main features are version tracking, metadata management, information retrieval, registration of persistent identifiers (DOI), an embedded HTTP(S) server for public data access, access as a network file system, and a scalable storage backend. e!DAL is available as an API for local non-shared storage and as a remote API for distributed applications. It can be deployed "out-of-the-box" as an on-site repository.

Conclusions: e!DAL was developed from decades of experience in research data management at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK). Initially developed as a data publication and documentation infrastructure for the IPK's role as a data center in the DataCite consortium, e!DAL has grown into a general data archiving and publication infrastructure. The e!DAL software has been deployed to the Maven Central Repository. Documentation and software are also available at: http://edal.ipk-gatersleben.de.

Figures

**Figure 1**
**Publication Process of Research Data Sample.** The data-publication process (inspired by Gray et al. [11]) expresses the different manifestations of research data. At the *top layer* of the process, the journal, author, or scientist takes full responsibility for the publication, including the aggregated data embedded in it and the way the data is presented. For data published in the *second layer*, as supplementary files to articles, the link to the published “Record of Science” remains strong; but it is not always clear at what level the data is curated and preserved and if the criteria for discoverability and re-usability are met. At the *Data Collections and Structured Database layer*, the publication includes a citation and links to the data; but the data resides in and is the responsibility of a separate repository. At the *bottom layer*, most datasets remain unpublished and are consequently not accessible for later reanalysis.

**Figure 2**
**The *e!DAL* data schema.** Conceptual overview of the major entities of the *e!DAL* infrastructure.

**Figure 3**
**Data publication workflow.** To ensure a trusted release of research data, a review process has been designed. The first step for a data-publication request is the generation of a “landing page” for the applied citable identifier. The underlying URL is served by the embedded HTTP server. If a dataset has a release date in the future, the page locks the data download. If the user requested a DOI from DataCite, the system generates a unique DOI and migrates the metadata to DataCite-XML format. After the reviewer approves the publication, the DOI request is sent to the DataCite REST web service. Finally, the user gets an email notification with the accepted DOI or URL.
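
The workflow in this caption can be read as a small sequence of state changes. The following Python sketch is illustrative only and is not the e!DAL API: the class `PublicationRequest`, the functions `submit`, `review`, and `download_locked`, and the DOI prefix passed in are all hypothetical names chosen for this example, and the DataCite call and email are replaced by log entries.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class PublicationRequest:
    """Hypothetical stand-in for a data-publication request (not an e!DAL class)."""
    title: str
    release_date: date
    wants_doi: bool
    doi: Optional[str] = None
    approved: bool = False
    log: List[str] = field(default_factory=list)


def download_locked(req: PublicationRequest, today: date) -> bool:
    # The landing page locks the download while the release date is in the future.
    return req.release_date > today


def submit(req: PublicationRequest, doi_prefix: str, serial: int) -> None:
    # Step 1: generate the landing page for the citable identifier.
    req.log.append("landing page generated")
    # Step 2: if a DOI was requested, generate one and convert the metadata.
    if req.wants_doi:
        req.doi = f"{doi_prefix}/{serial}"  # placeholder identifier scheme
        req.log.append("metadata converted to DataCite XML")


def review(req: PublicationRequest, accept: bool) -> None:
    # Step 3: the DOI request is sent only after the reviewer approves.
    req.approved = accept
    if accept and req.wants_doi:
        req.log.append(f"DOI request sent: {req.doi}")
    # Step 4: the user is notified once the publication is accepted.
    if accept:
        req.log.append("user notified by email")
```

In this sketch the embargo check is kept separate from the review step, mirroring the caption: a dataset can be approved and still have its download locked until the release date passes.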

**Figure 4**
**Architecture of *e!DAL*-API.** The green nodes are the parts of the core *e!DAL*-API, the *e!DAL*-server, and the *e!DAL*-client packages. The yellow nodes represent the implementation interface, and the blue nodes represent the backend components. The red nodes symbolize possible applications.

**Figure 5**
**Performance benchmark.** Performance tests for the local embedded architecture (left) and the server-client architecture (right). We used data sets with 10,000 files in 100 folders, but with different file sizes (0.1, 0.5, and 1.0 MB). The left y-axis shows the time required to store all of the objects and read them again to a new directory. The right y-axis shows the performance of the index-based search. Using the read/store test set, we sent queries that returned exactly 10 results each. All tests were executed on a Linux system with a six-core AMD Phenom II X6 1055T processor at 2.8 GHz and 64 MB heap space for the Java virtual machine. The system had a 1-GB Ethernet connection and a SATA hard disk (7200 rpm).
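
To put the benchmark parameters in perspective, the total data volume per test set follows directly from the file count and file size. The short Python sketch below works through that arithmetic; the function names are hypothetical, the even split of files across folders is an assumption not stated in the caption, and no timing values are taken from the figure.

```python
FILES = 10_000                    # files per test set (from the caption)
FOLDERS = 100                     # folders per test set (from the caption)
FILE_SIZES_MB = (0.1, 0.5, 1.0)   # file sizes tested (from the caption)

# Assumption: files are spread evenly across folders.
FILES_PER_FOLDER = FILES // FOLDERS


def total_size_mb(file_size_mb: float) -> float:
    """Total volume of one test set in MB."""
    return FILES * file_size_mb


def throughput_mb_per_s(file_size_mb: float, elapsed_s: float) -> float:
    """Average throughput for a store-or-read pass that took elapsed_s seconds."""
    return total_size_mb(file_size_mb) / elapsed_s


if __name__ == "__main__":
    for size in FILE_SIZES_MB:
        print(f"{size} MB files: {total_size_mb(size):.0f} MB in total")
```

The three test sets therefore hold roughly 1,000 MB, 5,000 MB, and 10,000 MB of data, each far larger than the 64 MB of heap space given to the Java virtual machine.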

**Figure 6**
**EdalFileChooser dialog.** The eDAL-FileChooser dialog comprises the following components: (1) a file tree to navigate through the stored directories, (2) a window to display all files and subdirectories in the chosen folder, (3) text fields to display the meta information of the chosen version (to change the meta information, the user has to double-click on a text field), (4) a table to show all stored versions of a digital object (the user switches between versions by marking a field), (5) a text field to show the complete path of the current object, (6) a text field for the search function, and (7) open dialogs to change permissions or metadata.

References

1. Craddock T, Harwood CR, Hallinan J, Wipat A. e-Science: relieving bottlenecks in large-scale genome analyses. Nat Rev Microbiol. 2008;6(12):948–954.
2. Brooksbank C, Bergman MT, Apweiler R, Birney E, Thornton J. The European Bioinformatics Institute’s data resources 2014. Nucleic Acids Res. 2013;42:D18–D25. doi:10.1093/nar/gkt1206.
3. Roos DS. Computational biology: bioinformatics–trying to swim in a sea of data. Science. 2001;291(5507):1260–1261.
4. Fernández-Suárez XM, Galperin MY. The 2013 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res. 2013;41(D1):1–7.
5. Kodama Y, Shumway M, Leinonen R. International Nucleotide Sequence Database Collaboration: the Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 2012;40(Database issue):D54–D56. doi:10.1093/nar/gkr854.
