BMC Bioinformatics. 2014 Jun 24;15:214.
doi: 10.1186/1471-2105-15-214.

e!DAL--a framework to store, share and publish research data

Daniel Arend et al. BMC Bioinformatics. 2014.

Abstract

Background: The life-science community faces a major challenge in handling "big data", highlighting the need for high-quality infrastructures capable of sharing and publishing research data. Data preservation, analysis, and publication are the three pillars in the "big data life cycle". The infrastructures currently available for managing and publishing data are often designed to meet domain-specific or project-specific requirements, resulting in the repeated development of proprietary solutions and lower-quality data publication and preservation overall.

Results: e!DAL is a lightweight software framework for publishing and sharing research data. Its main features are version tracking, metadata management, information retrieval, registration of persistent identifiers (DOI), an embedded HTTP(S) server for public data access, access as a network file system, and a scalable storage backend. e!DAL is available as an API for local non-shared storage and as a remote API featuring distributed applications. It can be deployed "out-of-the-box" as an on-site repository.
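
As an illustration of what version tracking combined with metadata management can look like, the following is a minimal in-memory sketch in Python. It is not the e!DAL API (e!DAL itself is distributed as a Java library); the names `Version`, `VersionedObject`, `store`, `current`, and `search`, as well as the example path and metadata, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    """One immutable snapshot of an object's content and metadata."""
    number: int
    content: bytes
    metadata: dict
    created: datetime


@dataclass
class VersionedObject:
    """Hypothetical stand-in for a stored digital object with a version history."""
    path: str
    versions: list = field(default_factory=list)

    def store(self, content: bytes, **metadata) -> Version:
        # Each call appends a new version; earlier versions are never overwritten.
        # Metadata not supplied again is inherited from the previous version.
        inherited = self.versions[-1].metadata if self.versions else {}
        version = Version(
            number=len(self.versions) + 1,
            content=content,
            metadata={**inherited, **metadata},
            created=datetime.now(timezone.utc),
        )
        self.versions.append(version)
        return version

    def current(self) -> Version:
        # The most recently stored version.
        return self.versions[-1]

    def search(self, key: str, value: str) -> list:
        # Naive linear metadata search; a real system would query an index.
        return [v for v in self.versions if v.metadata.get(key) == value]


# Example data (invented for illustration).
obj = VersionedObject("/barley/phenotypes.csv")
obj.store(b"plant,height\n", title="Barley phenotypes", creator="A. Author")
obj.store(b"plant,height\n1,42\n", title="Barley phenotypes (revised)")
```

In this sketch, the second call to `store` leaves the first version readable and carries the `creator` field forward, which is the behaviour a citable, versioned dataset needs: a reference to version 1 keeps resolving to the same bytes.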

Conclusions: e!DAL was developed based on decades of experience in research data management at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK). Initially developed as a data publication and documentation infrastructure for the IPK's role as a data center in the DataCite consortium, e!DAL has grown into a general data archiving and publication infrastructure. The e!DAL software has been deployed to the Maven Central Repository. Documentation and software are also available at: http://edal.ipk-gatersleben.de.

Figures

Figure 1
Publication Process of Research Data Sample. The data-publication process (inspired by Gray et al. [11]) expresses the different manifestations of research data. At the top layer of the process, the journal, author, or scientist takes full responsibility for the publication, including the aggregated data embedded in it and the way the data is presented. For data published in the second layer, as supplementary files to articles, the link to the published “Record of Science” remains strong; but it is not always clear at what level the data is curated and preserved and if the criteria for discoverability and re-usability are met. At the Data Collections and Structured Database layer, the publication includes a citation and links to the data; but the data resides in and is the responsibility of a separate repository. At the bottom layer, most datasets remain unpublished and are consequently not accessible for later reanalysis.
Figure 2
The e!DAL data schema. Conceptual overview of the major entities of the e!DAL infrastructure.
Figure 3
Data publication workflow. To ensure a trusted release of research data, a review process has been designed. The first step for a data-publication request is the generation of a “landing page” for the applied citable identifier. The underlying URL is served by the embedded HTTP server. If a dataset has a release date in the future, the page locks the data download. If the user requested a DOI from DataCite, the system generates a unique DOI and migrates the metadata to DataCite-XML format. After the reviewer approves the publication, the DOI request is sent to the DataCite REST web service. Finally, the user gets an email notification with the accepted DOI or URL.
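
The review workflow described in this caption can be sketched as a small state machine. This is a hedged illustration in Python, not e!DAL code: `PublicationRequest`, `submit`, `download_locked`, and `review` are hypothetical names, the DOI prefix and base URL are placeholders, and the DataCite registration and email steps are passed in as plain callables so that no real web service is contacted. The conversion of metadata to DataCite-XML is omitted.

```python
import uuid
from dataclasses import dataclass
from datetime import date
from enum import Enum

DOI_PREFIX = "10.5072"  # placeholder prefix, assumed for illustration only
BASE_URL = "https://repo.example.org"  # placeholder for the embedded HTTP server


class State(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class PublicationRequest:
    dataset: str
    release_date: date
    wants_doi: bool
    identifier: str
    landing_page: str
    state: State = State.REQUESTED


def submit(dataset: str, release_date: date, wants_doi: bool) -> PublicationRequest:
    # Step 1: create the landing page URL for the citable identifier.
    suffix = uuid.uuid4().hex[:8]
    landing_page = f"{BASE_URL}/data/{suffix}"
    # Step 2: generate a unique DOI if one was requested, else cite the URL.
    identifier = f"{DOI_PREFIX}/{suffix}" if wants_doi else landing_page
    return PublicationRequest(dataset, release_date, wants_doi, identifier, landing_page)


def download_locked(request: PublicationRequest, today: date) -> bool:
    # The landing page exists, but downloads stay locked until the release date.
    return request.release_date > today


def review(request: PublicationRequest, approved: bool, register_doi, notify) -> None:
    # Step 3: only a reviewer decision moves the request out of REQUESTED.
    if request.state is not State.REQUESTED:
        raise ValueError("request has already been reviewed")
    if not approved:
        request.state = State.REJECTED
        notify(f"Publication of {request.dataset} was rejected")
        return
    # Step 4: the DOI is registered only after approval.
    if request.wants_doi:
        register_doi(request.identifier, request.landing_page)
    request.state = State.APPROVED
    # Step 5: tell the user which identifier was accepted.
    notify(f"Publication of {request.dataset} accepted: {request.identifier}")
```

Passing `register_doi` and `notify` as parameters keeps the ordering of the workflow visible (approval first, registration second, notification last) while leaving the actual service calls to whatever client a deployment uses.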
Figure 4
Architecture of the e!DAL API. The green nodes are the parts of the core e!DAL API, the e!DAL server, and the e!DAL client packages. The yellow nodes represent the implementation interface, and the blue nodes represent the backend components. The red nodes symbolize possible applications.
Figure 5
Performance benchmark. Performance tests for the local embedded architecture (left) and the server-client architecture (right). We used data sets with 10,000 files in 100 folders, but with different file sizes (0.1, 0.5, and 1.0 MB). The left y-axis shows the time required to store all of the objects and read them again to a new directory. The right y-axis shows the performance of the index-based search. Using the read/store test set, we sent queries that returned exactly 10 results each. All tests were executed on a Linux system with a six-core AMD Phenom II X6 1055T processor at 2.8 GHz and 64 MB heap space for the Java virtual machine. The system had a 1-GB Ethernet connection and a SATA hard disk (7200 rpm).
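
The store-then-read measurement described in this caption can be approximated with a small timing harness. The Python sketch below is an assumption-laden stand-in, not the authors' benchmark: it copies files between plain directories instead of going through e!DAL, it does not cover the index-based search, and the function names (`make_dataset`, `timed_copy`, `run_benchmark`) and default sizes are hypothetical.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path


def make_dataset(root: Path, folders: int, files_per_folder: int, size_bytes: int) -> int:
    """Create `folders` directories, each holding equally sized random files."""
    for d in range(folders):
        folder = root / f"folder_{d:03d}"
        folder.mkdir(parents=True)
        for f in range(files_per_folder):
            (folder / f"file_{f:03d}.bin").write_bytes(os.urandom(size_bytes))
    return folders * files_per_folder


def timed_copy(src: Path, dst: Path) -> float:
    """Copy a directory tree and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    shutil.copytree(src, dst)
    return time.perf_counter() - start


def run_benchmark(folders: int = 10, files_per_folder: int = 10,
                  size_bytes: int = 100_000) -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        source, store, restored = base / "source", base / "store", base / "restored"
        count = make_dataset(source, folders, files_per_folder, size_bytes)
        store_seconds = timed_copy(source, store)    # "store all objects"
        read_seconds = timed_copy(store, restored)   # "read them to a new directory"
        restored_count = sum(1 for p in restored.rglob("*") if p.is_file())
        return {
            "files": count,
            "restored": restored_count,
            "store_seconds": store_seconds,
            "read_seconds": read_seconds,
        }
```

Calling `run_benchmark(folders=100, files_per_folder=100, size_bytes=100_000)` would mirror the 10,000-file, 0.1 MB configuration from the caption, though the timings would reflect the local file system rather than e!DAL.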
Figure 6
EdalFileChooser dialog. The EdalFileChooser dialog comprises several components: (1) a file tree to navigate through the stored directories, (2) a window to display all files and subdirectories in the chosen folder, (3) text fields to display the meta information of the chosen version (to change the meta information, the user has to double-click on a text field), (4) a table to show all stored versions of a digital object (the user switches between the versions by marking a field), (5) a text field to show the complete path of the current object, (6) a text field for the search function, and (7) dialogs to change permissions or metadata.

References

    1. Craddock T, Harwood CR, Hallinan J, Wipat A. e-Science: relieving bottlenecks in large-scale genome analyses. Nat Rev Microbiol. 2008;6(12):948–954.
    2. Brooksbank C, Bergman MT, Apweiler R, Birney E, Thornton J. The European Bioinformatics Institute’s data resources 2014. Nucleic Acids Res. 2013;42:D18–D25. doi:10.1093/nar/gkt1206.
    3. Roos DS. Computational biology: bioinformatics–trying to swim in a sea of data. Science. 2001;291(5507):1260–1261.
    4. Fernández-Suárez XM, Galperin MY. The 2013 nucleic acids research database issue and the online molecular biology database collection. Nucleic Acids Res. 2013;41(D1):1–7.
    5. Kodama Y, Shumway M, Leinonen R. International nucleotide sequence database collaboration: the sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 2012;40(Database issue):D54–D56. doi:10.1093/nar/gkr854.