Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 Dec 12:5:ELIXIR-2841.
doi: 10.12688/f1000research.10221.1. eCollection 2016.

Integration of EGA secure data access into Galaxy

Affiliations

Integration of EGA secure data access into Galaxy

Youri Hoogstrate et al. F1000Res. .

Abstract

High-throughput molecular profiling techniques are routinely generating vast amounts of data for translational medicine studies. Secure access controlled systems are needed to manage, store, transfer and distribute these data due to its personally identifiable nature. The European Genome-phenome Archive (EGA) was created to facilitate access and management to long-term archival of bio-molecular data. Each data provider is responsible for ensuring a Data Access Committee is in place to grant access to data stored in the EGA. Moreover, the transfer of data during upload and download is encrypted. ELIXIR, a European research infrastructure for life-science data, initiated a project (2016 Human Data Implementation Study) to understand and document the ELIXIR requirements for secure management of controlled-access data. As part of this project, a full ecosystem was designed to connect archived raw experimental molecular profiling data with interpreted data and the computational workflows, using the CTMM Translational Research IT (CTMM-TraIT) infrastructure http://www.ctmm-trait.nl as an example. Here we present the first outcomes of this project, a framework to enable the download of EGA data to a Galaxy server in a secure way. Galaxy provides an intuitive user interface for molecular biologists and bioinformaticians to run and design data analysis workflows. More specifically, we developed a tool -- ega_download_streamer - that can download data securely from EGA into a Galaxy server, which can subsequently be further processed. This tool will allow a user within the browser to run an entire analysis containing sensitive data from EGA, and to make this analysis available for other researchers in a reproducible manner, as shown with a proof of concept study. The tool ega_download_streamer is available in the Galaxy tool shed: https://toolshed.g2.bx.psu.edu/view/yhoogstrate/ega_download_streamer.

Keywords: EGA; Galaxy; bioinformatics; data management; translational research; workflows.

PubMed Disclaimer

Conflict of interest statement

Competing interests: No competing interests were disclosed.

Figures

Figure 1.
Figure 1.. A flowchart of the designed ecosystem for the management and storage of data for clinical research data with a focus on security.
The clinical data of an experiment describes the clinical-pathological data, including tissue and patient information. Descriptors of the samples combined with these variables are stored in tranSMART. Molecular profiling data are derived from samples of patients: these samples are processed in the laboratory to obtain tissue derivatives, such as isolation of DNA, RNA and proteins, which are subsequently analysed by high throughput experimental techniques to obtain the raw molecular profiling data; the descriptions of the performed experiments are also stored in tranSMART. The actual raw data produced by the high throughput analysis are physically stored in repositories like EGA, while the interpreted data processed by extensive computational workflows, and references to the raw data are stored in tranSMART. The ability to reanalyse the raw data, is provided by Galaxy. Note that the work described here indicated by red arrows implements a data connection, allowing a user to retrieve raw data from EGA in Galaxy, and run subsequent workflows, constructed by tools in the Galaxy tool shed.
Figure 2.
Figure 2.. STAR-Fusion workflow in Galaxy.
The workflow firstly, obtains the raw data from EGA, to subsequently allow reanalysis of the data in a workflow of multiple components, to derive interpreted data. The raw forward and backward FASTQ sequencing reads are imported from EGA by ega_download_streamer; subsequently, the tool FASTQ Groomer does a consistency check of the data formats; then with Sickle, low quality bases (Q<30) are trimmed and reads clipped into less than 25 bases are discarded, only outputting the high-quality sequencing reads. Afterwards, these reads are aligned to the hg19 (GenBank Assembly ID GCA_000001405.1) reference genome in RNA STAR. Then STAR-Fusion is used for predicting the fusion genes, which also requires two reference files as auxiliary inputs. The output goes through two filters to only keep predictions having more than two split reads and more than two spanning reads.

References

    1. Bierkens M, van der Linden W, van Bochove K, et al. : tranSMART. J Clin Bioinforma. 2015;5(Suppl 1):S9 10.1186/2043-9113-5-S1-S9 - DOI
    1. Lappalainen I, Almeida-King J, Kumanduri V, et al. : The European Genome-phenome Archive of human data consented for biomedical research. Nat Genet. 2015;47(7):692–695. 10.1038/ng.3312 - DOI - PMC - PubMed
    1. Scheufele E, Aronzon D, Coopersmith R, et al. : tranSMART: An Open Source Knowledge Management and High Content Data Analytics Platform. AMIA Jt Summits Transl Sci Proc. 2014;2014:96–101. - PMC - PubMed
    1. Taylor J, Schenck I, Blankenberg D, et al. : Using galaxy to perform large-scale interactive data analyses. Curr Protoc Bioinformatics. 2007; Chapter 10:Unit 10.5. 10.1002/0471250953.bi1005s19 - DOI - PMC - PubMed
    1. Hillman-Jackson J, Clements D, Blankenberg D, et al. : Using Galaxy to perform large-scale interactive data analyses. Curr Protoc Bioinformatics. 2012; Chapter 10:Unit10.5. 10.1002/0471250953.bi1005s38 - DOI - PMC - PubMed

LinkOut - more resources