Nucleic Acids Res. 2018 Jul 2;46(W1):W537-W544.
doi: 10.1093/nar/gky379.

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update

Enis Afgan et al. Nucleic Acids Res.

Abstract

Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.

Figures

Figure 1.
Circular barplot illustrating recent growth of the Galaxy Project across several independent facets. In the past two years, usage of the main public Galaxy server has increased 60%, the number of tools and supported versions has increased 53%, and the amount of data analyzed on the main server has increased 72%. A growing number of public instances (18% increase) and cloud-based Galaxy instances (38% increase) provide researchers with a wider range of options for scalability and application domains. Additionally, more developers (45% increase with 63% more commits to the codebase) contributed to the Galaxy framework and software ecosystem. Question and answer activity on the Galaxy Biostars forum increased 68%.
Figure 2.
Schematic of servers and services in use at Galaxy Main. (A) A global overview of Galaxy Main resources. When users interact with usegalaxy.org, their browser connects to one of two frontends (shown as web-01/02), with file uploads being handled by web-03/04; each of these web servers connects to a database server and mounts a set of shared distributed file systems. Web-03/04 also prepare and schedule jobs, using Slurm directly to manage compute tasks on fifteen dedicated compute nodes, which also directly mount the shared distributed file systems. Slurm and Pulsar (https://github.com/galaxyproject/pulsar) are used for task management and dataset file staging, respectively, on the Jetstream cloud at Indiana University (IU) and the Texas Advanced Computing Center (TACC). Communication between Galaxy and Pulsar is handled using the RabbitMQ (https://www.rabbitmq.com/) message broker. Additional jobs are sent to the supercomputer systems Bridges at Pittsburgh Supercomputing Center (PSC) and Stampede at TACC using Pulsar. These various compute resources are chosen based upon tool and job characteristics. See, e.g., https://github.com/galaxyproject/usegalaxy-playbook/wiki/Infrastructure for specific and up-to-date information. (B) Multiple frontend servers provide Galaxy content to users by utilizing round-robin load balancing. Nginx (https://nginx.org/) is used to serve HTTP content from the Galaxy uWSGI web application. Individual software processes are monitored and controlled using Supervisor (http://supervisord.org/). Each of these frontend servers connects to a PostgreSQL (https://www.postgresql.org/) database server. (C) The layout of data schemes used by Galaxy Main is optimized for application speed, concurrent access, and versioned content. Each Galaxy frontend server utilizes a combination of shared distributed file systems (CVMFS for versioned semi-static content and TACC’s Corral filesystem via NFS for mutable content) along with server-specific local file systems. (D) The CernVM File System (CVMFS) infrastructure hosted by the Galaxy Project is used at Main and is available to any other Galaxy instance. Stratum 0 contains the single-source modifiable data repositories. File content is served using the Apache HTTP server (https://httpd.apache.org/). To enable redundancy and scaling to a large number of clients, Stratum 1 replica servers are hosted at multiple locations and utilize Squid (http://www.squid-cache.org/) for data caching. Additional replica servers can also be hosted by community members. Individual clients (Galaxy instances and compute nodes) access data content from Stratum 1 servers using a Filesystem in Userspace (FUSE) mount.
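The round-robin load balancing mentioned in panel (B) can be illustrated with a minimal Python sketch. This is not how Galaxy Main implements it (the caption does not give configuration details); the frontend names come from the caption, while `make_round_robin` and `pick` are hypothetical names introduced only for illustration.

```python
from itertools import cycle

# Frontend names as given in the caption; the balancer itself is illustrative.
FRONTENDS = ["web-01", "web-02"]


def make_round_robin(servers):
    """Return a function that hands out servers in strict rotation."""
    rotation = cycle(servers)

    def next_server():
        return next(rotation)

    return next_server


# Each incoming request is assigned to the next frontend in turn.
pick = make_round_robin(FRONTENDS)
assignments = [pick() for _ in range(5)]
print(assignments)
```

Strict rotation spreads requests evenly when they cost about the same; it takes no account of the current load on each server.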
Figure 3.
Enabling automated selection and use of specialized national cyberinfrastructure compute resources from Galaxy Main enhances the user experience. By using Bridges and Stampede, it is now possible to run jobs that are up to an order of magnitude larger than before. New types of jobs, such as interactive environments (see Advances in tools section), that require execution isolation due to security concerns are enabled by the virtualization that the Jetstream cloud provides. The added processing capacity also makes it possible to run more jobs concurrently.
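As a rough illustration of choosing a compute resource from job characteristics, the following Python sketch routes a job to one of three destinations. The destination labels, the `Job` fields, and the memory threshold are all assumptions made for this example; the actual selection rules used at Galaxy Main are not described here.

```python
from dataclasses import dataclass


@dataclass
class Job:
    tool_id: str
    memory_gb: float
    needs_isolation: bool = False


# Hypothetical cutoff above which a job is treated as "large".
LOCAL_MEMORY_LIMIT_GB = 64.0


def choose_destination(job: Job) -> str:
    """Pick a destination label from simple job characteristics."""
    if job.needs_isolation:
        # Jobs needing execution isolation go to virtualized cloud resources.
        return "jetstream"
    if job.memory_gb > LOCAL_MEMORY_LIMIT_GB:
        # Large jobs go to a supercomputer system.
        return "bridges"
    # Everything else stays on the dedicated Slurm-managed nodes.
    return "local_slurm"


print(choose_destination(Job("example_tool", memory_gb=8)))
```

Checking isolation first reflects the caption's point that such jobs are a security concern regardless of size.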
