PLoS One. 2014 Jun 4;9(6):e98146.
doi: 10.1371/journal.pone.0098146. eCollection 2014.

CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce

Wei-Chun Chung et al. PLoS One.

Abstract

Background: Explosive growth of next-generation sequencing data has resulted in ultra-large-scale data sets and ensuing computational problems. Cloud computing provides an on-demand and scalable environment for large-scale data analysis. Using a MapReduce framework, data and workload can be distributed via a network to computers in the cloud to substantially reduce computational latency. Hadoop/MapReduce has been successfully adopted in bioinformatics for genome assembly, mapping reads to genomes, and finding single nucleotide polymorphisms. Major cloud providers offer Hadoop cloud services to their users. However, it remains technically challenging to deploy a Hadoop cloud for those who prefer to run MapReduce programs in a cluster without built-in Hadoop/MapReduce.
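
The way MapReduce splits a workload can be illustrated with a small, single-process sketch. It counts k-mers in sequencing reads using separate map, shuffle, and reduce steps. This is only an illustration of the programming model, not code from CloudDOE or Hadoop; the function names (map_read, shuffle, reduce_counts, count_kmers) and the k-mer counting task are assumptions chosen for the example. In a real Hadoop cluster the framework would run the map and reduce steps on different machines and perform the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def count_kmers(reads, k):
    """Run map, shuffle, and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical toy reads, for illustration only.
    print(count_kmers(["ACGTAC", "CGTACG"], 3))
```

Because each read is mapped independently and each k-mer is reduced independently, both steps can be spread across many machines, which is the property the abstract refers to when it describes distributing data and workload.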

Results: We present CloudDOE, a platform-independent software package implemented in Java. CloudDOE encapsulates technical details behind a user-friendly graphical interface, freeing scientists from complicated operational procedures. The interface guides users through deploying a Hadoop cloud within in-house computing environments and running applications specifically targeted at bioinformatics, including CloudBurst, CloudBrush, and CloudRS. CloudDOE may also be used on top of a public cloud. CloudDOE consists of three wizards: Deploy, Operate, and Extend. The Deploy wizard helps the system administrator deploy a Hadoop cloud. It installs Java runtime environment version 1.6 and Hadoop version 0.20.203, and starts the service automatically. The Operate wizard allows the user to run any MapReduce application on the dashboard list. To extend the dashboard list, the administrator may install a new MapReduce application using the Extend wizard.
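
To give a sense of what deploying a Hadoop cloud involves, the following sketch renders the kind of configuration files a Hadoop 0.20.x cluster needs from a master address and a list of worker addresses. It is not CloudDOE's implementation, which the abstract does not describe at this level. The property names fs.default.name, dfs.replication, and mapred.job.tracker and the conf/slaves file belong to Hadoop 0.20.x; the function names, the ports 9000 and 9001, the replication rule, and the IP addresses are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def site_xml(properties):
    """Render a Hadoop *-site.xml document from a dict of property names to values."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def render_cluster_conf(master_ip, worker_ips, replication=None):
    """Return a dict mapping config file paths to their text content.

    Ports 9000 (HDFS) and 9001 (JobTracker) are assumed conventional values,
    not settings taken from CloudDOE.
    """
    if not worker_ips:
        raise ValueError("at least one worker address is required")
    if replication is None:
        # Assumed rule: never ask for more replicas than there are workers.
        replication = min(3, len(worker_ips))
    return {
        "conf/core-site.xml": site_xml({"fs.default.name": f"hdfs://{master_ip}:9000"}),
        "conf/hdfs-site.xml": site_xml({"dfs.replication": replication}),
        "conf/mapred-site.xml": site_xml({"mapred.job.tracker": f"{master_ip}:9001"}),
        # One worker host per line.
        "conf/slaves": "\n".join(worker_ips) + "\n",
    }


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    for path, text in render_cluster_conf("10.0.0.1", ["10.0.0.2", "10.0.0.3"]).items():
        print(path)
        print(text)
```

Writing these files by hand on every node, together with installing Java and Hadoop, is the kind of manual procedure that a deployment wizard can replace.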

Conclusions: CloudDOE is a user-friendly tool for deploying a Hadoop cloud. Its smart wizards substantially reduce the complexity and costs of deployment, execution, enhancement, and management. Interested users may collaborate to improve the CloudDOE source code, incorporate more MapReduce bioinformatics tools, and support next-generation open source big data tools, e.g., Hadoop BigTop and Spark.

Availability: CloudDOE is distributed under Apache License 2.0 and is freely available at http://clouddoe.iis.sinica.edu.tw/.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Software solutions of CloudDOE.
A user can deploy a Hadoop Cloud, operate the supported bioinformatics MapReduce programs, and extend cloud functions through installing new tools.
Figure 2. Comparison of CloudDOE and traditional approaches.
CloudDOE encapsulates the complicated procedures of traditional approaches in user-friendly graphical interfaces. It reduces the number of steps the user must perform by nearly 50% compared with traditional approaches.
Figure 3. Screenshots of Deploy wizard.
(A) Brief instructions explain the system requirements and the procedures that Deploy wizard will perform. A user is prompted (B) to provide the connection information between the local PC and the Hadoop cloud and (C) to enter the Hadoop cloud settings, including IP addresses and a username/password. (D) Settings and configurations of the target cloud are generated automatically. The installation progress and logs can also be monitored in the wizard.
Figure 4. A structured XML configuration file and the generated wizard.
The configuration file contains a metadata section with general program information, a set of parameters (with their default values) needed to execute the program, and sections on log files and result download methods. CloudDOE loads a configuration file and generates the specific wizard required.
Figure 5. Screenshots of Operate wizard.
A user can (A) log in to their Hadoop cloud, (B) upload and manage input data, (C) configure program parameters, and thus submit and monitor an execution, and (D) download the results after execution is completed.
Figure 6. System architecture of CloudDOE.
The solid square represents a machine or a computing resource, and the gray solid square is the master of the Hadoop cloud. CloudDOE establishes Secure Shell (SSH) channels for communication and acquires local resources for operations.
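
The configuration-driven design described for Figure 4 can be sketched as follows: a program's metadata and default parameters are read from an XML file, user-supplied values override the defaults, and a command line is assembled from the result. The XML layout below is hypothetical, since the page does not reproduce CloudDOE's actual schema; the element and attribute names, the tool name, the jar name, the paths, and the "-name value" argument style are all assumptions. Only the general "hadoop jar <jar> ..." invocation form comes from Hadoop itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration file; this is NOT CloudDOE's real schema.
EXAMPLE_CONFIG = """
<program>
  <metadata>
    <name>ExampleTool</name>
    <jar>example-tool.jar</jar>
  </metadata>
  <parameters>
    <parameter name="input" default="/data/reads"/>
    <parameter name="output" default="/data/results"/>
    <parameter name="kmer" default="31"/>
  </parameters>
  <logs><log path="logs/example-tool.log"/></logs>
  <download method="hdfs-get"/>
</program>
"""


def load_program(xml_text):
    """Parse the metadata section and the parameter defaults."""
    root = ET.fromstring(xml_text)
    meta = {child.tag: (child.text or "").strip() for child in root.find("metadata")}
    defaults = {p.get("name"): p.get("default", "") for p in root.iter("parameter")}
    return meta, defaults


def build_command(xml_text, overrides=None):
    """Merge user overrides into the defaults and build an argument list."""
    meta, params = load_program(xml_text)
    overrides = overrides or {}
    unknown = set(overrides) - set(params)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    params.update(overrides)
    args = ["hadoop", "jar", meta["jar"]]
    for name, value in params.items():
        # Assumed argument style; real tools define their own flags.
        args += [f"-{name}", value]
    return args


if __name__ == "__main__":
    print(" ".join(build_command(EXAMPLE_CONFIG, {"kmer": "21"})))
```

Keeping the program description in a data file means a new tool can be added by writing a new configuration file, with no change to the code that reads it.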
