CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce
- PMID: 24897343
- PMCID: PMC4045712
- DOI: 10.1371/journal.pone.0098146
Abstract
Background: Explosive growth of next-generation sequencing data has resulted in ultra-large-scale data sets and ensuing computational problems. Cloud computing provides an on-demand and scalable environment for large-scale data analysis. Using a MapReduce framework, data and workload can be distributed via a network to computers in the cloud to substantially reduce computational latency. Hadoop/MapReduce has been successfully adopted in bioinformatics for genome assembly, mapping reads to genomes, and finding single nucleotide polymorphisms. Major cloud providers offer Hadoop cloud services to their users. However, it remains technically challenging to deploy a Hadoop cloud for those who prefer to run MapReduce programs in a cluster without built-in Hadoop/MapReduce.
Results: We present CloudDOE, a platform-independent software package implemented in Java. CloudDOE encapsulates technical details behind a user-friendly graphical interface, thus liberating scientists from having to perform complicated operational procedures. Users are guided through the user interface to deploy a Hadoop cloud within in-house computing environments and to run applications specifically targeted for bioinformatics, including CloudBurst, CloudBrush, and CloudRS. One may also use CloudDOE on top of a public cloud. CloudDOE consists of three wizards: Deploy, Operate, and Extend. The Deploy wizard helps the system administrator deploy a Hadoop cloud: it installs Java Runtime Environment version 1.6 and Hadoop version 0.20.203, and starts the service automatically. The Operate wizard lets the user run a MapReduce application listed on the dashboard. To extend the dashboard list, the administrator may install a new MapReduce application using the Extend wizard.
Conclusions: CloudDOE is a user-friendly tool for deploying a Hadoop cloud. Its smart wizards substantially reduce the complexity and costs of deployment, execution, enhancement, and management. Interested users may collaborate to improve the source code of CloudDOE, to incorporate more MapReduce bioinformatics tools, and to support next-generation big data open source tools, e.g., Hadoop BigTop and Spark.
Availability: CloudDOE is distributed under Apache License 2.0 and is freely available at http://clouddoe.iis.sinica.edu.tw/.
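Illustrative sketch (not part of the published abstract): the Background describes how a MapReduce framework distributes data and workload across computers. The minimal Java example below mimics that pattern on a single machine by counting k-mers in sequencing reads. It does not use the Hadoop API and is not taken from CloudDOE or its bundled tools; the class name KmerCountSketch and its methods are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the map/reduce pattern.
// A real Hadoop job would run the map step on many nodes in parallel.
class KmerCountSketch {

    // Map step: emit a (k-mer, 1) pair for every length-k window of one read.
    static List<Map.Entry<String, Integer>> map(String read, int k) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int i = 0; i + k <= read.length(); i++) {
            pairs.add(Map.entry(read.substring(i, i + k), 1));
        }
        return pairs;
    }

    // Shuffle and reduce step: group the emitted pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Driver: map every read independently, then reduce all emitted pairs.
    static Map<String, Integer> countKmers(List<String> reads, int k) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String read : reads) {
            emitted.addAll(map(read, k));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        // Prints {ACG=2, CGT=2, GTA=2, TAC=2}
        System.out.println(countKmers(List.of("ACGTAC", "GTACGT"), 3));
    }
}
```

Because each read is mapped independently of the others, the map step is the part a framework can spread over many machines; only the grouping and summing in the reduce step needs the pairs for a given key brought together.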