Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms

Umberto Ferraro Petrillo¹, Gianluca Roscigno², Giuseppe Cattaneo², Raffaele Giancarlo³

Affiliations

¹ Dipartimento di Scienze Statistiche, Università di Roma - La Sapienza, Rome 00185, Italy.
² Dipartimento di Informatica, Università di Salerno, Fisciano, SA 84084, Italy.
³ Dipartimento di Matematica ed Informatica, Università di Palermo, Palermo 90133, Italy.

PMID: 29342232
DOI: 10.1093/bioinformatics/bty018

Umberto Ferraro Petrillo et al. Bioinformatics. 2018 Jun 1;34(11):1826-1833.

Abstract

Motivation: Information-theoretic and compositional/linguistic analyses of genomes play a central role in bioinformatics, all the more so now that the associated methodologies are proving valuable for epigenomic and meta-genomic studies as well. The kernel of those methods is the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}^k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data now available in applications demands parallel and distributed computing. Indeed, algorithms of this type have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to computing informational and linguistic indices concurrently on sets of genomes.
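As an illustration of the k-mer statistics described above, the following is a minimal single-machine sketch in Python. The function name `count_kmers` and the choice to skip windows containing symbols outside {A,C,G,T} are assumptions made for this example; they are not taken from the KCH software.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count how many times each k-mer over {A,C,G,T} occurs in one DNA sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    counts = Counter()
    # Slide a window of length k over the sequence, one position at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption of this sketch: windows with ambiguous symbols (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


counts = count_kmers("ACGTACGT", 3)
```

For the toy input above, the six windows are ACG, CGT, GTA, TAC, ACG and CGT, so ACG and CGT are each counted twice. The sketch holds all counts in memory, which is exactly what stops scaling on large collections.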

Results: Following the approach, well established in many disciplines and increasingly successful in bioinformatics, of using MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform informational and linguistic analysis of large collections of genomic sequences concurrently on a Hadoop cluster. Our benchmarking of KCH indicates that it is quite effective and versatile. It is also competitive with the parallel and distributed algorithms highly specialized for k-mer statistics collection in genome assembly problems. In conclusion, KCH is a much-needed addition to the growing number of algorithms and tools that use MapReduce for core bioinformatics applications.
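The general idea of counting k-mers for several sequences at once in a map/shuffle/reduce pattern can be sketched in plain Python, without Hadoop. This is not the KCH implementation: the function names, the `(sequence id, k-mer)` key layout and the use of empirical Shannon entropy as the example informational index are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def map_phase(seq_id, sequence, k):
    """Mapper: emit ((seq_id, kmer), 1) for every window of length k."""
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        yield (seq_id, seq[i:i + k]), 1


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: sum the partial counts for each (seq_id, kmer) key."""
    return {key: sum(values) for key, values in groups.items()}


def entropy_per_sequence(counts):
    """Empirical Shannon entropy (bits) of the k-mer distribution of each sequence."""
    per_seq = defaultdict(dict)
    for (seq_id, kmer), c in counts.items():
        per_seq[seq_id][kmer] = c
    result = {}
    for seq_id, kmer_counts in per_seq.items():
        total = sum(kmer_counts.values())
        result[seq_id] = -sum(
            (c / total) * math.log2(c / total) for c in kmer_counts.values()
        )
    return result


# Hypothetical toy collection: two short sequences analysed in the same pass.
collection = {"s1": "AAAA", "s2": "ACGT"}
pairs = [p for sid, s in collection.items() for p in map_phase(sid, s, 2)]
counts = reduce_phase(shuffle(pairs))
entropies = entropy_per_sequence(counts)
```

Keying each record by sequence identifier as well as k-mer is what lets one job serve a whole collection: the per-sequence counts stay separate, so any index defined on a single sequence's k-mer distribution can be derived afterwards.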

Availability and implementation: The software, including instructions for running it on Amazon AWS, and the datasets are available at http://www.di-srv.unisa.it/KCH.

Contact: umberto.ferraro@uniroma1.it.

Supplementary information: Supplementary data are available at Bioinformatics online.
