PLoS One. 2018 Jul 31;13(7):e0201483.

doi: 10.1371/journal.pone.0201483. eCollection 2018.

HSRA: Hadoop-based spliced read aligner for RNA sequencing data

Roberto R Expósito¹, Jorge González-Domínguez¹, Juan Touriño¹

Affiliation

¹ Computer Architecture Group, Universidade da Coruña, Campus de Elviña, 15071 A Coruña, Spain.

PMID: 30063721
PMCID: PMC6067734
DOI: 10.1371/journal.pone.0201483

Abstract

Nowadays, the analysis of transcriptome sequencing (RNA-seq) data has become the standard method for quantifying the levels of gene expression. In RNA-seq experiments, the mapping of short reads to a reference genome or transcriptome is a crucial step that remains one of the most time-consuming. With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data introduce significant challenges in terms of storage, processing and downstream analysis. As cost and throughput continue to improve, there is a growing need for new software solutions that minimize the impact of increasing data volume on RNA read alignment. In this work we introduce HSRA, a Big Data tool that takes advantage of the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data (HISAT2) to distributed memory systems such as multi-core clusters or cloud platforms. HSRA has been built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. The design of HSRA has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools. Source code in Java as well as a user's guide are publicly available for download at http://hsra.dec.udc.es.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Overall workflow of the MapReduce paradigm.**
This workflow shows several map and reduce tasks working in parallel over different input splits.

**Fig 2. Overview of the HSRA workflow for single-end alignment.**
This mode executes a map-only job that uses the HSP library to parse the reads directly from HDFS. Native pipes provide efficient inter-process communication (IPC) between Hadoop and HISAT2.

**Fig 3. Overview of the HSRA workflow for paired-end alignment using the reduce-side join approach.**
This approach executes a MapReduce job using the single-end support provided by the HSP library, where a reduce-side join is needed to obtain the paired-end reads. Native pipes provide efficient inter-process communication (IPC) between Hadoop and HISAT2.

**Fig 4. Overview of the HSRA workflow for paired-end alignment using the map-side join approach.**
This approach avoids any data shuffling by executing a map-only job, thanks to the specific support for paired-end datasets provided by the HSP library. Native pipes provide efficient inter-process communication (IPC) between Hadoop and HISAT2.

**Fig 5. Experimental results for single-end alignment.**
Runtime results obtained by *HSRA* when varying the number of nodes using the (a) SRR1 and (b) DRR1 datasets.

**Fig 6. Experimental results for paired-end alignment (SRR1 dataset).**
Runtime results obtained by *HSRA* when varying the number of nodes using the (a) reduce-side and (b) map-side join approaches.

**Fig 7. Experimental results for paired-end alignment (DRR1 dataset).**
Runtime results obtained by *HSRA* when varying the number of nodes using the (a) reduce-side and (b) map-side join approaches.
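
The difference between the two paired-end approaches in Figs 3 and 4 can be illustrated with a minimal Python sketch. This is not HSRA's code (HSRA is written in Java on top of Hadoop), and it calls neither the HSP library nor HISAT2. All function names are hypothetical, the in-memory dictionary merely stands in for the shuffle phase, and the sketch assumes well-formed four-line FASTQ records whose mate identifiers differ only by a trailing `/1` or `/2`.

```python
from collections import defaultdict


def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines.

    Assumption: four lines per record, mates tagged with a '/1' or '/2' suffix.
    """
    it = iter(lines)
    # zip over the same iterator consumes four lines per record.
    for header, seq, _plus, qual in zip(it, it, it, it):
        read_id = header.strip()[1:].split()[0]
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        yield read_id, seq.strip(), qual.strip()


def reduce_side_join(left_records, right_records):
    """Pair mates by grouping on read ID, as a reduce phase would.

    Every record is emitted as (key, value) and regrouped by key, which
    stands in for the shuffle; input order does not matter.
    """
    groups = defaultdict(dict)
    for mate, records in ((1, left_records), (2, right_records)):
        for read_id, seq, qual in records:
            groups[read_id][mate] = (seq, qual)
    return [
        (read_id, mates[1], mates[2])
        for read_id, mates in sorted(groups.items())
        if 1 in mates and 2 in mates
    ]


def map_side_join(left_records, right_records):
    """Pair mates by reading both inputs in lockstep, with no regrouping.

    Assumption: both inputs list the reads in the same order.
    """
    pairs = []
    for (id1, s1, q1), (id2, s2, q2) in zip(left_records, right_records):
        if id1 != id2:
            raise ValueError(f"mate mismatch: {id1} vs {id2}")
        pairs.append((id1, (s1, q1), (s2, q2)))
    return pairs
```

The reduce-side variant tolerates inputs in any order but has to move every read through the grouping step. The map-side variant skips that step entirely and relies on the two inputs being aligned record by record.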

Cited by

Integrated Genome and Transcriptome Sequencing to Solve a Neuromuscular Puzzle: Miyoshi Muscular Dystrophy and Early Onset Primary Dystonia in Siblings of the Same Family.
Zhu F, Zhang F, Hu L, Liu H, Li Y. Front Genet. 2021 Jul 2;12:672906. doi: 10.3389/fgene.2021.672906. eCollection 2021. PMID: 34276779. Free PMC article.
BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. Front Big Data. 2022 Jan 18;4:727216. doi: 10.3389/fdata.2021.727216. eCollection 2021. PMID: 35118375. Free PMC article.
SparkEC: speeding up alignment-based DNA error correction tools.
Expósito RR, Martínez-Sánchez M, Touriño J. BMC Bioinformatics. 2022 Nov 7;23(1):464. doi: 10.1186/s12859-022-05013-1. PMID: 36344928. Free PMC article.
Cloud accelerated alignment and assembly of full-length single-cell RNA-seq data using Falco.
Yang A, Kishore A, Phipps B, Ho JWK. BMC Genomics. 2019 Dec 30;20(Suppl 10):927. doi: 10.1186/s12864-019-6341-6. PMID: 31888474. Free PMC article.
