PLoS One. 2021 Aug 3;16(8):e0255260.

doi: 10.1371/journal.pone.0255260. eCollection 2021.

Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment

Altti Ilari Maarala¹, Ossi Arasalo², Daniel Valenzuela¹, Veli Mäkinen¹,³, Keijo Heljanko¹,³

Affiliations

¹ Department of Computer Science, University of Helsinki, Espoo, Finland.
² Department of Computer Science, Aalto University, Espoo, Finland.
³ Helsinki Institute for Information Technology, Espoo, Finland.

PMID: 34343181
PMCID: PMC8330939
DOI: 10.1371/journal.pone.0255260

Altti Ilari Maarala et al. PLoS One. 2021.

Abstract

Computational pan-genomics uses information from multiple individual genomes in large-scale comparative analysis. Genetic variation between cases and controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, making genomic data compression and indexing methods very important. Although current compression and indexing methods for repetitive sequences are space-efficient, the deployed compression methods are often sequential, computationally time-consuming, and do not provide efficient sequence alignment performance on vast collections of genomes such as pan-genomes. To perform rapid analytics on ever-growing genomics data, data compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we focus on the efficient construction of a compressed index for pan-genomes. A compressed hybrid-index enables fast sequence alignment to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets that enables pan-genome-based sequence search and read alignment. We show the scalability of our tool, DHPGIndex, by running experiments in a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments were performed with both human and bacterial genomes. DHPGIndex built a BLAST index for a human pan-genome (n = 250) with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with a 157:1 CR in 397 minutes. For a human pan-genome with n = 1,000, the BLAST index was built in 1,520 minutes with a 532:1 CR and the Bowtie2 index in 1,938 minutes with a 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node. Compressing the GenBank database (n = 13,375,031; 488 GB) into a BLAST index resulted in a CR of 62:1 in 575 minutes. Aligning 189,864 CRISPR-Cas9 gRNA target sequences (23 MB in total) with BLAST against the compressed index of the human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB of mixed bacterial sequences (n = 599) were aligned with BLAST against the compressed index of the 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB of mixed sequences (n = 4,167) were aligned with BLAST against the compressed index of an 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.
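
The compression stage is named in the caption of Fig 1 as Relative Lempel-Ziv (RLZ) compression. The idea behind RLZ can be illustrated with a minimal, non-distributed sketch: a target sequence is encoded as a list of (position, length) phrases that point into a reference sequence, with a literal character wherever no match exists. The function names `rlz_compress` and `rlz_decompress` and the naive quadratic match search are assumptions made for illustration only; this is not the DHPGIndex implementation, which the abstract describes as running on Apache Spark.

```python
from typing import List, Tuple, Union

# A phrase is either a (position, length) copy from the reference
# or a single literal character that does not occur in the reference.
Phrase = Union[Tuple[int, int], str]


def rlz_compress(reference: str, target: str) -> List[Phrase]:
    """Greedy RLZ parse of `target` relative to `reference` (naive search)."""
    phrases: List[Phrase] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every start position in the reference and keep the longest match.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            # No match at all: store the character itself.
            phrases.append(target[i])
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decompress(reference: str, phrases: List[Phrase]) -> str:
    """Rebuild the target by copying each phrase out of the reference."""
    parts = []
    for phrase in phrases:
        if isinstance(phrase, str):
            parts.append(phrase)
        else:
            pos, length = phrase
            parts.append(reference[pos:pos + length])
    return "".join(parts)


if __name__ == "__main__":
    ref = "ACGTACGTTAGC"
    tgt = "ACGTTAGCACGA"
    parsed = rlz_compress(ref, tgt)
    print(parsed)
    print(rlz_decompress(ref, parsed) == tgt)
```

Because genomes of the same species are highly similar, most of a target genome parses into a few long phrases, which is what makes reference-relative compression attractive for pan-genomes. A practical implementation would replace the naive search with an index over the reference, such as a suffix array.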

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Distributed Relative Lempel-Ziv compression.**

**Fig 2. Distributed compression pipeline for hybrid-index.**

**Fig 3. Distributed hybrid-indexing with BLAST.**

**Fig 4. Sequence alignment with hybrid-index.**

**Fig 5. Summary of compressing and indexing complete human pan-genomes with distributed and non-distributed methods.**

Cited by

Framing Apache Spark in life sciences.
Manconi A, Gnocchi M, Milanesi L, Marullo O, Armano G. Heliyon. 2023 Feb 9;9(2):e13368. doi: 10.1016/j.heliyon.2023.e13368. PMID: 36852030. Review.
