Fast nanopore sequencing data analysis with SLOW5

Hasindu Gamaarachchi¹ ², Hiruna Samarakoon³ ⁴, Sasha P Jenner³, James M Ferguson³, Timothy G Amos³, Jillian M Hammond³, Hassaan Saadat⁴, Martin A Smith⁵ ⁶, Sri Parameswaran⁴, Ira W Deveson⁷ ⁸

Affiliations

¹ Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia. hasindu@garvan.org.au.
² School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales, Australia. hasindu@garvan.org.au.
³ Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia.
⁴ School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales, Australia.
⁵ CHU Sainte-Justine Research Centre, Montreal, Quebec, Canada.
⁶ Department of Biochemistry and Molecular Medicine, Faculty of Medicine, University of Montreal, Montreal, Quebec, Canada.
⁷ Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, New South Wales, Australia. i.deveson@garvan.org.au.
⁸ St Vincent's Clinical School, Faculty of Medicine, University of New South Wales, Sydney, New South Wales, Australia. i.deveson@garvan.org.au.

PMID: 34980914
PMCID: PMC9287168
DOI: 10.1038/s41587-021-01147-4

Hasindu Gamaarachchi et al. Nat Biotechnol. 2022 Jul;40(7):1026-1029. doi: 10.1038/s41587-021-01147-4. Epub 2022 Jan 3.

Abstract

Nanopore sequencing depends on the FAST5 file format, which does not allow efficient parallel analysis. Here we introduce SLOW5, an alternative format engineered for efficient parallelization and acceleration of nanopore data analysis. Using the example of DNA methylation profiling of a human genome, analysis runtime is reduced from more than two weeks to approximately 10.5 h on a typical high-performance computer. SLOW5 is approximately 25% smaller than FAST5 and delivers consistent improvements on different computer architectures.
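
As a rough check on the figures quoted in the abstract, the sketch below converts them into a fold-speedup and a relative file size. It assumes that "more than two weeks" means at least 14 days, so the speedup is a lower bound; the constant and function names are illustrative and do not come from the paper.

```python
# Figures taken from the abstract. "More than two weeks" is treated as a
# lower bound of exactly 14 days (an assumption made for illustration).
FAST5_RUNTIME_H = 14 * 24   # at least 336 h
SLOW5_RUNTIME_H = 10.5      # approximately 10.5 h
SIZE_REDUCTION = 0.25       # SLOW5 is approximately 25% smaller


def fold_speedup(before_h, after_h):
    """Return how many times shorter the new runtime is."""
    return before_h / after_h


def relative_size(reduction):
    """Return the new size as a fraction of the original size."""
    return 1.0 - reduction


min_speedup = fold_speedup(FAST5_RUNTIME_H, SLOW5_RUNTIME_H)
size_ratio = relative_size(SIZE_REDUCTION)

print(f"speedup: at least {min_speedup:.0f}-fold")
print(f"SLOW5 size relative to FAST5: about {size_ratio:.0%}")
```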

Conflict of interest statement

I.W.D. manages a fee-for-service nanopore sequencing facility at the Garvan Institute of Medical Research, which is a customer of Oxford Nanopore Technologies but has no further financial relationship. H.G., H. Samarakoon, J.M.F., J.M.H. and M.A.S. have received travel and accommodation expenses to speak at Oxford Nanopore Technologies conferences. The authors declare no other competing interests.

Figures

**Fig. 1. SLOW5 format enables efficient parallel analysis of nanopore signal data.**
a, Schematic diagram illustrating the typical life cycle of nanopore data. Raw current signal data are generated on an ONT sequencing device and written in FAST5 format. Raw data are base-called into sequence reads (FASTQ/FASTA format). Downstream analysis involving both base-called reads and raw signal data is used to identify genetic variants, epigenetic modifications (for example, 5mC) and other features. b, Schematic diagram illustrating the bottleneck in ONT signal data analysis. FAST5 file reading requires the HDF5 software library, which serializes file access requests by multiple CPU threads, preventing efficient parallel analysis. SLOW5 files are not dependent on the HDF5 library and are amenable to efficient parallel analysis. A more detailed mechanistic diagram is provided in Extended Data Fig. 1e. c, Bar chart shows the relative file sizes (bytes per base) of a typical human genome sequencing dataset in ASCII SLOW5 (purple), binary BLOW5 format with no compression (orange), zlib compression (red) and vbz compression (pink), compared to FAST5 format with zlib compression (blue) and vbz compression (teal). d, Dot plots show the rate of file access (reads per second) for the above file types, as a function of CPU threads used on two HPC systems: HPC-HDD (left) or HPC-Lustre (right). e, Dot plots show the rate of execution (reads per second) for DNA methylation calling for the same file types on HPC-HDD (left) and HPC-Lustre (right). For the instance of maximum CPU threads, bar charts show the time consumed by individual workflow components: FAST5/SLOW5 data access (pink), FASTA data access (teal), BAM data access (orange) and data processing (navy). f, Bar charts show the time consumed by data access (pink) and data processing (navy) during DNA methylation calling on a range of different computer systems. Full specifications are provided in Supplementary Table 2.

**Extended Data Fig. 1. Inefficient parallel access is a major bottleneck in analysis of FAST5 files.**
**(a)** Bar chart shows the time consumed by individual components of a Nanopolish DNA methylation calling job with signal data input in FAST5 format: FAST5 data access (pink), FASTA data access (teal), BAM data access (orange) and data processing (navy). To assess the impact of multi-threading, the analysis was run with various numbers of CPU threads on the HPC-HDD system (see Supplementary Table 2). The analysis was run on a downsampled human genome sequencing dataset of 500,000 reads (see Supplementary Table 1). **(b)** Dot plots show the rate of file access and processing (reads / second) during the DNA methylation calling job above, as a function of CPU threads used. (c,d) Bar charts show the proportional CPU utilisation **(c)** and total core hours **(d)** during the DNA methylation calling jobs above. The definition of core-hours is provided in the Methods section. **(e)** The upper schematic illustrates the architecture of a job with multi-threaded synchronous file access (I/O). The lower schematic illustrates the bottleneck created by the HDF5 library that is required to read FAST5 files. The HDF5 library serialises I/O requests, making multi-threaded analysis highly inefficient and causing the observed decline in CPU utilisation with increasing numbers of CPU threads. **(f)** Schematic illustrates the architecture of a multi-processing approach that was implemented to circumvent this limitation in the HDF5 library. The multi-processing approach is viable but requires challenging software engineering and is not a generalisable, long-term solution.
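
The bottleneck described in panel (e) can be illustrated with a toy cost model. This is a minimal sketch, not the authors' analysis: it assumes each batch is fetched and then processed (synchronous I/O), that serialized I/O requests are served strictly one at a time, and that processing scales perfectly with threads. The per-read costs (1 ms I/O, 4 ms processing) and all function names are hypothetical.

```python
def serialized_io_time(n_reads, io_ms, proc_ms, threads):
    """Wall time (ms) when I/O is served one request at a time
    (as behind a library-wide lock) and only processing is parallel."""
    return n_reads * io_ms + n_reads * proc_ms / threads


def parallel_io_time(n_reads, io_ms, proc_ms, threads):
    """Wall time (ms) when both I/O and processing are spread over threads."""
    return n_reads * (io_ms + proc_ms) / threads


def cpu_utilisation(n_reads, proc_ms, threads, wall_ms):
    """Fraction of the available thread-time that is spent processing."""
    return (n_reads * proc_ms) / (threads * wall_ms)


# Hypothetical workload: 1,000 reads, 1 ms I/O and 4 ms processing per read.
N_READS, IO_MS, PROC_MS = 1000, 1, 4

for threads in (1, 2, 4, 8, 16, 32):
    t_ser = serialized_io_time(N_READS, IO_MS, PROC_MS, threads)
    t_par = parallel_io_time(N_READS, IO_MS, PROC_MS, threads)
    u_ser = cpu_utilisation(N_READS, PROC_MS, threads, t_ser)
    u_par = cpu_utilisation(N_READS, PROC_MS, threads, t_par)
    print(f"{threads:>2} threads: serialized {t_ser:7.1f} ms (util {u_ser:.2f}), "
          f"parallel {t_par:7.1f} ms (util {u_par:.2f})")
```

In this model the serialized wall time can never drop below the total I/O time, so utilisation falls as threads are added, which is the qualitative pattern the caption describes. When I/O is also parallel, utilisation stays constant.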

**Extended Data Fig. 2. Impact of read length on file sizes for FAST5 vs BLOW5 files.**
(a,b) Dot plots show relative file sizes (bytes / base) of various datasets (see Supplementary Table 3) as a function of mean read length (shown on a log2 scale). File sizes are shown separately for FAST5-zlib vs BLOW5-zlib **(a)** and FAST5-vbz vs BLOW5-vbz **(b)** formats. File sizes are highly variable among different FAST5 files and largely stable among BLOW5 files. Libraries that have the shortest read lengths exhibit the largest space-savings, regardless of compression type.

**Extended Data Fig. 3. Performance metrics for DNA methylation profiling with FAST5 / SLOW5 files.**
(a,b) Dot plots show the rate of data processing (reads / second) during DNA methylation calling with ASCII SLOW5 (purple), binary BLOW5 (orange), BLOW5-zlib (red) and FAST5-zlib (blue) files as a function of CPU threads. Analysis was performed on two HPC architectures: HPC-HDD **(a)** or HPC-Lustre (b; see Supplementary Table 2). (c,d) Dot plots show the ratio of data-processing time relative to total execution time for the jobs above. (e,f) Bar charts show the proportional CPU utilisation **(e)** and total core hours **(f)** during the DNA methylation calling with BLOW5-zlib on the HPC-HDD system. The definition of core-hours is provided in the Methods section.

**Extended Data Fig. 4. FAST5 to SLOW5 data conversion performance.**
**(a)** Dot plot shows the time taken to convert a downsampled human genome sequencing dataset of 500,000 reads (see Supplementary Table 1) from FAST5 format to ASCII SLOW5 (purple), binary BLOW5 (orange) and compressed BLOW5-zlib (red) formats as a function of CPU threads used on the HPC-HDD system (see Supplementary Table 2). Parallel file conversion was achieved using a multi-processing approach described in the Methods section. **(b)** Dot plot shows the time taken to merge the individual files from **(a)** into a single SLOW5/BLOW5 file. **(c)** Curves show the progress of data generation (purple) and FAST5 to BLOW5-zlib conversion (orange) during a sequencing run on an ONT PromethION device with live conversion enabled. All reads are converted within minutes of becoming available, and the entire dataset is converted to BLOW5-zlib format by the time the sequencing run completes.
