Review

Genome Biol. 2020 Feb 7;21(1):30. doi: 10.1186/s13059-020-1935-5.

Opportunities and challenges in long-read sequencing data analysis

Shanika L Amarasinghe^{1,2}, Shian Su^{1,2}, Xueyi Dong^{1,2}, Luke Zappia^{3,4}, Matthew E Ritchie^{1,2,5}, Quentin Gouil^{6,7}

Affiliations

¹ Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, 3052, Australia.
² Department of Medical Biology, The University of Melbourne, Parkville, 3010, Australia.
³ Bioinformatics, Murdoch Children's Research Institute, Parkville, 3052, Australia.
⁴ School of Biosciences, Faculty of Science, The University of Melbourne, Parkville, 3010, Australia.
⁵ School of Mathematics and Statistics, The University of Melbourne, Parkville, 3010, Australia.
⁶ Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, 3052, Australia. gouil.q@wehi.edu.au.
⁷ Department of Medical Biology, The University of Melbourne, Parkville, 3010, Australia. gouil.q@wehi.edu.au.

PMID: 32033565
PMCID: PMC7006217
DOI: 10.1186/s13059-020-1935-5

Shanika L Amarasinghe et al. Genome Biol. 2020.


Abstract

Long-read technologies are overcoming early limitations in accuracy and throughput, broadening their application domains in genomics. Dedicated analysis tools that take into account the characteristics of long-read data are thus required, but the fast pace of development of such tools can be overwhelming. To assist in the design and analysis of long-read sequencing projects, we review the current landscape of available tools and present an online interactive database, long-read-tools.org, to facilitate their browsing. We further focus on the principles of error correction, base modification detection, and long-read transcriptomics analysis and highlight the challenges that remain.

Keywords: Data analysis; Long-read sequencing; Oxford Nanopore; PacBio.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Overview of long-read analysis tools and pipelines. (a) Release of tools identified from various sources and milestones of long-read sequencing. (b) Functional categories. (c) Typical long-read analysis pipelines for SMRT and nanopore data. Six main stages are identified through the presented workflow (i.e. basecalling, quality control, read error correction, assembly/alignment, assembly refinement, and downstream analyses). The green-coloured boxes represent processes common to both short-read and long-read analyses. The orange-coloured boxes represent the processes unique to long-read analyses. Unfilled boxes represent optional steps. Commonly used tools for each step in long-read analysis are within brackets. Italics signify tools developed by either PacBio or ONT, and non-italics signify tools developed by external parties. Arrows represent the direction of the workflow.

**Fig. 2**
Paradigms of error correction (a) and polishing (b). Errors in long reads and assembly are denoted by red crosses. Non-hybrid methods only require long reads, while hybrid methods additionally require accurate short reads (purple)

**Fig. 3**
Methods to detect base modifications in long-read sequencing. Base modifications can be inferred from their effect on the current intensity (nanopore) and inter-pulse duration (IPD, SMRT). Strategies to call base modifications in nanopore sequencing and the corresponding tools are further depicted

**Fig. 4**
Types of transcriptomic analyses and their steps. The choice of sequencing protocol amongst the six available workflows affects the type, characteristics, and quantity of data generated. Only direct RNA sequencing allows epitranscriptomic studies, but SMRT direct RNA sequencing is a custom technique that is not fully supported. The remaining non-exclusive applications are isoform detection, quantification, and differential analysis. The dashed lines in arrows represent upstream processes to transcriptomics
