NanoReviser: An Error-Correction Tool for Nanopore Sequencing Based on a Deep Learning Algorithm

Luotong Wang¹, Li Qu¹,², Longshu Yang³, Yiying Wang¹, Huaiqiu Zhu¹,²,³

Affiliations

¹ State Key Laboratory for Turbulence and Complex Systems, Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, China.
² Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, United States.
³ Center for Quantitative Biology, Peking University, Beijing, China.

PMID: 32903372
PMCID: PMC7434944
DOI: 10.3389/fgene.2020.00900

Luotong Wang et al. Front Genet. 2020 Aug 12;11:900. doi: 10.3389/fgene.2020.00900. eCollection 2020.

Abstract

Nanopore sequencing is regarded as one of the most promising third-generation sequencing (TGS) technologies. Since 2014, Oxford Nanopore Technologies (ONT) has developed a series of devices based on nanopore sequencing to produce very long reads, with an expected impact on genomics. However, nanopore sequencing reads are susceptible to a fairly high error rate owing to the difficulty of identifying the DNA bases from the complex electrical signals. Although several basecalling tools have been developed for nanopore sequencing over the past years, it is still challenging to correct the sequences after the basecalling procedure has been applied. In this study, we developed an open-source DNA basecalling reviser, NanoReviser, based on a deep learning algorithm, to correct the basecalling errors introduced by the current default basecallers. In our module, we re-segmented the raw electrical signals based on the basecalled sequences provided by the default basecallers. By employing convolutional neural networks (CNNs) and bidirectional long short-term memory (Bi-LSTM) networks, we took advantage of the information from both the raw electrical signals and the basecalled sequences. Our results showed that NanoReviser, as a post-basecalling reviser, significantly improved the basecalling quality. After being trained on standard ONT sequencing reads from public E. coli and human NA12878 datasets, NanoReviser reduced the sequencing error rate by over 5% for both the E. coli dataset and the human dataset. The performance of NanoReviser was found to be better than that of all current basecalling tools. Furthermore, we analyzed the modified bases of the E. coli dataset and added the methylation information to train our module. With the methylation annotation, NanoReviser reduced the error rate by 7% for the E. coli dataset and, specifically, by over 10% for the regions of the sequence rich in methylated bases. To the best of our knowledge, NanoReviser is the first post-processing tool after basecalling to accurately correct nanopore sequences without the time-consuming procedure of building a consensus sequence. The NanoReviser package is freely available at https://github.com/pkubioinformatics/NanoReviser.

Keywords: DNA methylation; convolution neural network; deep learning; long short-term memory networks; nanopore sequencing; sequencing revising.

Figures

**FIGURE 1**
Schematics of NanoReviser. **(A)** Structure of NanoReviser model building. **(B)** Structure of the main model. The preprocessed raw electrical signals (on the right) were passed through Identity Block and joined with the results of the preprocessed read input (on the left), passing through two bidirectional long short-term memory (Bi-LSTM) layers, and then the combination of the raw electrical signal features and read features were fed into the following Bi-LSTM layers. Finally, after the formation of two fully connected layers, the model gave a probability distribution of called bases. **(C)** Structure of Residential Block. Residential Block consisted of two convolutional layers and two batch normalization layers, which were used to accelerate the training speed. Conv stands for a convolutional layer and 1 × 3 was the size of the kernel used by the convolutional layer. **(D)** Structure of Identity Block. Identity Block consisted of three Residential Blocks. Conv stands for a convolutional layer and BN is the abbreviation for Batch Normalization.
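
The block structure in panels (C) and (D) can be illustrated with a minimal pure-Python sketch. It is not the NanoReviser implementation: the skip connection, the placement of the ReLU activations, the single-channel signal, the fixed (unlearned) kernel values, and the per-sequence normalization standing in for batch normalization are all assumptions made for illustration. Only the two 1 × 3 convolutions, the two normalization steps, and the stacking of three blocks come from the caption.

```python
import math

def conv1d_same(signal, kernel):
    """1-D sliding dot product with zero padding, so output length equals input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]

def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance scaling; a simplified stand-in for batch normalization."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    return [max(0.0, v) for v in values]

def residual_block(signal, kernel_a, kernel_b):
    """Conv -> norm -> ReLU -> conv -> norm, then add the input back (assumed skip connection)."""
    out = relu(normalize(conv1d_same(signal, kernel_a)))
    out = normalize(conv1d_same(out, kernel_b))
    return relu([x + y for x, y in zip(signal, out)])

def identity_block(signal, kernel_pairs):
    """Apply the residual blocks in sequence; panel (D) describes three of them."""
    for kernel_a, kernel_b in kernel_pairs:
        signal = residual_block(signal, kernel_a, kernel_b)
    return signal

# Hypothetical normalized current samples and hand-picked 1 x 3 kernels.
raw_signal = [0.1, 0.4, 0.35, 0.9, 0.85, 0.2, 0.15, 0.6]
kernel_pairs = [([0.25, 0.5, 0.25], [-1.0, 0.0, 1.0])] * 3
features = identity_block(raw_signal, kernel_pairs)
```

In a trained model the kernels would be learned and each layer would carry many channels; the sketch only shows how the output keeps the input length, which is what allows the signal features to be joined with the read features position by position.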

**FIGURE 2**
NanoReviser fitting performances for various window sizes and over many iterations. **(A)** Total loss value. **(B)** Softmax cross-entropy loss value. **(C)** Center loss value.
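
The three panels suggest a total loss made up of a softmax cross-entropy term and a center loss term. The sketch below shows one common way to combine them for a single example. The weighting factor `weight`, the feature dimension, and the example values are assumptions; the abstract and captions do not state how NanoReviser balances the two terms.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example with an integer class label."""
    return -math.log(softmax(logits)[label])

def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature vector and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def total_loss(logits, feature, label, centers, weight=0.1):
    """Cross-entropy plus weighted center loss; `weight` is a hypothetical hyperparameter."""
    return cross_entropy(logits, label) + weight * center_loss(feature, centers[label])

# Hypothetical example: four classes (A, C, G, T) and a 2-D feature vector.
centers = {0: [0.8, 0.1], 1: [-0.5, 0.7], 2: [0.0, -0.9], 3: [-0.6, -0.4]}
loss = total_loss([2.0, 0.5, 0.1, -1.0], [1.0, 0.0], label=0, centers=centers)
```

The cross-entropy term rewards the correct base receiving high probability, while the center loss term pulls features of the same base toward a shared center, which tends to make the classes more compact in feature space.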

**FIGURE 3**
Error rates of different error types on different genome areas. **(A)** Overall error rate. **(B)** Mismatch error rate. **(C)** Deletion error rate. NanoReviser (methylation) stands for NanoReviser trained with methylation information.
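
The error categories in this figure can be made concrete with a small sketch that aligns a read to its reference by minimum edit distance and counts each error type. This is a toy illustration, not the evaluation pipeline used in the study: the choice of alignment length as the denominator and the tie-breaking order in the traceback are assumptions, and real evaluations would normally rely on a dedicated read aligner.

```python
def error_profile(reference, read):
    """Count matches, mismatches, insertions and deletions on one minimum-edit-distance alignment."""
    n, m = len(reference), len(read)
    # dist[i][j] is the edit distance between reference[:i] and read[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == read[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or mismatch
                             dist[i - 1][j] + 1,         # base missing from the read
                             dist[i][j - 1] + 1)         # extra base in the read
    counts = {"match": 0, "mismatch": 0, "insertion": 0, "deletion": 0}
    i, j = n, m
    # Trace back from the bottom-right corner, preferring diagonal moves.
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == read[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["match" if cost == 0 else "mismatch"] += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts

def error_rates(reference, read):
    """Per-type and overall error rates, normalized by the number of alignment columns."""
    counts = error_profile(reference, read)
    columns = sum(counts.values())
    rates = {kind: counts[kind] / columns
             for kind in ("mismatch", "insertion", "deletion")}
    rates["overall"] = sum(rates.values())
    return rates

rates = error_rates("ACGT", "AGT")  # the read is missing the reference base C
```

Comparing these rates before and after revision, separately for methylated and unmethylated regions, is the kind of breakdown the panels summarize.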
