Front Genet. 2020 Aug 12;11:900.
doi: 10.3389/fgene.2020.00900. eCollection 2020.

NanoReviser: An Error-Correction Tool for Nanopore Sequencing Based on a Deep Learning Algorithm


Luotong Wang et al. Front Genet. 2020.

Abstract

Nanopore sequencing is regarded as one of the most promising third-generation sequencing (TGS) technologies. Since 2014, Oxford Nanopore Technologies (ONT) has developed a series of devices based on nanopore sequencing to produce very long reads, with an expected impact on genomics. However, nanopore sequencing reads are susceptible to a fairly high error rate owing to the difficulty of identifying the DNA bases from the complex electrical signals. Although several basecalling tools have been developed for nanopore sequencing over the past years, it is still challenging to correct the sequences after the basecalling procedure has been applied. In this study, we developed an open-source DNA basecalling reviser, NanoReviser, based on a deep learning algorithm, to correct the basecalling errors introduced by the default basecallers. In our model, we re-segmented the raw electrical signals based on the basecalled sequences provided by the default basecallers. By employing convolutional neural networks (CNNs) and bidirectional long short-term memory (Bi-LSTM) networks, we took advantage of the information from both the raw electrical signals and the basecalled sequences. Our results showed that NanoReviser, as a post-basecalling reviser, significantly improved the basecalling quality. After being trained on standard ONT sequencing reads from public E. coli and human NA12878 datasets, NanoReviser reduced the sequencing error rate by over 5% for both the E. coli dataset and the human dataset. NanoReviser was found to perform better than all current basecalling tools. Furthermore, we analyzed the modified bases of the E. coli dataset and added the methylation information when training our model. With the methylation annotation, NanoReviser reduced the error rate by 7% for the E. coli dataset and specifically reduced the error rate by over 10% for the regions of the sequence rich in methylated bases.
To the best of our knowledge, NanoReviser is the first post-processing tool after basecalling to accurately correct the nanopore sequences without the time-consuming procedure of building the consensus sequence. The NanoReviser package is freely available at https://github.com/pkubioinformatics/NanoReviser.

Keywords: DNA methylation; convolution neural network; deep learning; long short-term memory networks; nanopore sequencing; sequencing revising.


Figures

FIGURE 1
Schematics of NanoReviser. (A) Structure of NanoReviser model building. (B) Structure of the main model. The preprocessed raw electrical signals (on the right) were passed through Identity Block and joined with the results of the preprocessed read input (on the left), passing through two bidirectional long short-term memory (Bi-LSTM) layers, and then the combination of the raw electrical signal features and read features was fed into the following Bi-LSTM layers. Finally, after two fully connected layers, the model gave a probability distribution of called bases. (C) Structure of Residential Block. Residential Block consisted of two convolutional layers and two batch normalization layers, which were used to accelerate training. Conv stands for a convolutional layer and 1 × 3 was the size of the kernel used by the convolutional layer. (D) Structure of Identity Block. Identity Block consisted of three Residential Blocks. Conv stands for a convolutional layer and BN is the abbreviation for Batch Normalization.
FIGURE 2
NanoReviser fitting performances for various window sizes and over many iterations. (A) Total loss value. (B) Softmax cross-entropy loss value. (C) Center loss value.
FIGURE 3
Error rates of different error types on different genome areas. (A) Overall error rate. (B) Mismatch error rate. (C) Deletion error rate. NanoReviser (methylation) stands for NanoReviser trained with methylation information.
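
The error types reported in Figure 3 (mismatch, deletion, and the overall rate) can be made concrete with a small sketch. The Python example below is not taken from NanoReviser: the function name `error_profile`, the unit-cost global alignment, and the normalization by alignment length are all assumptions made for illustration. Published evaluations typically rely on a dedicated read aligner and may normalize differently.

```python
def error_profile(read: str, ref: str) -> dict:
    """Count matches, mismatches, insertions and deletions of `read`
    relative to `ref` using a unit-cost global alignment.

    Hypothetical helper for illustration only; not part of NanoReviser.
    """
    n, m = len(read), len(ref)

    # dp[i][j] = minimal edit cost of aligning read[:i] with ref[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dp[i - 1][j - 1] + (read[i - 1] != ref[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Trace back one optimal alignment and classify each column.
    counts = {"match": 0, "mismatch": 0, "insertion": 0, "deletion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (read[i - 1] != ref[j - 1])):
            key = "match" if read[i - 1] == ref[j - 1] else "mismatch"
            counts[key] += 1
            i -= 1
            j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["insertion"] += 1  # extra base present in the read
            i -= 1
        else:
            counts["deletion"] += 1   # reference base missing from the read
            j -= 1

    # Assumption: error rate = erroneous columns / alignment length.
    aligned = sum(counts.values())
    errors = aligned - counts["match"]
    counts["error_rate"] = errors / aligned if aligned else 0.0
    return counts


if __name__ == "__main__":
    print(error_profile("ACT", "ACGT"))    # one deleted base
    print(error_profile("ACGGT", "ACGT"))  # one inserted base
    print(error_profile("AGGT", "ACGT"))   # one mismatched base
```

Under this definition, a reduction in error rate such as the one described in the abstract would correspond to fewer mismatch, insertion, and deletion columns per aligned position after revision than before it.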

