Comput Struct Biotechnol J. 2024 Sep 25;23:3430-3444. doi: 10.1016/j.csbj.2024.09.016. eCollection 2024 Dec.

BaseNet: A transformer-based toolkit for nanopore sequencing signal decoding

Qingwen Li et al. Comput Struct Biotechnol J.

Abstract

Nanopore sequencing provides a rapid, convenient and high-throughput solution for nucleic acid sequencing. Accurate basecalling in nanopore sequencing is crucial for downstream analysis. Traditional approaches such as Hidden Markov Models (HMM), Recurrent Neural Networks (RNN), and Convolutional Neural Networks (CNN) have improved basecalling accuracy, but there is a continuing need for higher accuracy and reliability. In this study, we introduce BaseNet (https://github.com/liqingwen98/BaseNet), an open-source toolkit that utilizes transformer models for advanced signal decoding in nanopore sequencing. BaseNet incorporates both autoregressive and non-autoregressive transformer-based decoding mechanisms, offering state-of-the-art algorithms freely accessible for future improvement. Our research indicates that cross-attention weights effectively map the relationship between current signals and base sequences, that joint loss training with a pair of forward and reverse decoders facilitates model convergence, and that large-scale pre-trained models achieve superior decoding accuracy. This study helps advance the field of nanopore sequencing signal decoding, contributes to technological advancements, and provides novel concepts and tools for researchers and practitioners.

Keywords: Basecall; Machine learning algorithm; Nanopore sequencing; Transformer.


Conflict of interest statement

Daqian Wang and Jizhong Lou are co-founders and shareholders of Beijing Polyseq Biotech Co. Ltd. Beijing Polyseq Biotech Co. Ltd. and Institute of Biophysics, Chinese Academy of Sciences have filed a patent using materials described in this article.

Figures

Graphical abstract

Fig. 1
Summary of benchmark dataset characteristics. (a) Data size statistics for each species in the dataset. (b) Read length distribution statistics for each species in the dataset. (c) Proportions of the different bases in the dataset.
Fig. 2
Fast attention mechanism in BaseNet. This schematic illustrates the fast attention mechanism used in BaseNet. The method involves training separate matrices, α and β, each with dimensions [N, 1], for Q and K, respectively. These matrices are used to perform weighted summations on Q and K, resulting in transformed matrices Q' and K'. The transformed matrices are then multiplied together, reducing the computational complexity to O(N·d).
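As a rough illustration of how such a linear-complexity attention can be implemented, the PyTorch sketch below pools Q and K over the sequence dimension with learned softmax weights playing the role of α and β; the projections, shapes, and the Fastformer-style elementwise interaction are assumptions for illustration and may differ from BaseNet's actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FastAttention(nn.Module):
        """Linear-complexity attention sketch: learned per-position weights pool
        Q and K over the sequence, so no full N x N attention matrix is formed."""

        def __init__(self, d_model: int):
            super().__init__()
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            # alpha / beta play the role of the [N, 1] weight vectors in Fig. 2;
            # here they are produced from the tokens themselves so that any
            # sequence length N is supported (an assumption of this sketch).
            self.alpha = nn.Linear(d_model, 1)
            self.beta = nn.Linear(d_model, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: [batch, N, d_model]
            q = self.q_proj(x)                                   # [B, N, d]
            k = self.k_proj(x)                                   # [B, N, d]
            v = self.v_proj(x)                                   # [B, N, d]

            # Weighted summation of Q over positions -> global query Q' ([B, 1, d]).
            a = F.softmax(self.alpha(q), dim=1)                  # [B, N, 1]
            q_global = (a * q).sum(dim=1, keepdim=True)          # [B, 1, d]

            # Interact Q' with K, then pool K the same way -> global key K'.
            k = k * q_global                                     # [B, N, d]
            b = F.softmax(self.beta(k), dim=1)                   # [B, N, 1]
            k_global = (b * k).sum(dim=1, keepdim=True)          # [B, 1, d]

            # Multiply the pooled representation into V: every step is O(N * d).
            return v * k_global                                  # [B, N, d]

    if __name__ == "__main__":
        attn = FastAttention(d_model=256)
        signal_features = torch.randn(2, 1000, 256)   # 1000 down-sampled signal frames
        print(attn(signal_features).shape)            # torch.Size([2, 1000, 256])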
Fig. 3
Autoregressive Transformer-based model architecture for nanopore sequencing. The schematic illustrates the architecture of the autoregressive transformer model tailored for nanopore sequencing data. The model includes convolutional modules for feature extraction and down-sampling, an encoder composed of 8 layers for context modeling, and a decoder with 8 layers for sequence generation.
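A minimal PyTorch sketch of such an encoder-decoder basecaller is given below; the 8-layer encoder and 8-layer decoder follow the caption, while the convolution strides, model width, vocabulary, and the omission of positional encodings are illustrative assumptions rather than BaseNet's exact configuration.

    import torch
    import torch.nn as nn

    class SignalTransformer(nn.Module):
        """Autoregressive encoder-decoder sketch for raw nanopore current signals.
        Positional encodings are omitted here for brevity."""

        def __init__(self, d_model=256, nhead=8, vocab_size=6):
            super().__init__()
            # Convolutional front end: feature extraction + temporal down-sampling.
            self.conv = nn.Sequential(
                nn.Conv1d(1, d_model // 2, kernel_size=5, stride=1, padding=2),
                nn.GELU(),
                nn.Conv1d(d_model // 2, d_model, kernel_size=5, stride=2, padding=2),
                nn.GELU(),
                nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
                nn.GELU(),
            )
            # 8 encoder layers for context modelling, 8 decoder layers for generation.
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=nhead,
                num_encoder_layers=8, num_decoder_layers=8,
                dim_feedforward=4 * d_model, batch_first=True,
            )
            # Vocabulary: A, C, G, T plus special tokens such as <sos>/<eos>.
            self.embed = nn.Embedding(vocab_size, d_model)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, signal, tokens):
            # signal: [B, T] raw current values; tokens: [B, L] shifted base IDs.
            feats = self.conv(signal.unsqueeze(1)).transpose(1, 2)   # [B, T', d]
            causal = self.transformer.generate_square_subsequent_mask(tokens.size(1))
            dec = self.transformer(feats, self.embed(tokens), tgt_mask=causal)
            return self.out(dec)                                     # [B, L, vocab]

    if __name__ == "__main__":
        model = SignalTransformer()
        logits = model(torch.randn(2, 4000), torch.randint(0, 6, (2, 100)))
        print(logits.shape)   # torch.Size([2, 100, 6])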
Fig. 4
Self-supervised large-scale model architecture in BaseNet. The schematic illustrates the self-supervised large-scale model architecture developed in BaseNet. The model comprises three key components: a feature extraction module, an encoder module for context modeling, and a quantization module for learning discrete common features through self-supervised pre-training.
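The quantization step can be sketched with a Gumbel-softmax codebook of the kind used in wav2vec 2.0-style self-supervised pre-training; the codebook size and single-group layout below are assumptions for illustration, not BaseNet's actual settings.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GumbelQuantizer(nn.Module):
        """Maps continuous signal features to discrete codebook entries so the
        encoder can be pre-trained against discrete common-feature targets."""

        def __init__(self, d_in=256, codebook_size=320, d_code=256):
            super().__init__()
            self.logits = nn.Linear(d_in, codebook_size)       # per-frame code logits
            self.codebook = nn.Parameter(torch.randn(codebook_size, d_code) * 0.02)

        def forward(self, feats, tau=2.0):
            # feats: [B, T, d_in] output of the feature-extraction module.
            logits = self.logits(feats)                         # [B, T, codebook_size]
            if self.training:
                # Differentiable one-hot selection during pre-training.
                one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
            else:
                one_hot = F.one_hot(logits.argmax(-1), logits.size(-1)).float()
            quantized = one_hot @ self.codebook                 # [B, T, d_code]
            return quantized, one_hot.argmax(-1)                # codes for the loss

    if __name__ == "__main__":
        q = GumbelQuantizer()
        z, ids = q(torch.randn(2, 500, 256))
        print(z.shape, ids.shape)   # torch.Size([2, 500, 256]) torch.Size([2, 500])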
Fig. 5
Rescore and joint loss training model in BaseNet. This schematic illustrates the rescore and joint loss training model used in BaseNet. The model consists of three main components: a shared encoder, a CTC decoder, and attention decoders. Training uses a joint loss to optimize performance.
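A sketch of how such a joint objective can be combined is given below, assuming a CTC branch plus forward and reverse attention (cross-entropy) branches with illustrative interpolation weights; the exact weighting used by BaseNet is not stated in the caption.

    import torch
    import torch.nn.functional as F

    def joint_loss(ctc_log_probs, input_lengths, target_lengths,
                   fwd_logits, rev_logits, targets, targets_rev,
                   ctc_weight=0.3, rev_weight=0.3):
        """Combine CTC loss with cross-entropy from forward/reverse attention decoders.

        ctc_log_probs: [T, B, vocab] log-probabilities from the shared encoder's CTC head
        fwd_logits / rev_logits: [B, L, vocab] outputs of the two attention decoders
        targets / targets_rev: [B, L] base IDs in forward and reversed order
        ctc_weight / rev_weight: illustrative interpolation weights (assumptions)
        """
        ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                         blank=0, zero_infinity=True)
        att_fwd = F.cross_entropy(fwd_logits.transpose(1, 2), targets, ignore_index=-100)
        att_rev = F.cross_entropy(rev_logits.transpose(1, 2), targets_rev, ignore_index=-100)
        return (ctc_weight * ctc
                + (1 - ctc_weight) * ((1 - rev_weight) * att_fwd + rev_weight * att_rev))

At decoding time, such a setup typically lets the CTC branch propose candidate sequences that the attention decoders then rescore.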
Fig. 6
Paraformer Architecture in BaseNet. The schematic depicts the architecture of the Paraformer model developed in BaseNet. The model features an encoder for generating hidden representations, a predictor for producing acoustic embeddings and predicting sequence lengths, a sampler for randomly creating semantic embeddings, and a decoder for generating outputs.
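The following PyTorch sketch wires these four components together for single-pass non-autoregressive decoding; the predictor here is a simplified length-predictor stand-in rather than a full CIF module, and all sizes and the sampling ratio are illustrative assumptions rather than BaseNet's actual code.

    import torch
    import torch.nn as nn

    class ParaformerSketch(nn.Module):
        """Single-pass non-autoregressive decoding sketch: encoder -> predictor
        (acoustic embeddings + length) -> sampler (training only) -> decoder."""

        def __init__(self, d_model=256, vocab_size=6, max_len=600):
            super().__init__()
            enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
            # Predictor stand-in: learned queries form "acoustic embeddings" of the
            # predicted output length, and a per-frame head integrates to a length.
            self.queries = nn.Parameter(torch.randn(max_len, d_model) * 0.02)
            self.length_head = nn.Linear(d_model, 1)
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
            self.embed = nn.Embedding(vocab_size, d_model)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, feats, targets=None, mix=0.5):
            # feats: [B, T, d] down-sampled signal features.
            enc = self.encoder(feats)
            pred_len = self.length_head(enc).squeeze(-1).sigmoid().sum(1)   # [B]
            n = targets.size(1) if targets is not None else int(pred_len.max().item())
            acoustic = self.queries[:n].unsqueeze(0).expand(feats.size(0), -1, -1)
            if self.training and targets is not None:
                # Sampler: randomly mix ground-truth (semantic) embeddings into the
                # acoustic embeddings so the decoder learns to refine them.
                keep = torch.rand(acoustic.shape[:2], device=feats.device) < mix
                acoustic = torch.where(keep.unsqueeze(-1), self.embed(targets), acoustic)
            dec = self.decoder(acoustic, enc)        # no causal mask: non-autoregressive
            return self.out(dec), pred_len

    if __name__ == "__main__":
        model = ParaformerSketch()
        logits, lengths = model(torch.randn(2, 800, 256), torch.randint(0, 6, (2, 120)))
        print(logits.shape, lengths.shape)   # torch.Size([2, 120, 6]) torch.Size([2])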
Fig. 7
Performance of BaseNet under different learning rates and restricted weights. (a) Learning rate variation under three schedulers: CosineDecay, Noam, and WarmupLR. The warmup step is set to 1000 and the total step count to 10,000. (b) Autoregressive transformer prediction performance with different restricted weights. Performance drops significantly with no EOS constraint (w = 0) or strict constraints (w ≥ 0.8). Optimal performance is observed for w values from 0.1 to 0.7.
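For reference, the three schedules in panel (a) can be written as simple functions of the training step; the sketch below assumes the standard Noam and ESPnet-style WarmupLR formulas with warmup = 1000, and a plain cosine decay without warmup, so the constants are illustrative rather than BaseNet's exact settings.

    import math

    def noam_lr(step, d_model=256, warmup=1000, factor=1.0):
        """Noam schedule ('Attention Is All You Need'): linear warmup then
        inverse-square-root decay, scaled by d_model ** -0.5."""
        step = max(step, 1)
        return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

    def warmup_lr(step, base_lr=1e-3, warmup=1000):
        """WarmupLR (ESPnet-style): same shape as Noam but normalised so the
        peak value equals base_lr at the end of warmup."""
        step = max(step, 1)
        return base_lr * warmup ** 0.5 * min(step ** -0.5, step * warmup ** -1.5)

    def cosine_decay_lr(step, base_lr=1e-3, total_steps=10_000, min_lr=0.0):
        """Cosine decay from base_lr down to min_lr over total_steps."""
        t = min(step, total_steps) / total_steps
        return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

    if __name__ == "__main__":
        for s in (1, 1000, 5000, 10_000):
            print(s, noam_lr(s), warmup_lr(s), cosine_decay_lr(s))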
Fig. 8
Cross-attention weights between current signals and base sequences in transformer decoder. (a) Visualization of cross-attention weights across different decoder layers for a specific sequence. The linear relationship between signal and sequence follows a weak-strong-weak pattern among layers. Layers four to six exhibit the strongest linear relationships, indicating that they effectively capture both local and global features. In contrast, layers one to three and seven to eight show no or weaker linear relationships, indicating a focus on either local or global features without integrating both. (b) Visualization of cross-attention weights across different sequences in decoder layer four.
Fig. 9
Performance comparison of different large-scale models. (a) Performance comparison of different fine-tuned models. Among the 7 fine-tuned models, the model with one additional linear layer achieved the best performance, reaching 95.86% identity after 10 epochs of fine-tuning. (b) Accuracy comparison of large models under different training conditions as indicated. (c) Training loss curves for signal fine-tuning, speech fine-tuning, and training from scratch.
Fig. 10
Performance comparison of different basecallers. (a) Identity comparison of different basecallers. (b) Identity comparison of different basecallers at different sequence lengths. Error bars represent standard errors. The fine-tuned model performs best across all lengths, and when the sequence length exceeds 20,000, the Joint-CTC model outperforms the Bonito-CRF model.
