Genome Biol. 2021 Sep 6;22(1):261.
doi: 10.1186/s13059-021-02472-2.

NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks

Mian Umair Ahsan et al. Genome Biol.

Abstract

Long-read sequencing enables variant detection in genomic regions that are considered difficult-to-map by short-read sequencing. To fully exploit the benefits of longer reads, here we present a deep learning method NanoCaller, which detects SNPs using long-range haplotype information, then phases long reads with called SNPs and calls indels with local realignment. Evaluation on 8 human genomes demonstrates that NanoCaller generally achieves better performance than competing approaches. We experimentally validate 41 novel variants in a widely used benchmarking genome, which could not be reliably detected previously. In summary, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing.

Keywords: Deep learning; Difficult-to-map regions; Long-range haplotype; Variant calling.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
An example of how to construct input features for a SNP candidate site. a Reference sequence and read pileups at candidate site b and at other genomic positions that share the same reads. The columns in gray are genomic positions that will not be used in input features for candidate site b because they do not satisfy the criteria for being highly likely heterozygous SNP sites. Only the columns with colored bases will be used to generate input features for site b and will constitute the set Z as described in the SNP pileups generation section of "Methods". These neighboring likely heterozygous sites can be up to thousands of bases away from candidate site b. b Reference sequence and read pileups for only the candidate site and neighboring highly likely heterozygous SNP sites. c Raw counts of bases at sites in the set Z for each read group, with reads grouped by the nucleotide type they carry at site b. Counts of reference bases are given a negative sign. d Flattened pileup image with fifth channel after the reference sequence row is added. e Pileup image used as input features for the NanoCaller deep convolutional neural network
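The counting step in panel c can be sketched as follows. This is a minimal illustration under stated assumptions, not NanoCaller's actual code: the function name `snp_pileup_counts`, the dict-based read representation, and the toy reads are all hypothetical, and the list `sites` stands in for the set Z, whose selection criteria are not reproduced here.

```python
from collections import Counter

BASES = "ACGT"

def snp_pileup_counts(reads, ref, candidate, sites):
    """Count bases at each site in `sites`, per read group defined by the base at `candidate`.

    reads: list of dicts mapping genomic position -> observed base (hypothetical format)
    ref:   dict mapping genomic position -> reference base
    Returns {group_base: [[n_A, n_C, n_G, n_T] for each site]},
    with the count of the reference base negated.
    """
    features = {}
    for group_base in BASES:
        # Reads that carry `group_base` at the candidate site form one group.
        group = [r for r in reads if r.get(candidate) == group_base]
        rows = []
        for pos in sites:
            counts = Counter(r[pos] for r in group if pos in r)
            # Reference-base counts get a negative sign, other counts stay positive.
            rows.append([-counts[b] if b == ref[pos] else counts[b] for b in BASES])
        features[group_base] = rows
    return features

# Toy data (made up for illustration): three reads covering positions 10 and 20.
reads = [{10: "A", 20: "C"}, {10: "A", 20: "T"}, {10: "G", 20: "T"}]
ref = {10: "A", 20: "C"}
features = snp_pileup_counts(reads, ref, candidate=10, sites=[10, 20])
```

Here `features["A"]` holds the counts for reads that show A at the candidate site, one row per site in `sites`. How these per-group counts are stacked into the final image channels (panels d and e) is not shown.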
Fig. 2
An example of how to construct input features for an indel candidate site. a Reference sequence and read pileup at the candidate site before and after multiple sequence alignment, and the consensus sequence. b Reference sequence and consensus sequence at the candidate site before and after pairwise alignment, and the inferred sequence. c Raw counts of each symbol at each column of the multiple sequence alignment pileup. d Matrix M, showing the frequency of each symbol at each column of the multiple sequence alignment pileup. e First channel of the input image, matrix M minus Q (one-hot encoding of the realigned reference sequence). f Matrix Q, the one-hot encoding of the realigned reference sequence, which forms the second channel of the input image for the NanoCaller deep convolutional neural network
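Panels c to f can be sketched in a few lines. This is an illustrative sketch only: the symbol order `ACGT-`, the orientation (one row per alignment column), the helper names, and the toy alignment are assumptions, and the alignment steps of panels a and b are taken as already done.

```python
SYMBOLS = "ACGT-"  # assumed alphabet and ordering; "-" marks an alignment gap

def frequency_matrix(aligned_reads):
    """Matrix M: frequency of each symbol at each column of the aligned pileup."""
    n_cols = len(aligned_reads[0])
    matrix = []
    for col in range(n_cols):
        column = [read[col] for read in aligned_reads]
        matrix.append([column.count(s) / len(column) for s in SYMBOLS])
    return matrix

def one_hot(aligned_ref):
    """Matrix Q: one-hot encoding of the realigned reference sequence."""
    return [[1.0 if base == s else 0.0 for s in SYMBOLS] for base in aligned_ref]

def indel_channels(aligned_reads, aligned_ref):
    """Return (M - Q, Q), the two channels described in panels e and f."""
    M = frequency_matrix(aligned_reads)
    Q = one_hot(aligned_ref)
    first = [[m - q for m, q in zip(m_row, q_row)] for m_row, q_row in zip(M, Q)]
    return first, Q

# Toy alignment (made up): half of the reads carry a G inserted relative to the reference.
aligned_reads = ["AC-T", "AC-T", "ACGT", "ACGT"]
aligned_ref = "AC-T"
first_channel, second_channel = indel_channels(aligned_reads, aligned_ref)
```

In this toy case the third column of `first_channel` is positive for G and negative for the gap symbol, which is the kind of signal an inserted base would leave; columns where all reads match the reference are all zeros.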
Fig. 3
Performance of NanoCaller and other variant callers on ten ONT datasets. SNP performance on whole-genome high-confidence intervals: a precision, b recall, c F1 score. F1 scores of SNP performance on d “all difficult-to-map” regions and e MHC. Indel performance in non-homopolymer regions: f precision, g recall, h F1 score. i F1 score of indel performance in whole-genome high-confidence intervals
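For readers unfamiliar with the three metrics plotted here, they follow the standard definitions from counts of true positive, false positive, and false negative calls. The sketch below uses a hypothetical helper name and made-up counts; it does not reproduce any result from the paper or the behavior of any particular benchmarking tool.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from call counts.

    tp: variants called and present in the truth set
    fp: variants called but absent from the truth set
    fn: truth-set variants that were not called
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative counts only.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

With these counts precision is 0.9 and recall is 0.75; F1 falls between the two and closer to the lower value, which is why it is a common single-number summary when precision and recall differ.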
Fig. 4
Performance of NanoCaller and other variant callers on four PacBio CCS and four PacBio CLR datasets. SNP performance on whole-genome high-confidence intervals using CCS reads: a precision, b recall, c F1 score. Indel performance on whole-genome high-confidence intervals using CCS reads: d precision, e recall, f F1 score. SNP performance on whole-genome high-confidence intervals using CLR reads: g precision, h recall, i F1 score
Fig. 5
Evidence for a novel multiallelic SNP. a IGV plots of Nanopore, PacBio CCS, and Illumina reads of HG002 genome at chr3:5336450-5336480. b Sanger sequencing signal data for the same region. NanoCaller on older HG002 data correctly identified the multiallelic SNP at chr3:5336450 (A>T,C) shown in black box
Fig. 6
Evidence for novel deletions. a IGV plots of Nanopore, PacBio CCS, and Illumina reads of HG002 genome at chr9:135663780-135663850. The 40-bp-long deletion shown below in black box was identified using Sanger sequencing at chr9:135663795 or chr9:135663804 (both are correct and the difference is due to two different alignments). b Sanger sequencing signal data around the deletion
