Genome Biol. 2021 Sep 6;22(1):261.

doi: 10.1186/s13059-021-02472-2.

NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks

Mian Umair Ahsan^#¹, Qian Liu^#¹, Li Fang¹, Kai Wang²,³

Affiliations

¹ Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
² Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA. wangk@email.chop.edu.
³ Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA. wangk@email.chop.edu.

^# Contributed equally.

PMID: 34488830
PMCID: PMC8419925
DOI: 10.1186/s13059-021-02472-2

Mian Umair Ahsan et al. Genome Biol. 2021.

Abstract

Long-read sequencing enables variant detection in genomic regions that are considered difficult to map by short-read sequencing. To fully exploit the benefits of longer reads, we present NanoCaller, a deep learning method that detects SNPs using long-range haplotype information, then phases long reads with the called SNPs and calls indels with local realignment. Evaluation on 8 human genomes demonstrates that NanoCaller generally achieves better performance than competing approaches. We experimentally validate 41 novel variants in a widely used benchmarking genome that could not be reliably detected previously. In summary, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing.

Keywords: Deep learning; Difficult-to-map regions; Long-range haplotype; Variant calling.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
An example of how to construct input features for a SNP candidate site. a Reference sequence and read pileups at candidate site b and at other genomic positions that share the same reads. The columns in gray are genomic positions that will not be used in input features for candidate site b because they do not satisfy the criteria for being highly likely heterozygous SNP sites. Only the columns with colored bases will be used to generate input features for site b and will constitute the set Z as described in the SNP pileups generation section of "Methods". These neighboring likely heterozygous sites can be up to thousands of bases away from candidate site b. b Reference sequence and read pileups for only the candidate site and neighboring highly likely heterozygous SNP sites. c Raw counts of bases at sites in the set Z for each read group, split by the nucleotide types at site b. The raw counts for reference bases are negated. d Flattened pileup image with fifth channel after reference sequence row is added. e Pileup image used as input features for NanoCaller deep convolutional neural network
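The grouping and sign convention in panel c can be illustrated with a minimal sketch. This is not NanoCaller's implementation: the function name `signed_group_counts`, the dictionary-based read layout, and the toy positions are all hypothetical, and the choice of which sites belong to the set Z is left to the caller.

```python
BASES = ("A", "C", "G", "T")

def signed_group_counts(reads, candidate, sites, reference):
    """Count bases at each site per read group, negating reference-base counts.

    reads:     list of dicts mapping position -> observed base (hypothetical layout)
    candidate: position of the candidate site
    sites:     positions to tabulate (stand-in for the set Z)
    reference: dict mapping position -> reference base
    """
    counts = {g: {s: {b: 0 for b in BASES} for s in sites} for g in BASES}
    for read in reads:
        group = read.get(candidate)  # reads are grouped by their base at the candidate
        if group not in counts:
            continue  # read does not cover the candidate site with a plain base
        for s in sites:
            base = read.get(s)
            if base in BASES:
                counts[group][s][base] += 1
    # Flip the sign of counts that match the reference base at each site.
    for g in BASES:
        for s in sites:
            ref = reference[s]
            counts[g][s][ref] = -counts[g][s][ref]
    return counts

# Toy data: candidate at position 100, two neighboring sites at 40 and 900.
reads = [
    {40: "A", 100: "C", 900: "G"},
    {40: "A", 100: "C", 900: "G"},
    {40: "T", 100: "T", 900: "C"},
    {40: "T", 100: "T"},
]
reference = {40: "A", 100: "C", 900: "G"}
counts = signed_group_counts(reads, candidate=100, sites=[40, 900], reference=reference)
```

In this toy input, reads carrying C at the candidate all show the reference base at both neighbors, so their counts come out negative, while reads carrying T show non-reference bases and keep positive counts.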

**Fig. 2**
An example of how to construct input features for an indel candidate site. a Reference sequence and read pileup at the candidate site before and after multiple sequence alignment, and the consensus sequence. b Reference sequence and consensus sequence at the candidate site before and after pairwise alignment, and the inferred sequence. c Raw counts of each symbol at each column of multiple sequence alignment pileup. d Matrix M, showing frequency of each symbol at each column of multiple sequence alignment pileup. e First channel of input image, matrix M minus Q (one-hot encoding of realigned reference sequence). f Matrix Q, one-hot encoding of realigned reference sequence, which forms the second channel of input image for NanoCaller deep convolutional neural network
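Panels c to f can be sketched in a few lines of plain Python. This is an illustrative sketch only: the symbol order in `SYMBOLS`, the function names, and the toy reads are assumptions, and the actual window size and alignment steps used by NanoCaller are not reproduced here.

```python
SYMBOLS = "ACGT-"  # assumed row order; '-' marks an alignment gap

def frequency_matrix(aligned_reads):
    """Matrix M: frequency of each symbol at each alignment column."""
    n_cols = len(aligned_reads[0])
    counts = [[0] * n_cols for _ in SYMBOLS]
    for read in aligned_reads:
        for col, sym in enumerate(read):
            counts[SYMBOLS.index(sym)][col] += 1
    depth = len(aligned_reads)
    return [[c / depth for c in row] for row in counts]

def one_hot(reference):
    """Matrix Q: one-hot encoding of the realigned reference sequence."""
    return [[1.0 if sym == s else 0.0 for sym in reference] for s in SYMBOLS]

def indel_channels(aligned_reads, reference):
    """Return (M - Q, Q), the two channels described in panels e and f."""
    m = frequency_matrix(aligned_reads)
    q = one_hot(reference)
    diff = [[mv - qv for mv, qv in zip(m_row, q_row)] for m_row, q_row in zip(m, q)]
    return diff, q

# Toy pileup: half of the reads carry a 1-bp deletion of the G in column 2.
reads = ["AC-T", "AC-T", "ACGT", "ACGT"]
reference = "ACGT"
first_channel, second_channel = indel_channels(reads, reference)
```

Columns where every read agrees with the reference cancel to zero in the first channel, so only the deleted column carries signal: a negative value in the G row and a positive value in the gap row.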

**Fig. 3**
Performance of NanoCaller and other variant callers on ten ONT datasets. SNP performance on whole-genome high-confidence intervals: a precision, b recall, c F1 score. F1 scores of SNP performance on d "all difficult-to-map" regions and e MHC. Indel performance in non-homopolymer regions: f precision, g recall, h F1 score. i F1 score of indel performance in whole-genome high-confidence intervals

**Fig. 4**
Performance of NanoCaller and other variant callers on four PacBio CCS and four PacBio CLR datasets. SNP performance on whole-genome high-confidence intervals using CCS reads: a precision, b recall, c F1 score. Indel performance on whole-genome high-confidence intervals using CCS reads: d precision, e recall, f F1 score. SNP performance on whole-genome high-confidence intervals using CLR reads: g precision, h recall, i F1 score
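For readers unfamiliar with the metrics plotted in Figs. 3 and 4, the sketch below computes precision, recall, and F1 from counts of true positives, false positives, and false negatives, using the standard definitions. The function name and the counts are hypothetical and are not taken from the paper's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 (harmonic mean of the two)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts: 90 correct calls, 10 spurious calls, 30 missed variants.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, which is why a caller with high precision but poor recall still scores modestly.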

**Fig. 5**
Evidence for a novel multiallelic SNP. a IGV plots of Nanopore, PacBio CCS, and Illumina reads of HG002 genome at chr3:5336450-5336480. b Sanger sequencing signal data for the same region. NanoCaller on older HG002 data correctly identified the multiallelic SNP at chr3:5336450 (A>T,C) shown in black box

**Fig. 6**
Evidence for novel deletions. a IGV plots of Nanopore, PacBio CCS, and Illumina reads of HG002 genome at chr9:135663780-135663850. The 40-bp-long deletion shown below in black box was identified using Sanger sequencing at chr9:135663795 or chr9:135663804 (both are correct and the difference is due to two different alignments). b Sanger sequencing signal data around the deletion

Grants and funding

R01 GM132713/GM/NIGMS NIH HHS/United States
