Lean and deep models for more accurate filtering of SNP and INDEL variant calls
- PMID: 31830260
- DOI: 10.1093/bioinformatics/btz901
Abstract
Summary: We investigate convolutional neural networks (CNNs) for filtering small genomic variants in short-read DNA sequence data. Errors introduced during sequencing and library preparation make variant calling a difficult task. Encoding the reference genome and the aligned reads covering sites of genetic variation as numeric tensors allows us to leverage CNNs for variant filtration. Convolutions over these tensors learn to detect motifs useful for classifying variants. Variant filtering models are trained to classify variants as either artifacts or real variation. Visualizing the learned weights of the CNN confirmed that it detects familiar DNA motifs known to correlate with real variation, such as homopolymers and short tandem repeats (STRs). After confirming the biological plausibility of the learned features, we compared our model to current state-of-the-art filtration methods such as Gaussian Mixture Models, Random Forests and CNNs designed for image classification, like DeepVariant. We demonstrate improvements in both sensitivity and precision. The tensor encoding was carefully tailored for processing genomic data, respecting the qualitative differences in structure between DNA and natural images. Ablation tests quantitatively measured the benefits of our tensor encoding strategy. Bayesian hyper-parameter optimization confirmed our notion that architectures designed with DNA data in mind outperform off-the-shelf image classification models. Our cross-generalization analysis identified idiosyncrasies in truth resources, pointing to the need for new methods to construct genomic truth data. Our results show that models trained on heterogeneous data types and diverse truth resources generalize well to new datasets, negating the need to train separate models for each data type.
Availability and implementation: This work is available in the Genome Analysis Toolkit (GATK) with the tool name CNNScoreVariants (https://github.com/broadinstitute/gatk).
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
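The idea in the Summary of encoding sequence as a numeric tensor and convolving over it can be illustrated with a small sketch. This is not the GATK CNNScoreVariants implementation: it encodes only a reference window (no reads, base qualities or annotations), and the kernel is set by hand to respond to a run of A's, whereas the paper's filters are learned during training. The names `one_hot_reference`, `conv1d_valid` and `homopolymer_a` are hypothetical and chosen for illustration only.

```python
import numpy as np

BASES = "ACGT"  # channel order is an assumption made for this sketch


def one_hot_reference(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) tensor; non-ACGT bases (e.g. N) stay all-zero."""
    tensor = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            tensor[i, j] = 1.0
    return tensor


def conv1d_valid(tensor: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a (width, 4) kernel along the sequence axis (stride 1, no padding)."""
    width = kernel.shape[0]
    n_out = tensor.shape[0] - width + 1
    return np.array(
        [np.sum(tensor[i:i + width] * kernel) for i in range(n_out)],
        dtype=np.float32,
    )


# Hand-set kernel of width 4 that counts A's under the window,
# so it peaks on an A homopolymer of length >= 4.
homopolymer_a = np.zeros((4, len(BASES)), dtype=np.float32)
homopolymer_a[:, BASES.index("A")] = 1.0

window = "GCTAAAAAGTN"  # toy reference window around a hypothetical variant site
activations = conv1d_valid(one_hot_reference(window), homopolymer_a)
print(activations.tolist())
```

The activation is highest where the kernel sits entirely inside the run of A's, which is the kind of motif response the Summary describes the trained network as having learned for homopolymers and STRs.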