Nat Biotechnol. 2023 Jun;41(6):870-877.
doi: 10.1038/s41587-022-01559-w. Epub 2023 Jan 2.

Control-independent mosaic single nucleotide variant detection with DeepMosaic


Xiaoxu Yang et al. Nat Biotechnol. 2023 Jun.

Abstract

Mosaic variants (MVs) reflect mutagenic processes during embryonic development and environmental exposure, accumulate with aging and underlie diseases such as cancer and autism. The detection of noncancer MVs has been computationally challenging due to the sparse representation of nonclonally expanded MVs. Here we present DeepMosaic, combining an image-based visualization module for single nucleotide MVs and a convolutional neural network-based classification module for control-independent MV detection. DeepMosaic was trained on 180,000 simulated or experimentally assessed MVs, and was benchmarked on 619,740 simulated MVs and 530 independent biologically tested MVs from 16 genomes and 181 exomes. DeepMosaic achieved higher accuracy compared with existing methods on biological data, with a sensitivity of 0.78, specificity of 0.83 and positive predictive value of 0.96 on noncancer whole-genome sequencing data, as well as doubling the validation rate over previous best-practice methods on noncancer whole-exome sequencing data (0.43 versus 0.18). DeepMosaic represents an accurate MV classifier for noncancer samples that can be implemented as an alternative or complement to existing methods.

Conflict of interest statement

Competing Interests Statement:

L.B.A. is a compensated consultant and has equity interest in io9, LLC. His spouse is an employee of Biotheranostics, Inc. L.B.A. is an inventor of US Patent 10,776,718 and also declares U.S. provisional applications with serial numbers 63/289,601, 63/269,033, 63/366,392 and 63/367,846. All other authors declare no competing interests.

Figures

Fig. 1. Image representation, model training strategies, and framework of DeepMosaic.
a, DeepMosaic-VM: Composite RGB image representation of sequenced reads separated into “Ref” (reads supporting the reference allele) and “Alts” (reads supporting alternative alleles), each outlined in yellow. b, Red channel of the compound image contains base information from the BAM file. “D”: deletion; “A”: adenine; “C”: cytosine; “G”: guanine; “T”: thymine; “N”: low-quality base. Yellow box (Var): candidate position, centered in the image. c, Green channel: base quality information. Note that channel intensity was modulated in this example for better visualization. d, Blue channel: strand information (i.e., forward or reverse). e, Model training, model selection, and overall benchmark strategy for DeepMosaic-CM (Methods and Supplementary Fig. 1). Ten different convolutional neural network models were trained on 180,000 experimentally validated positive and negative biological variants from 29 WGS datasets from 6 individuals sequenced at 100x (BioData1), as well as simulated data with different AFs (SimData1) resampled to a different depth. Models were evaluated on an independent gold-standard biological dataset from the 250x WGS data of the Reference Tissue Project of the Brain Somatic Mosaicism Network (BioData2) as well as an independent 300x WGS dataset from the Brain Somatic Mosaicism Network Capstone project (BioData3). DeepMosaic was further benchmarked on 16 independent biological datasets from 200x WGS data (BioData4), on 181 independently generated 300x noncancer WES data (BioData5), 2430 TCGA-MC3 WES samples (BioData6), as well as 619,740 independently simulated variants (SimData2 and SimData3). Deep amplicon sequencing was carried out as an independent evaluation of variants detected by different software (Supplementary Table 1). f, Application of DeepMosaic-CM in practice. Input images are generated from the candidate variants. Sixteen convolutional layers extracted information from input images. Population genomic features were assembled for the final output. Images of positive and negative variants are shown as examples. Conv: convolutional layers; MBConv: mobile convolutional layers.
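The three-channel encoding described in panels a-d can be illustrated with a minimal sketch. This is not the DeepMosaic-VM implementation: the caption names what each channel carries but not the numeric values, so the intensity codes in `BASE_CODE`, the quality scaling, the strand values, and the function names `order_ref_then_alts` and `encode_pileup` are all assumptions made for illustration.

```python
import numpy as np

# Assumed red-channel intensities; the caption lists the symbols only.
BASE_CODE = {"D": 40, "A": 80, "C": 120, "G": 160, "T": 200, "N": 240}


def order_ref_then_alts(reads, center, ref_base):
    """Place reads supporting the reference allele first, then all others."""
    ref = [r for r in reads if r[0][center] == ref_base]
    alts = [r for r in reads if r[0][center] != ref_base]
    return ref + alts


def encode_pileup(reads, width):
    """Encode reads as an RGB array of shape (n_reads, width, 3).

    Each read is a tuple (bases, quals, is_reverse), where bases is a string
    of length `width` and quals is a list of Phred scores of the same length.
    """
    img = np.zeros((len(reads), width, 3), dtype=np.uint8)
    for row, (bases, quals, is_reverse) in enumerate(reads):
        for col in range(width):
            img[row, col, 0] = BASE_CODE[bases[col]]      # red: base identity
            img[row, col, 1] = min(quals[col], 60) * 4    # green: base quality (assumed scaling)
            img[row, col, 2] = 255 if is_reverse else 128  # blue: strand (assumed values)
    return img


# Toy pileup of two reads over five positions; the candidate site is column 2.
reads = [("ACTTA", [20] * 5, True), ("ACGTA", [30] * 5, False)]
ordered = order_ref_then_alts(reads, center=2, ref_base="G")
image = encode_pileup(ordered, width=5)
```

Here the candidate position sits in the center column, matching the caption's description of the variant being centered in the image.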
Fig. 2. DeepMosaic performance on simulated benchmark variants.
a, Benchmark test on 180,540 genomic positions (SimData3) generated by replacing reads from biological data with simulated MVs. DeepMosaic showed higher accuracy, F1 score, Matthews correlation coefficient (MCC) and sensitivity than widely accepted methods for mosaic variant detection, with comparable specificity; the specificity of all callers was close to 1. b, Sensitivity of DeepMosaic and other mosaic callers on SimData3 at simulated read depths and AFs. DeepMosaic performed as well as or better than other tested methods, especially at lower read depths and lower expected AFs. Variant-stabilized square-root transformation was used for visualization purposes.
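For reference, the benchmark metrics named in this caption and in the abstract (accuracy, F1 score, MCC, sensitivity, specificity, positive predictive value) follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `classification_metrics` and the example counts are illustrative assumptions, not values from the paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes each of (tp + fn), (tn + fp) and (tp + fp) is nonzero.
    """
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        # MCC is conventionally set to 0 when its denominator is 0.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=8, fn=2)
```

With these toy counts every metric except MCC equals 0.8, while MCC is 0.6, which shows why MCC is often reported alongside accuracy and F1.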
Fig. 3. DeepMosaic performance validated on biological data.
a, DeepMosaic and other mosaic variant detection methods were applied to 200x whole-genome sequencing data from 16 samples, which were not used in the training or validation stage for any of the listed methods (BioData4). Raw variant lists were obtained in one of three ways: by comparing samples against a panel of normals with MuTect2 single mode; by comparing different samples from the same individual with MuTect2 paired mode or Strelka2 somatic mode; or by direct detection without a control using MosaicHunter single mode with heuristic filters. A total of 46,928 candidate variants from MuTect2 single mode were analyzed by DeepMosaic and MosaicForecast. Orthogonal validation with deep amplicon sequencing was carried out on a total of 239 variants out of the 1,146 candidates called by at least one method. b, Distribution of AFs of the whole candidate mosaic variant list and the 239 experimentally quantified variants. c, Comparison of validation results between different mosaic variant calling methods. The ‘UpSet’ plot shows the intersection of different mosaic detection methods and the validation result of each category. Variants identified by DeepMosaic showed high sensitivity and specificity on biological data. d, Comparison of validation rates across AF range bins. DeepMosaic showed the highest validation rate across a range of AFs; approximately 48 experimentally validated variants are shown in each AF bin. e, Comparison of experimental validation rates on WGS (BioData4) and WES (BioData5); DeepMosaic outperforms other computational pipelines.
