Bioinformatics. 2014 May 1;30(9):1198-204.
doi: 10.1093/bioinformatics/btt750. Epub 2014 Jan 16.

Subclonal variant calling with multiple samples and prior knowledge

Moritz Gerstung et al. Bioinformatics.

Abstract

Motivation: Targeted resequencing of cancer genes in large cohorts of patients is important to understand the biological and clinical consequences of mutations. Cancers are often clonally heterogeneous, and the detection of subclonal mutations is important from a diagnostic point of view, but presents strong statistical challenges.

Results: Here we present a novel statistical approach for calling mutations from large cohorts of deeply resequenced cancer genes. These data allow for precisely estimating local error profiles and enable detecting mutations with high sensitivity and specificity. Our probabilistic method incorporates knowledge about the distribution of variants in terms of a prior probability. We show that our algorithm has a high accuracy of calling cancer mutations and demonstrate that the detected clonal and subclonal variants have important prognostic consequences.
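
As a rough illustration of how a prior probability can enter a variant call, the sketch below combines a Bayes factor with a prior π using Bayes' rule. It is a minimal sketch, not the paper's Equation (9): the function name `posterior_null`, the direction of the Bayes factor (evidence for "no variant" over "variant") and the example numbers are all assumptions made for this example.

```python
def posterior_null(bayes_factor: float, prior_variant: float) -> float:
    """Posterior probability that a site carries NO true variant.

    bayes_factor  -- assumed here to be P(data | no variant) / P(data | variant),
                     so small values favour the variant (an assumption of this sketch)
    prior_variant -- prior probability pi that a true variant is present
    """
    if bayes_factor < 0.0:
        raise ValueError("bayes_factor must be non-negative")
    if not 0.0 < prior_variant < 1.0:
        raise ValueError("prior_variant must lie strictly between 0 and 1")
    # Prior odds of the null hypothesis, updated by the evidence.
    prior_odds_null = (1.0 - prior_variant) / prior_variant
    posterior_odds_null = bayes_factor * prior_odds_null
    # Convert odds back to a probability.
    return posterior_odds_null / (1.0 + posterior_odds_null)


# Same evidence, two hypothetical priors: a frequently mutated site and a rare one.
evidence = 1e-3
hotspot = posterior_null(evidence, prior_variant=0.5)
rare_site = posterior_null(evidence, prior_variant=1e-4)
```

With these made-up numbers the posterior of the null is below 0.01 at the site with the high prior and above 0.9 at the site with the low prior, so identical read evidence would be called in one place and rejected in the other.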

Figures

Fig. 1. General illustration of our approach. (a) Distribution of observed and expected VAFs across samples. The histograms denote the VAF [formula] and [formula] of a recurrent artifact occurring at low frequencies in ∼20% of the samples in the forward, but not in the reverse, orientation. The solid lines denote the expected distribution based on a beta-binomial model, Equation (1), with mean [formula] and [formula] defined as the average across all samples with VAF [formula]. The third histogram denotes the SF3B1 K700E variant present at clonal and subclonal frequencies, with the curve denoting the expected frequency distribution. (b) Heatmap of 1000 nt from five adjacent bait sets targeting the SF3B1 gene in 683 samples. The intensity of each pixel represents the VAF of cytosine, [formula], in a given sample (y, left axis) and position (x). If the relative frequency is identical, pixels tend to be black. Curves at the bottom indicate the error rates [formula] and [formula] in forward and reverse directions (right y-axis). The black line is the estimated dispersion ρ. The prior π of finding a true variant is derived from the COSMIC database. Circles are drawn around variants with a posterior [formula]; the area of each circle is proportional to the VAF. The K700E hotspot mutation, with many variant calls, lies at position 650. (c–f) Bayes factors [Equation (7)] as a function of forward (x) and reverse (y) allele counts for different error rates [formula] and dispersions [formula]. (g) A variant-specific prior π influences the Bayes factor needed to call a variant at a given cutoff on the posterior probability, Equation (9).
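
The beta-binomial model mentioned in panel (a) can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's Equation (1): it uses the common mean/dispersion parametrisation α = μ(1 − ρ)/ρ, β = (1 − μ)(1 − ρ)/ρ, and the function names, coverage, error rate and dispersion values are hypothetical.

```python
import math


def betabinom_logpmf(k: int, n: int, mean: float, rho: float) -> float:
    """Log-probability of k variant reads out of n under a beta-binomial.

    mean -- expected variant allele frequency (0 < mean < 1)
    rho  -- dispersion (0 < rho < 1); rho -> 0 approaches a plain binomial
    """
    # Mean/dispersion parametrisation (an assumption of this sketch).
    a = mean * (1.0 - rho) / rho
    b = (1.0 - mean) * (1.0 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_beta_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def tail_probability(k_min: int, n: int, mean: float, rho: float) -> float:
    """P(X >= k_min): chance of seeing at least k_min variant reads from errors alone."""
    return sum(math.exp(betabinom_logpmf(k, n, mean, rho)) for k in range(k_min, n + 1))


# Hypothetical site: coverage 1000, error rate 0.001, two dispersion values.
tail_low_dispersion = tail_probability(10, 1000, 0.001, 1e-4)
tail_high_dispersion = tail_probability(10, 1000, 0.001, 1e-2)
```

Under these assumed values, ten or more variant reads are far more plausible as pure error when the dispersion is large, which is why a locally estimated dispersion matters for separating recurrent artifacts from low-frequency variants.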
Fig. 2. Variant calling in control data. (a) Power (true-positive rate) of detecting variants with different frequency and coverage for fixed dispersion ρ. (b) Power of detecting variants when ρ is estimated from the data using a VAF cutoff of 0.1. (c) AUC as a function of cohort size for different variant allele frequencies. The two lines for each VAF refer to the case [formula] and to the case [formula], respectively. (d) Specificity of different algorithms on 32 normal control samples. (e) Scatterplot of Bayes factors for 20 replicates. Colors denote variants meeting a posterior threshold of 0.5 in only one of the two replicates. Open circles are known polymorphisms. (f) Concordance of variant calls as a function of the posterior cutoff. Filled segments show the number of variants called in either of the two replicates (top and bottom; left axis) and the overlapping fraction (middle) when a given posterior cutoff is applied. The black line (right axis) shows the relative proportion of overlapping to total calls.
Fig. 3. Variants in MDS. (a) Number of non-polymorphic variant calls versus cutoff P0 and prior weight [formula]. (b) Ratio of non-silent to silent variant calls. (c) Venn diagram of the distribution of shearwater variants across a normal panel, known SNPs and COSMIC variants. (d) Distribution of variant allele frequencies. (e) Venn diagram of calls from different algorithms. (f) Number of SF3B1 K700E calls as a function of false positives for different variant callers.
Fig. 4. Prognostic effect of different variant callers. (a–d) The fraction of AML-free patients (events are either death or AML transformation) versus the time in months after sampling. Patients are split into groups depending on whether the patient has a non-silent mutation in the given gene, found exclusively by Caveman, by shearwater only or by both. The gray line denotes patients with no mutations. P-values in the caption are from a log-rank test against the wild-type group; C is the corresponding C-statistic. While the Kaplan–Meier curves and N refer to the fraction of patients exclusive to each method, P and C include the joint cases. (e) C-statistic for shearwater for different parameters. (f) C-statistic under permutation tests shuffling all calls in the set of variants exclusive to one variant caller. (g) C-statistic for different AND combinations of genotypes. (h) C-statistic for different OR combinations of genotypes.
