Nat Genet. 2016 Dec;48(12):1455-1461.
doi: 10.1038/ng.3697. Epub 2016 Oct 24.

Genome-wide significance testing of variation from single case exomes

Amy B Wilfert et al. Nat Genet. 2016 Dec.

Abstract

Standard techniques from genetic epidemiology are ill-suited to formally assess the significance of variants identified from a single case. We developed a statistical inference framework for identifying unusual functional variation from a single exome or genome, what we refer to as the 'n-of-one' problem. Using this approach we assessed our ability to identify the causal genotypes in over 5 million simulated cases of Mendelian disease, identifying 39% of disease genotypes as the most damaging unit in a typical exome background. We applied our approach to 129 n-of-one families from the Undiagnosed Diseases Program, nominating 60% of 30 disease genes determined to be diagnostic by a standard clinical workup. Our method can currently produce well-calibrated P values when applied to single genomes, can facilitate integration of multiple data types for n-of-one analyses, and, with further work, could become a widely used epidemiological method like linkage analysis or genome-wide association analysis.

Conflict of interest statement

D.F.C is funded by a research contract with PierianDx to develop novel methods for clinical exome analysis.

Figures

Figure 1. Approach to the n-of-one problem
We break the n-of-one problem down into two parts: a, the variant annotation problem, which is to replace a generic “reference” and “alternate” allele labeling with a score that reflects the potential of each variant for damaging molecular function, and b, significance testing, which is to evaluate the likelihood that a pair of haplotypes observed in the n-of-one case contains the causal genotype(s), conditional on their variant annotations. Standard approaches in genetic epidemiology, such as GWAS and linkage analysis, typically evaluate the relative likelihood of the data under a null and an alternative model. In most rare disease situations, it is infeasible to define a likelihood function that reflects the likelihood of the data when sampled from the disease population (i.e., the alternative model). Thus, our approach is to simplify significance testing to evaluating the likelihood of the patient’s data under the null. c, To implement our approach, three test statistics are constructed for an annotated haplotype pair, one for each of three disease models: autosomal dominant, single-variant recessive and compound heterozygous. The test statistic for each disease model is then evaluated using gene-specific null models parameterized by population genetic data from over 61,000 exomes. Note that the linear depiction of alleles is not meant to imply that phase is known or required in the implementation of the method as described here. Integration of phase information should be helpful but would require a phased population reference to implement.
Figure 2. PSAP calibration
We calculated PSAP p-values for each disease model using exome data from three independent control cohorts (A, WHI; B, GTEx; C, Swedish population controls). For each combination of cohort and disease model, we have pooled all PSAP p-values from all individuals (e.g. n individuals × k genes = nk total PSAP p-values) and plotted these against quantiles of the uniform distribution. (A) Using data from the WHI, we compared p-value calibration from 189 European American (EA) individuals to that of 189 African American (AA) individuals for each of the three disease models evaluated in the main text. QQ-plots show that p-values from our null model are well calibrated when calculated using control exomes of European descent, but are sensitive to population structure. (B) QQ-plots for 418 European Americans sequenced by the GTEx project. (C) QQ-plots for 2,200 Swedish population controls. In each panel, the grey shaded area represents a conservative 95% CI for the expected distribution of p-values. These are computed from the beta distribution, assuming independent tests, and are dependent on the number of tests being plotted. For each combination of disease model and cohort, over 97% of PSAP p-values fall within a factor of 2 of these confidence intervals. The most poorly calibrated distribution was the AR model for Swedish population controls: the maximum deviation from the expected p-values reached one order of magnitude in the range of 10^-4. AD = autosomal dominant disease model; AR = autosomal recessive disease model; CHET = compound heterozygote disease model.
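The beta-distribution confidence band mentioned in this caption can be illustrated with a small sketch. It relies on the standard fact that the k-th smallest of n independent uniform p-values follows a Beta(k, n-k+1) distribution, and it builds a pointwise 95% interval for each rank. Whether the authors' "conservative" band was constructed in exactly this way is an assumption; the function names (`order_stat_cdf`, `order_stat_quantile`, `qq_band`) are hypothetical, and the binomial-sum CDF is only practical for small n.

```python
import math

def order_stat_cdf(x, k, n):
    """P(U_(k) <= x) for the k-th smallest of n independent Uniform(0,1) draws.

    This equals the Beta(k, n-k+1) CDF, written here as a binomial tail sum:
    at least k of the n draws must fall at or below x.
    """
    return sum(math.comb(n, j) * x**j * (1.0 - x)**(n - j)
               for j in range(k, n + 1))

def order_stat_quantile(q, k, n, tol=1e-12):
    """Invert the CDF above by bisection (the CDF is increasing in x)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if order_stat_cdf(mid, k, n) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def qq_band(n, level=0.95):
    """Return (expected, lower, upper) for each rank k = 1..n.

    'expected' is the mean of Beta(k, n-k+1), i.e. k / (n + 1);
    'lower' and 'upper' bound the central `level` interval (an assumption:
    a pointwise band, not a simultaneous one).
    """
    alpha = 1.0 - level
    band = []
    for k in range(1, n + 1):
        expected = k / (n + 1)
        lower = order_stat_quantile(alpha / 2.0, k, n)
        upper = order_stat_quantile(1.0 - alpha / 2.0, k, n)
        band.append((expected, lower, upper))
    return band
```

On a QQ-plot these values would typically be drawn on a -log10 scale, with observed sorted p-values compared against the band at each rank. For the pooled nk tests described in the caption, a library routine for beta quantiles would be needed in place of the binomial sum.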
Figure 3. Factors influencing PSAP
Three primary sources of information contribute to the performance of PSAP values: the use of gene-specific models, modeling of gene-specific singleton rates, and integration of frequency information from ExAC. (A) We illustrate the advantage of gene-specific models using the well-studied disease gene PTEN. Using 1 million simulated individuals, we constructed null models for the worst CADD score across the exome (black line) and within the gene PTEN (blue line). CADD scores for 223 known disease variants within the PTEN gene are indicated with black hashmarks along the x-axis. The significance of most known PTEN disease mutations is on the order of 0.01 when evaluated with the genome-wide null, but many have significance of 10^-6 when evaluated with a PTEN-specific null. (B) The probability of observing a heterozygous genotype with a CADD score greater than 5 or 35 is plotted against the gene-specific singleton rate for each gene in the genome. (C) The minimum observed PSAP value for a gene is constrained by the frequency of other apparently deleterious variants in that gene. For each of 396 autosomal recessive genes, we calculated the PSAP values for all homozygous mutations reported in HGMD. We then compared the minimum HGMD PSAP for each gene to the maximum ExAC frequency of an apparently deleterious variant in the same gene. Here “apparently deleterious” is defined as a variant with an equal or greater CADD score than the worst HGMD disease mutation for each gene.
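The contrast between an exome-wide and a gene-specific null in panel A can be sketched as an empirical tail probability: the fraction of simulated null individuals whose worst score is at least as extreme as the observed one. This is a simplified illustration, not the PSAP implementation. The function name `empirical_p`, the add-one correction (a common convention that keeps the estimate above zero), and all score values are assumptions made for the example.

```python
def empirical_p(observed_score, null_scores):
    """Estimate P(null score >= observed_score) from simulated null draws.

    The add-one correction is an assumption of this sketch, chosen so that
    the estimate is never exactly zero with a finite number of simulations.
    """
    exceed = sum(1 for s in null_scores if s >= observed_score)
    return (exceed + 1) / (len(null_scores) + 1)

# Hypothetical simulated nulls: the worst score anywhere in the exome is
# usually high, while the worst score inside a single constrained gene
# is usually low.
exome_wide_null = [28.0, 31.0, 33.0, 35.0, 36.0, 29.0, 34.0, 32.0, 37.0]
gene_specific_null = [0.0, 0.0, 3.0, 0.0, 12.0, 0.0, 0.0, 5.0, 0.0]

observed = 30.0  # hypothetical score of a patient variant in the gene
p_exome = empirical_p(observed, exome_wide_null)    # unremarkable exome-wide
p_gene = empirical_p(observed, gene_specific_null)  # unusual for this gene
```

The same observed score receives a much smaller p-value under the gene-specific null, which is the qualitative effect the caption describes for PTEN.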
Figure 4. Benchmarking our ability to identify the causal gene in simulated n-of-one cases
(A) We simulated 4,832,163 cases of recessive disease by adding pairs of heterozygous disease genotypes to the real exome genotype calls from healthy individuals (Methods). For each simulated disease exome, genes were rank ordered according to (i) PSAP p-value (CHET model), (ii) a gene-based statistic using CADD, or (iii) the combination of CHET PSAP p-value and CADD. We also plot the performance of each statistic when applied to the subset of exome variants with minor allele frequency < 1% in each simulated case (“+ AF”). The rank of the disease gene within the simulated case is shown on the y-axis, while the simulated disease genes are ranked against each other based on the CADD scores of their disease mutations (x-axis). Analogous plots for the autosomal dominant and autosomal recessive disease models are shown in Figure S10. (B) We constructed ROC curves that quantify the ability of PSAP-MIN to discriminate cases from controls using a set of 1,500 simulated cases and 1,500 ethnicity-matched controls. Here we show the ROC curves for PSAP-MIN when evaluating recessive disease mutations under the compound heterozygote disease model: the area under the curve was 0.76 when evaluating all genes in the genome (n=19,068 genes), 0.84 when evaluating only those genes with strong selective constraint (RVIS<0, n=8,146 genes) and 0.95 when restricting analysis to recessive disease genes in HGMD (n=507). Full results from all three disease models are in Table 2.
Figure 5. Application of PSAP to real cases of the n=1 problem
(A) In a previous cohort study examining homozygous-by-descent regions (HBDRs) in 300 cases of male infertility, we identified a case of uniparental isodisomy spanning all of chromosome 2 (histogram). Fewer than 10 cases of this cytogenetic anomaly had ever been reported, all of which were ascertained through studies of Mendelian disease. Inset: the B-allele frequency data from a SNP array run on this case. (B) PSAP analysis of exome sequencing data from this individual revealed an excess of low p-values on chromosome 2, and identified the most unusual genotype in this case to be a homozygous change in the gene INHBB, a biomarker of Sertoli cell function sometimes used in the clinical management of male infertility. Each point in the QQ-plot represents the test of a single gene; genes are colored by genomic location. (C) Enrichment of low PSAP p-values in the UDP families. For each disease model we identified PSAP p-value thresholds corresponding to a 5% false positive rate when applied to population controls (p = 1 × 10^-5 for autosomal dominant and autosomal recessive models, p = 2.25 × 10^-5 for CHET model). These barplots indicate the percent of UDP individuals with at least one PSAP p-value less than these thresholds. Results are stratified by disease model and gene set evaluated. Fisher exact tests were used to identify cohorts that were significantly enriched for individuals with low PSAP p-values compared to expectation. * = p-value < 0.05, ** = p-value < 0.01
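One way to read the threshold calibration in panel C is sketched below: take each control individual's smallest p-value and choose the cutoff so that at most 5% of controls have any p-value below it. This reading, the function names (`threshold_for_fpr`, `fraction_flagged`), and the example values are assumptions for illustration; the page does not give the authors' exact procedure.

```python
import math

def threshold_for_fpr(min_p_per_control, fpr=0.05):
    """Pick a cutoff t such that at most `fpr` of controls have min p < t.

    `min_p_per_control` holds one value per control individual: the smallest
    gene-level p-value observed in that individual.
    """
    if not 0.0 <= fpr < 1.0:
        raise ValueError("fpr must be in [0, 1)")
    ordered = sorted(min_p_per_control)
    allowed = math.floor(fpr * len(ordered))  # controls permitted below t
    return ordered[allowed]

def fraction_flagged(min_p_per_individual, threshold):
    """Fraction of individuals with at least one p-value below the cutoff."""
    flagged = sum(1 for p in min_p_per_individual if p < threshold)
    return flagged / len(min_p_per_individual)

# Hypothetical minimum p-values for eight controls, using a 25% rate so the
# arithmetic is easy to follow by hand.
controls = [0.8, 0.1, 0.5, 0.3, 0.9, 0.2, 0.7, 0.6]
cutoff = threshold_for_fpr(controls, fpr=0.25)
control_rate = fraction_flagged(controls, cutoff)
```

Applying `fraction_flagged` to a patient cohort with the same cutoff gives the kind of percentage shown in the barplots, which can then be compared with the control rate.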
Figure 6. PSAP facilitates integrative analysis of rare disease patients
(A) One of the strengths of the PSAP statistic is that, as a gene-based p-value, it can be easily combined with other gene-based measurements that may be informative about the functional relevance of a gene. We manually curated each disease diagnosis for the 30 diagnosed UDP cases, assigning a tissue label for the “primary” affected tissue in each case, and then combined gene expression p-values from the relevant tissue with the PSAP p-value for each gene in the genome, using Fisher’s method for data fusion (Supplementary Note). Here we plot the improvement in rank for each of 30 diagnostic genes when using this new “fused” p-value compared to the use of PSAP alone. (B) It should be feasible to generalize this integrative approach to many sources of molecular information on each gene in the genome. While the causal gene in a rare disease patient may not be definitively identified by a single measurement, joint analysis of a set of measurements may provide better resolution (red dots = measurements on the causal gene).
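Fisher's method, named in panel A, combines k independent p-values through the statistic X = -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null. The sketch below uses the closed-form chi-square tail for even degrees of freedom, so it needs only the standard library. The gene names and p-values are hypothetical, and the exact fusion procedure in the Supplementary Note is not reproduced on this page.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    For X = -2 * sum(ln p_i) with 2k degrees of freedom, the chi-square
    survival function has the closed form
        exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total

def rank_genes(psap_p, expression_p):
    """Order genes by fused p-value, smallest (most unusual) first."""
    fused = {g: fisher_combine([psap_p[g], expression_p[g]]) for g in psap_p}
    return sorted(fused, key=fused.get)

# Hypothetical inputs: GENE_B has the lower PSAP p-value, but GENE_A is also
# supported by expression in the affected tissue.
psap = {"GENE_A": 1e-3, "GENE_B": 1e-4}
expression = {"GENE_A": 1e-4, "GENE_B": 0.9}
ranking = rank_genes(psap, expression)
```

In this toy case the fused ranking places GENE_A ahead of GENE_B, which illustrates how a second gene-level measurement can change the rank of a candidate relative to PSAP alone.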
