PLoS One. 2018 Jul 6;13(7):e0200008.
doi: 10.1371/journal.pone.0200008. eCollection 2018.

Utilizing ExAC to assess the hidden contribution of variants of unknown significance to Sanfilippo Type B incidence

Wyatt T Clark et al. PLoS One. 2018.

Abstract

Given the large and expanding quantity of publicly available sequencing data, it should be possible to extract incidence information for monogenic diseases from allele frequencies, provided one knows which mutations are causal. We tested this idea on a rare, monogenic, lysosomal storage disorder, Sanfilippo Type B (Mucopolysaccharidosis type IIIB). Sanfilippo Type B is caused by mutations in the gene encoding α-N-acetylglucosaminidase (NAGLU). There were 189 NAGLU missense variants found in the ExAC dataset that comprises roughly 60,000 individual exomes. Only 24 of the 189 missense variants were known to be pathogenic; the remaining 165 variants were of unknown significance (VUS), and their potential contribution to disease is unknown. To address this problem, we measured enzymatic activities of 164 NAGLU missense VUS in the ExAC dataset and developed a statistical framework for estimating disease incidence with associated confidence intervals. We found that 25% of VUS decreased the activity of NAGLU to levels consistent with Sanfilippo Type B pathogenic alleles. We found that a substantial fraction of Sanfilippo Type B incidence (67%) could be accounted for by novel mutations not previously identified in patients, illustrating the utility of combining functional activity data for VUS with population-wide allele frequency data in estimating disease incidence.
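The core idea of the abstract, going from allele frequencies to an incidence figure, can be sketched as follows. This is a minimal illustration and not the authors' statistical framework: it assumes Hardy-Weinberg equilibrium for an autosomal recessive disorder, so incidence is the square of the combined pathogenic allele frequency. The function name and the example frequencies are hypothetical.

```python
from math import fsum

def incidence_from_allele_frequencies(freqs):
    """Expected incidence of an autosomal recessive disease.

    Assumption: Hardy-Weinberg equilibrium, so an affected individual
    carries two pathogenic alleles and incidence = q^2, where q is the
    sum of the pathogenic allele frequencies.
    """
    q = fsum(freqs)  # combined pathogenic allele frequency
    return q * q

# Hypothetical allele frequencies, for illustration only (not ExAC data).
example_freqs = [5e-4, 3e-4, 2e-4]
incidence = incidence_from_allele_frequencies(example_freqs)
one_in = round(1 / incidence)
print(f"q = {fsum(example_freqs):.4g}, incidence = 1 in {one_in:,}")
```

Adding a newly classified pathogenic variant to the list raises q and therefore the estimated incidence, which is why the classification of VUS matters for the final number.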

Conflict of interest statement

All authors are full time employees of BioMarin Pharmaceutical Inc. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Figures

Fig 1. Allele frequency based incidence estimates of lysosomal storage disorders in the ExAC non-Finnish population.
Starting with pathogenic (DM) variants in HGMD (orange), incidence was estimated using allele frequencies for each successive class of variant, combining mutations with all previous categories. Loss-of-function (LoF) variants (blue) were selected as those mutations causing a splice-affecting, stop-gained, or frameshift change in the coding sequence that had not been documented in HGMD. Likewise, variants of unknown significance (VUS) were selected as those missense mutations in each gene that had yet to be documented in HGMD. Vertical dashed black lines represent the incidence rate of each disorder reported in the literature for the European region: Sanfilippo Type B [9], MPS VI [10], Krabbe [11], Morquio A (average of the UK, Germany, and the Netherlands) [12], MPS I [13], MPS IIIA [9], and CLN2 (average of Sweden, Norway, Finland, Italy, Portugal, Netherlands, the Czech Republic, and Italy) [2].
Fig 2. The enzymatic activity of NAGLU variants.
(A) Variants are ordered by average %wt activity. Vertical bars show the standard deviation of %wt activity across replicates. Previously identified disease variants are shown in orange. A dashed line marks the 15%wt activity threshold at or below which variants are considered pathogenic. (Inset B) A box plot (y-axis log scale) of the Non-Finnish European allele frequency of variants with ≤15%wt activity (average allele frequency of 3.76 × 10^−5) and those with >15%wt activity (average allele frequency of 0.0014). The difference in the average allele frequency between the two groups was not statistically significant (p-value = 0.372, t-test).
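The ordering and thresholding described in this caption can be sketched as below. Only the 15%wt cutoff comes from the caption; the variant names and replicate values are hypothetical placeholders, and the study's actual replicate handling may differ.

```python
from statistics import mean, stdev

THRESHOLD_PCT_WT = 15.0  # cutoff stated in the Fig 2 caption

# Hypothetical replicate measurements (% of wild-type activity).
# Names and values are placeholders, not data from the study.
replicates = {
    "variant_C": [88.0, 95.5, 91.2],
    "variant_A": [2.1, 3.4, 2.8],
    "variant_B": [14.0, 16.5, 13.2],
}

def summarize(replicates, threshold=THRESHOLD_PCT_WT):
    """Return per-variant mean, SD, and low-activity flag, sorted by mean."""
    rows = []
    for name, values in replicates.items():
        avg = mean(values)
        rows.append({
            "variant": name,
            "mean": avg,
            "sd": stdev(values),           # sample standard deviation
            "low_activity": avg <= threshold,
        })
    return sorted(rows, key=lambda row: row["mean"])

summary = summarize(replicates)
for row in summary:
    flag = "<=15%wt" if row["low_activity"] else ">15%wt"
    print(f'{row["variant"]}: {row["mean"]:.1f} +/- {row["sd"]:.1f} ({flag})')
```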
Fig 3. The structural characteristics of NAGLU variants.
Mutations at N-linked glycosylation sites are shown in cyan, active-site pocket in red, disulfides in orange, and trimer interface in blue. (A) Mutations (spheres) with ≤15%wt activity and DM variants in HGMD (left) and >15%wt activity (right) mapped onto a NAGLU monomeric structure (PDB ID 4XWH). (B) The relative solvent exposure in the trimer NAGLU structure of variants with ≤15%wt activity and DM variants in HGMD compared to those with >15%wt activity. Variants with ≤15%wt activity and DM variants had an average percent solvent exposure of 5.96%. Variants with >15%wt activity had an average percent solvent exposure of 23.20%. The difference in the average solvent exposure between the two groups was statistically significant at the 5% significance level, with a p-value of 1.09 × 10^−12. (C) The distance to the active site in the NAGLU structure of variants with ≤15%wt activity and DM variants in HGMD compared to those with >15%wt activity. Variants with ≤15%wt activity and DM variants had an average distance to the active site of 24.03 Å. Variants with >15%wt activity had an average distance to the active site of 27.23 Å. The difference in the average distance to the active site between the two groups was statistically significant at the 5% significance level, with a p-value of 0.001.
Fig 4. The statistical power of ExAC.
(A) Blue lines show the percent margin of error when calculating incidence using the overall sample size of ExAC given a 95% confidence interval. Red lines show the percent margin of error when calculating incidence using the sample size of the Non-Finnish European cohort in ExAC given a 95% confidence interval. Upper and lower limits of the confidence intervals are shown as dotted and solid lines, respectively. The vertical grey dashed line represents the 1 in 321,128 Sanfilippo Type B incidence rate estimated by Heron et al. [9]. (B) The number of individuals that must be sequenced for the lower critical value to represent a 20% margin of error, given a 95% confidence interval, for a range of estimated incidence values.
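One way to reason about the two panels is sketched below. It is an illustration under stated assumptions, not the calculation used for the figure: incidence is taken as q^2 (Hardy-Weinberg), the sampling error of q over 2N chromosomes is modeled with a normal approximation, and the interval for q is squared to give an interval for incidence. The function names are hypothetical; the 1 in 321,128 incidence and the sample size of roughly 60,000 exomes are taken from the text.

```python
from math import ceil, sqrt

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal

def incidence_margin(incidence, n_individuals, z=Z_95):
    """Percent margin of error (lower, upper) on an incidence estimate."""
    q = sqrt(incidence)                            # allele frequency, q^2 = incidence
    se = sqrt(q * (1 - q) / (2 * n_individuals))   # 2N chromosomes sampled
    lower_q = max(q - z * se, 0.0)
    upper_q = q + z * se
    lower_pct = (lower_q ** 2 - incidence) / incidence * 100
    upper_pct = (upper_q ** 2 - incidence) / incidence * 100
    return lower_pct, upper_pct

def required_individuals(incidence, target_pct=20.0, z=Z_95):
    """Individuals needed so the lower limit is within target_pct of incidence."""
    q = sqrt(incidence)
    # A relative error d on q gives a relative error 1 - (1 - d)^2 on q^2.
    d = 1 - sqrt(1 - target_pct / 100)
    return ceil(z * z * (1 - q) / (2 * q * d * d))

reported = 1 / 321_128  # incidence quoted in the caption
exac_lower, exac_upper = incidence_margin(reported, 60_000)
needed = required_individuals(reported)
print(f"margin at 60,000 individuals: {exac_lower:.1f}% / +{exac_upper:.1f}%")
print(f"individuals needed for a 20% lower margin: {needed:,}")
```

Because the interval is built on q and then squared, the upper margin is wider than the lower one, which is consistent with the caption plotting the two limits separately.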
Fig 5. Allele frequency based Sanfilippo Type B incidence estimates.
(A) The impact of each category of variants on our estimate of Sanfilippo Type B incidence. The contribution of HGMD variants, sorted by allele frequency, is shown in orange. The contribution of LoF variants that have yet to be documented in patients, also sorted by allele frequency, is shown in light blue. The contribution of VUS with %wt activity values below 15% is shown in grey. VUS are sorted by %wt values. Percentages in parentheses represent the contribution of each category of variants to our final incidence estimate. (B) Confidence interval calculations. Grey bars represent the distribution of incidence rates observed through bootstrapping simulation. The orange solid line represents the distribution of incidence rates as modeled using the beta distribution. Using bootstrapping simulation, we observed that 95% of simulated incidence values fell in the range of 1 in 558,306 to 1 in 241,749 for an equal-tailed interval (vertical black lines). In comparison, using the beta distribution we estimated that 95% of incidence values would fall in the range of 1 in 566,863 to 1 in 243,753 (vertical dashed orange lines). Using the normal approximation we estimated that 95% of incidence values would fall in the range of 1 in 610,093 to 1 in 250,830 (vertical dashed blue lines).
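The beta-distribution and normal-approximation intervals in panel B can be compared with a short sketch. It rests on assumptions that are not from the paper: the combined pathogenic allele count k among n chromosomes is treated as a single binomial sample, q is drawn from Beta(k + 1, n − k + 1), and incidence is q^2. The values of k and n are hypothetical placeholders, and the bootstrap procedure itself is not reproduced here.

```python
import random
from math import sqrt

def beta_interval(k, n, draws=20_000, level=0.95, seed=0):
    """Equal-tailed interval for incidence q^2 with q ~ Beta(k + 1, n - k + 1)."""
    rng = random.Random(seed)
    sims = sorted(rng.betavariate(k + 1, n - k + 1) ** 2 for _ in range(draws))
    tail = (1 - level) / 2
    lower = sims[int(tail * draws)]
    upper = sims[int((1 - tail) * draws) - 1]
    return lower, upper

def normal_interval(k, n, z=1.959964):
    """Interval for q^2 from a normal approximation to the error in q."""
    q = k / n
    se = sqrt(q * (1 - q) / n)
    return max(q - z * se, 0.0) ** 2, (q + z * se) ** 2

# Hypothetical counts: 110 pathogenic alleles among 66,000 chromosomes.
k, n = 110, 66_000
beta_lo, beta_hi = beta_interval(k, n)
norm_lo, norm_hi = normal_interval(k, n)
for label, (lo, hi) in {"beta": (beta_lo, beta_hi), "normal": (norm_lo, norm_hi)}.items():
    # A lower incidence corresponds to a larger "1 in N" denominator.
    print(f"{label}: 1 in {round(1 / lo):,} to 1 in {round(1 / hi):,}")
```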
Fig 6. Performance of in silico predictors.
(A) A comparison of the observed enzymatic activity for each variant, sorted from lowest to highest %wt activity, with the binary predictions from PolyPhen and SIFT, shown on the third row. A red vertical line represents the division between missense mutations with ≤15%wt activity and those with >15%wt activity. (B) A Venn diagram showing the agreement between VUS observed to have ≤15%wt activity in our enzymatic activity assay and VUS categorized as “deleterious” by SIFT or “probably damaging” by PolyPhen. (C) Non-Finnish European incidence estimates obtained when considering only HGMD and LoF mutations (1 in 1,091,549), and when combining HGMD+LoF variants with VUS with ≤15%wt activity (1 in 355,502), VUS categorized as “deleterious” by SIFT (1 in 255,344), VUS categorized as “probably damaging” by PolyPhen (1 in 288,327), or VUS categorized as both “deleterious” by SIFT and “probably damaging” by PolyPhen (1 in 438,641). VUS with an allele frequency in ExAC greater than 0.1%, or for which one or more homozygous individuals were observed, were excluded from estimates using SIFT and PolyPhen.
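To put the panel C estimates on a common scale, the short calculation below expresses each as a fold change over the HGMD+LoF baseline and as the combined allele frequency it implies. The "1 in N" figures are copied from the caption; the conversion to an allele frequency assumes incidence = q^2 (Hardy-Weinberg), and the dictionary labels are informal shorthand.

```python
from math import sqrt

# "1 in N" incidence estimates quoted in the Fig 6C caption.
estimates = {
    "HGMD+LoF": 1_091_549,
    "+ VUS <=15%wt activity": 355_502,
    "+ VUS SIFT deleterious": 255_344,
    "+ VUS PolyPhen probably damaging": 288_327,
    "+ VUS SIFT and PolyPhen": 438_641,
}

def implied_allele_frequency(one_in_n):
    """Combined pathogenic allele frequency q implied by incidence = q^2."""
    return sqrt(1 / one_in_n)

baseline = estimates["HGMD+LoF"]
# Incidence is 1/N, so the fold change in incidence is baseline_N / N.
fold_change = {label: baseline / n for label, n in estimates.items()}

for label, n in estimates.items():
    q = implied_allele_frequency(n)
    print(f"{label}: 1 in {n:,}, q = {q:.5f}, {fold_change[label]:.2f}x baseline")
```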

References

    1. Moyer VA, Calonge N, Teutsch SM, Botkin JR. Expanding newborn screening: process, policy, and priorities. Hastings Center Report. 2008;38(3):32–39. doi: 10.1353/hcr.0.0011
    2. Sleat DE, Gedvilaite E, Zhang Y, Lobel P, Xing J. Analysis of large-scale whole exome sequencing data to determine the prevalence of genetically-distinct forms of neuronal ceroid lipofuscinosis. Gene. 2016;593(2):284–291. doi: 10.1016/j.gene.2016.08.031
    3. Schrodi SJ, DeBarber A, He M, Ye Z, Peissig P, Van Wormer JJ, et al. Prevalence estimation for monogenic autosomal recessive diseases using population-based genetic data. Human Genetics. 2015;134(6):659–669. doi: 10.1007/s00439-015-1551-8
    4. Hopp K, Cogal AG, Bergstralh EJ, Seide BM, Olson JB, Meek AM, et al. Phenotype-genotype correlations and estimated carrier frequencies of primary hyperoxaluria. Journal of the American Society of Nephrology. 2015; p. JASN-2014070698.
    5. Crow JF, Kimura M. An introduction to population genetics theory. 1970.
