PLoS Genet. 2020 Jul 15;16(7):e1008922.
doi: 10.1371/journal.pgen.1008922. eCollection 2020 Jul.

Unified inference of missense variant effects and gene constraints in the human genome

Yi-Fei Huang

Abstract

A challenge in medical genomics is to identify variants and genes associated with severe genetic disorders. Based on the premise that severe, early-onset disorders often result in a reduction of evolutionary fitness, several statistical methods have been developed to predict pathogenic variants or constrained genes based on the signatures of negative selection in human populations. However, we currently lack a statistical framework to jointly predict deleterious variants and constrained genes from both variant-level features and gene-level selective constraints. Here we present such a unified approach, UNEECON, based on deep learning and population genetics. UNEECON treats the contributions of variant-level features and gene-level constraints as a variant-level fixed effect and a gene-level random effect, respectively. The sum of the fixed and random effects is then combined with an evolutionary model to infer the strength of negative selection at both variant and gene levels. Compared with previously published methods, UNEECON shows improved performance in predicting missense variants and protein-coding genes associated with autosomal dominant disorders, and feature importance analysis suggests that both gene-level selective constraints and variant-level predictors are important for accurate variant prioritization. Furthermore, based on UNEECON, we observe a low correlation between gene-level intolerance to missense mutations and that to loss-of-function mutations, which can be partially explained by the prevalence of disordered protein regions that are highly tolerant to missense mutations. Finally, we show that genes intolerant to both missense and loss-of-function mutations play key roles in the central nervous system and in autism spectrum disorders. Overall, UNEECON is a promising framework for both variant and gene prioritization.

Conflict of interest statement

The author has declared that no competing interests exist.

Figures

Fig 1. Overview of the UNEECON model.
UNEECON estimates negative selection on missense mutation i in gene j based on the relative probability of the occurrence of the missense mutation, η_ij, compared to the occurrence probability of neutral mutations, μ_ij. η_ij depends on the sum of a variant-level fixed effect, z_ij, and a gene-level random effect, u_j. We assume that z_ij captures the contribution of variant-level features, X_ij, to negative selection, and model the relationship between X_ij and z_ij with a feedforward neural network. We assume that u_j is a Gaussian random variable modeling the gene-level variation of selective constraints that cannot be predicted from variant features. The sum of z_ij and u_j is then sent to a logistic function to obtain η_ij. The neutral occurrence probability, μ_ij, is from a context-dependent mutation model trained on putatively neutral mutations. Free parameters of the UNEECON model are estimated by minimizing the discrepancy between the predicted occurrence probability, η_ij μ_ij, and the observed occurrence of each potential missense mutation in the gnomAD exome sequencing data [29].
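The structure described in this caption can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the author's implementation: the function names and toy numbers are hypothetical, the "discrepancy" is assumed here to be a Bernoulli negative log-likelihood, and the feedforward network and the treatment of the Gaussian random effect are omitted, so the fixed effect z and the random effect u are simply passed in as numbers.

```python
import numpy as np

def logistic(x):
    """Standard logistic function mapping real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def occurrence_probability(z, u, mu):
    """Predicted occurrence probability eta * mu for each potential mutation.

    z  : variant-level fixed effects (in the caption, a neural-network output)
    u  : gene-level random effect shared by all variants of one gene
    mu : neutral occurrence probabilities from a mutation model
    """
    eta = logistic(z + u)  # relative occurrence probability, between 0 and 1
    return eta * mu

def bernoulli_nll(p, observed):
    """Assumed discrepancy: negative log-likelihood of presence/absence data."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    return -np.sum(observed * np.log(p) + (1 - observed) * np.log(1 - p))

# Toy example for one hypothetical gene with three potential missense mutations.
z = np.array([2.0, 0.0, -3.0])      # hypothetical fixed effects
u = 0.5                             # hypothetical gene-level effect
mu = np.array([0.10, 0.20, 0.05])   # hypothetical neutral probabilities
observed = np.array([1, 0, 0])      # 1 = variant seen in the population sample

p = occurrence_probability(z, u, mu)
loss = bernoulli_nll(p, observed)
```

Because the logistic output is always below 1, the predicted probability never exceeds the neutral expectation; a strongly negative z + u corresponds to a mutation that is rarely observed relative to neutrality.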
Fig 2. Distributions of UNEECON scores across potential missense mutations.
(a) Distributions of UNEECON scores estimated for potential missense mutations in haploinsufficient (HI) genes [37], autosomal dominant disease genes [35, 36], autosomal recessive disease genes [35, 36], and olfactory receptor genes [45]. (b) Distributions of UNEECON scores estimated for potential missense mutations in various protein regions. The functional sites and protein secondary structures are based on UniProt annotations [47]. The predicted disordered protein regions are from MobiDB [48]. (c) Average UNEECON scores estimated for all codon positions in the CDKL5 protein. Each grey dot represents the UNEECON score averaged over all missense mutations in a codon position. Blue curve represents the locally estimated scatterplot smoothing (LOESS) fit. Blue and red dots represent pathogenic and benign missense variants from ClinVar [30], respectively. The horizontal line represents a constrained region reported in a previous study [25].
Fig 3. Predictive power of various methods for distinguishing pathogenic missense variants from benign missense variants.
(a) Performance in predicting autosomal dominant pathogenic variants from ClinVar [30]. True positive and false positive rates correspond to the fractions of pathogenic and benign variants exceeding various thresholds, respectively. AUC corresponds to the area under the receiver operating characteristic curve. (b) Enrichment of predicted deleterious de novo variants in individuals affected by developmental disorders [31]. The y-axis corresponds to the log2 odds ratio of the enrichment of predicted deleterious variants in the affected individuals for a given percentile threshold. The x-axis corresponds to the various percentile threshold values used in the enrichment analysis. Error bars represent the standard error of the log2 odds ratio.
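A log2 odds ratio and its standard error, as plotted in panel (b), can be computed from a 2x2 table of counts. The sketch below assumes the common Woolf (delta-method) standard error; the caption does not say which estimator the paper used, and the function name and counts are hypothetical.

```python
import math

def log2_odds_ratio(case_hit, case_miss, control_hit, control_miss):
    """Return (log2 odds ratio, standard error) for a 2x2 table.

    'hit' counts variants predicted deleterious at a given threshold;
    'miss' counts the remaining variants. All counts must be positive.
    """
    odds_ratio = (case_hit * control_miss) / (case_miss * control_hit)
    # Woolf standard error of the natural-log odds ratio (an assumption here)
    se_ln = math.sqrt(1 / case_hit + 1 / case_miss
                      + 1 / control_hit + 1 / control_miss)
    # Dividing by ln(2) converts the standard error to the log2 scale
    return math.log2(odds_ratio), se_ln / math.log(2)

# Hypothetical counts: 40 of 100 case variants and 20 of 100 control
# variants exceed the score threshold.
log2_or, se = log2_odds_ratio(40, 60, 20, 80)
```

A positive log2 odds ratio indicates that predicted deleterious variants are more common among affected individuals than among controls at that threshold.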
Fig 4. Predictive power of various methods for distinguishing disease and essential genes from genes not likely to have strong phenotypic effects.
(a) Performance in predicting autosomal dominant disease genes [35, 36]. (b) Performance in predicting haploinsufficient genes [37]. (c) Performance in predicting human orthologs of mouse essential genes [33, 34]. (d) Performance in predicting human essential genes in cell lines [32]. True positive and false positive rates correspond to the fractions of positive and negative genes exceeding various thresholds, respectively. AUC corresponds to the area under the receiver operating characteristic curve.
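The AUC reported in these panels equals the probability that a randomly chosen positive item scores higher than a randomly chosen negative one, with ties counted as one half. The following is a minimal pairwise sketch of that definition; the function name and scores are hypothetical, and it is not the evaluation code used in the paper.

```python
def auc_from_scores(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive item scores higher, counting ties as one half.
    """
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical gene-level scores for disease genes and background genes.
auc = auc_from_scores([0.9, 0.4], [0.5, 0.1])
```

The quadratic loop is fine for illustration; for large gene or variant sets a rank-based computation gives the same value more efficiently.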
Fig 5. Distributions of gene-level intolerance to missense and to loss-of-function mutations.
(a) Correlation between gene-level intolerance to missense mutations (UNEECON-G score) and that to loss-of-function (LOF) mutations (pLI score). Blue dots represent 956 genes intolerant to both missense and LOF mutations. Red dots represent 956 genes tolerant to missense but not to loss-of-function mutations. (b) Distribution of protein disorder content in the gene sets intolerant to loss-of-function mutations. (c) Enrichment of Reactome pathways in the gene set intolerant to both missense and loss-of-function mutations. The gene set tolerant to missense but not to loss-of-function mutations is used as a background. Only the highest-level Reactome terms from the PANTHER hierarchy view are included in the visualization. The term “unclassified” indicates that the corresponding genes have no known or inferred function. A fold enrichment below 1 indicates a depletion in the gene set intolerant to both missense and loss-of-function mutations, or equivalently, an enrichment in the gene set tolerant to missense but not to loss-of-function mutations. (d) Enrichment of autism genes in the gene sets intolerant to loss-of-function mutations. Error bars represent the standard error of the log2 odds ratio.

References

    1. Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genetics in Medicine. 2015;17(5):405–423. doi:10.1038/gim.2015.30
    2. Maxwell K, Hart S, Vijai J, Schrader K, Slavin T, Thomas T, et al. Evaluation of ACMG-Guideline-Based Variant Classification of Cancer Susceptibility and Non-Cancer-Associated Genes in Families Affected by Breast Cancer. The American Journal of Human Genetics. 2016;98(5):801–817. doi:10.1016/j.ajhg.2016.02.024
    3. Eilbeck K, Quinlan A, Yandell M. Settling the score: variant prioritization and Mendelian disease. Nature Reviews Genetics. 2017;18(10):599–612. doi:10.1038/nrg.2017.52
    4. Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Research. 2011;39(17):e118. doi:10.1093/nar/gkr407
    5. Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research. 2003;31(13):3812–3814. doi:10.1093/nar/gkg509
