Am J Hum Genet. 2022 Jan 6;109(1):33-49.
doi: 10.1016/j.ajhg.2021.12.001. Epub 2021 Dec 23.

Overcoming constraints on the detection of recessive selection in human genes from population frequency data

Daniel J Balick et al. Am J Hum Genet. 2022.

Abstract

The identification of genes that evolve under recessive natural selection is a long-standing goal of population genetics research that has important applications to the discovery of genes associated with disease. We found that commonly used methods to evaluate selective constraint at the gene level are highly sensitive to genes under heterozygous selection but ubiquitously fail to detect recessively evolving genes. Additionally, more sophisticated likelihood-based methods designed to detect recessivity similarly lack power for a human gene of realistic length from current population sample sizes. However, extensive simulations suggested that recessive genes may be detectable in aggregate. Here, we offer a method informed by population genetics simulations designed to detect recessive purifying selection in gene sets. Applying this to empirical gene sets produced significant enrichments for strong recessive selection in genes previously inferred to be under recessive selection in a consanguineous cohort and in genes involved in autosomal recessive monogenic disorders.

Keywords: constraint scores; genetic dominance; inference of selection; mode of inheritance; population genetics; recessive human genes; recessive selection; site frequency spectrum.

Conflict of interest statement

Declaration of interests: R.D. received grants from AstraZeneca and grants and non-financial support from Goldfinch Bio, and is a scientific co-founder, equity holder, and consultant for Pensieve Health and a consultant for Variant Bio, none of which are related to this work. D.J.B., D.M.J., and S.S. declare no competing interests.

Figures

Figure 1
Enrichment for genes under selection with various per-gene metrics. (A and B) Points show enrichment for genes showing evidence of selection according to each per-gene metric, expressed as log odds ratio. Lines show 95% confidence intervals. The metrics shown include scores based on population constraint within humans (ratio of nonsynonymous to synonymous nucleotide diversity πns/πs, number of nonsynonymous segregating sites, and constraint scores pLI, OE, RVIS, and shet [40]), scores based on conservation between species (dN/dS score, phastCons conservation score [22]), and a hybrid score that includes both (McDonald-Kreitman neutrality index [42]). For details of how these scores were processed, see material and methods. (A) The putatively recessive ConsangBP gene set showed either a depletion for genes with evidence of selection or consistency with the genome average in all scores tested. (B) The putatively non-recessive HI80 gene set showed either an enrichment for genes with evidence of selection or consistency with the genomic background in all scores tested.
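The log odds ratios and 95% confidence intervals described in this caption can be illustrated with a minimal sketch. It assumes a 2×2 table of gene counts and a Wald (normal-approximation) interval; the function name, the example counts, and the choice of interval are illustrative assumptions and not the authors' documented procedure.

```python
import math

def log_odds_ratio_ci(set_hit, set_miss, bg_hit, bg_miss, z=1.96):
    """Log odds ratio and Wald confidence interval for a 2x2 table.

    set_hit / set_miss: genes in the gene set that do / do not show
    evidence of selection under some per-gene metric.
    bg_hit / bg_miss: the same counts for background genes outside the set.
    All names and the Wald interval are assumptions made for illustration.
    """
    counts = (set_hit, set_miss, bg_hit, bg_miss)
    if min(counts) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    log_or = math.log((set_hit * bg_miss) / (set_miss * bg_hit))
    # Standard error of the log odds ratio: sqrt of summed reciprocal counts.
    se = math.sqrt(sum(1.0 / c for c in counts))
    return log_or, log_or - z * se, log_or + z * se

# Hypothetical counts, chosen only to exercise the function.
log_or, lower, upper = log_odds_ratio_ci(20, 80, 10, 90)
print(f"log OR = {log_or:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With these hypothetical counts the interval includes zero, which corresponds to what the caption calls consistency with the genome average; an interval entirely above or below zero would indicate enrichment or depletion.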
Figure 2
Power to correctly predict strong recessive selection in a single gene as a function of gene length. (A and B) Power to correctly identify simulated genes under strong recessive selection (h = 0, s = 0.1, red) is plotted as a function of gene length. False positive rates are shown for genes under additive selection with varying selection strengths: strong selection (s = 0.1, blue), weak selection (s = 0.01, orange and s = 0.001, green), and neutrality (s = 0, purple). Each curve shows the mean of all simulations, and the shaded area around each curve shows a 95% confidence interval calculated by bootstrap sampling. The dotted line indicates the median per-gene log target size for LOF and damaging mutations in ExAC NFE and the dashed lines indicate the 2.5th and 97.5th percentiles such that 95% of genes lie within. For simplicity, all sites within each gene are assumed to have uniform dominance and selection coefficients, which may be a reasonable approximation within a single functional class (e.g., synonymous sites only, damaging sites only). n = 1,000 replicate simulations were performed with a demography inferred from the ExAC NFE sample. (A) Power and false positive rates using the nested log likelihood ratio test to reject additivity of all selection strengths. Roughly 10,000 sites (corresponding to a mutational target size of 10⁻⁴) are needed to gain sufficient power. Virtually no genes in the human genome have LOF and damaging target sizes on this order (see Figure S1). (B) Power and false positive rates using the srML test with the same likelihood function. Some power can be seen starting at roughly 300 sites (mutational target size of 3×10⁻⁶). However, false positives persist at high fractions, particularly for additive variation under weak selection. Precise error rates for a set of genes depend both on the length distribution and on the distribution of diploid selection coefficients.
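The caption reports power as the mean over replicate simulations with a bootstrap 95% confidence interval. A minimal sketch of that kind of estimate follows, assuming each replicate yields a True/False outcome (test called the gene recessive or not) and a percentile bootstrap; the function name, the resampling count, and the example outcomes are hypothetical and the authors' exact bootstrap scheme is not specified here.

```python
import random

def bootstrap_power_ci(rejections, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for a rejection rate.

    rejections: list of booleans, one per replicate simulation.
    The percentile method and all parameter defaults are illustrative
    assumptions, not the procedure used in the paper.
    """
    rng = random.Random(seed)
    n = len(rejections)
    point = sum(rejections) / n
    # Resample replicates with replacement and recompute the rate each time.
    boot = sorted(
        sum(rng.choices(rejections, k=n)) / n for _ in range(n_boot)
    )
    lower = boot[int((alpha / 2) * n_boot)]
    upper = boot[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper

# Hypothetical outcomes for 1,000 replicates at a single gene length.
outcomes = [True] * 620 + [False] * 380
power, lower, upper = bootstrap_power_ci(outcomes)
print(f"power = {power:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Applying the same function to replicates simulated under additive selection or neutrality would give the false positive rates rather than power, since any "recessive" call is then an error.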
Figure 3
Power to detect enrichment for genes under strong recessive selection. (A–C) Power of the srML test to detect the enrichment of genes under strong recessive selection is plotted as a function of gene set size and true odds ratio of the gene set for two different values of the parameter fR, representing the fraction of the genome that is under strong recessive selection. Each curve shows the mean of all simulations, and the shaded area around each curve shows a 95% confidence interval calculated by bootstrap sampling. Qualitative features do not depend on the genome-wide prevalence of strong recessive selection, but the significance threshold is highly sensitive to this parameter. In all cases, the srML test was used to score individual genes and enrichment of genes predicted recessive was evaluated with a χ² contingency test. (A) Power to detect enrichment of recessive selection for gene sets of various sizes for a simulated genome with 10% of genes under strong recessive selection. (B) The comparable power plot for a simulated genome with 3% of genes under strong recessive selection. (C) Estimated odds ratio versus simulated odds ratio for gene sets of size n = 300 with varying fractions of the genome under strong recessive selection. srML universally underestimates the true enrichment (slope < 1) but is a better estimate for larger recessive fractions of the background (an unknown quantity in humans). Gene set size affects the variance of this dependence but not the slope.
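The enrichment step named in this caption, a χ² contingency test on genes called recessive inside versus outside a gene set, can be sketched as follows. This is a minimal illustration assuming a 2×2 table with no continuity correction; the function name and the example counts are hypothetical, and the per-gene srML calls are taken as given inputs.

```python
import math

def chi2_enrichment(set_called, set_total, bg_called, bg_total):
    """Pearson chi-square test on a 2x2 table (1 degree of freedom).

    set_called / set_total: genes called recessive and total genes in the set.
    bg_called / bg_total: the same for background genes outside the set.
    No continuity correction is applied; that choice is an assumption.
    """
    a, b = set_called, set_total - set_called
    c, d = bg_called, bg_total - bg_called
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: 30 of 100 set genes and 10 of 100 background genes.
chi2, p_value = chi2_enrichment(30, 100, 10, 100)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

The test itself is two-sided, so the direction of the effect (enrichment versus depletion) has to be read from the odds ratio rather than from the p-value.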
Figure 4
Enrichment of literature-based gene sets for genes under strong recessive or strong additive natural selection. (A and B) Log odds ratio of enrichment for genes called “strong recessive” by the srML test (red) or genes called “strong additive” by the analogous test for strong additive selection (black) for various gene sets. Lines show 95% confidence intervals. (A) ConsangBP contains genes inferred to be under recessive selection from a consanguineous British Pakistani population; HI80 contains genes with evidence of haploinsufficiency from an empirically derived haploinsufficiency score; HI20 contains genes with evidence against haploinsufficiency from the same empirically derived haploinsufficiency score; CGD and DD2GP AR and AD contain genes known to be implicated in autosomal recessive or autosomal dominant disease, respectively. CGD AR pediatric immune contains genes associated with pediatric onset immune system disease, severe allergy, and infection susceptibility. (B) CGD ARHQ and ADHQ sets consist of the subset of CGD AR or AD genes harboring more than one variant annotated with a quality score of two stars (“multiple submitters, no conflicts”) or higher in ClinVar; CGD ARLQ and ADLQ consist of all genes in CGD AR or AD that do not meet this criterion; lethal ARHQ contains genes that are both confidently identified as causal for effectively lethal AR disease and pass the ClinVar quality filter.

References

    1. Zhu Z., Bakshi A., Vinkhuyzen A.A., Hemani G., Lee S.H., Nolte I.M., van Vliet-Ostaptchouk J.V., Snieder H., Esko T., Milani L., et al. Dominance genetic variation contributes little to the missing heritability for human complex traits. Am. J. Hum. Genet. 2015;96:377–385. doi: 10.1016/j.ajhg.2015.01.001.
    2. Chong J.X., Buckingham K.J., Jhangiani S.N., Boehm C., Sobreira N., Smith J.D., Harrell T.M., McMillin M.J., Wiszniewski W., Gambin T., et al. The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. Am. J. Hum. Genet. 2015;97:199–215. doi: 10.1016/j.ajhg.2015.06.009.
    3. Turner T.N., Douville C., Kim D., Stenson P.D., Cooper D.N., Chakravarti A., Karchin R. Proteins linked to autosomal dominant and autosomal recessive disorders harbor characteristic rare missense mutation distribution patterns. Hum. Mol. Genet. 2015;24:5995–6002. doi: 10.1093/hmg/ddv309.
    4. Schriml L.M., Mitraka E., Munro J., Tauber B., Schor M., Nickle L., Felix V., Jeng L., Bearer C., Lichenstein R., et al. Human Disease Ontology 2018 update: classification, content and workflow expansion. Nucleic Acids Res. 2019;47(D1):D955–D962. doi: 10.1093/nar/gky1032.
    5. Veitia R.A., Caburet S., Birchler J.A. Mechanisms of Mendelian dominance. Clin. Genet. 2018;93:419–428. doi: 10.1111/cge.13107.
