Am J Hum Genet. 2022 Jan 6;109(1):33-49.

doi: 10.1016/j.ajhg.2021.12.001. Epub 2021 Dec 23.

Overcoming constraints on the detection of recessive selection in human genes from population frequency data

Daniel J Balick¹, Daniel M Jordan², Shamil Sunyaev³, Ron Do⁴

Affiliations

¹ Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
² The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
³ Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA. Electronic address: ssunyaev@rics.bwh.harvard.edu.
⁴ The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA. Electronic address: ron.do@mssm.edu.

PMID: 34951958
PMCID: PMC8764206
DOI: 10.1016/j.ajhg.2021.12.001

Daniel J Balick et al. Am J Hum Genet. 2022.


Abstract

The identification of genes that evolve under recessive natural selection is a long-standing goal of population genetics research that has important applications to the discovery of genes associated with disease. We found that commonly used methods to evaluate selective constraint at the gene level are highly sensitive to genes under heterozygous selection but ubiquitously fail to detect recessively evolving genes. Additionally, more sophisticated likelihood-based methods designed to detect recessivity similarly lack power for a human gene of realistic length from current population sample sizes. However, extensive simulations suggested that recessive genes may be detectable in aggregate. Here, we offer a method informed by population genetics simulations designed to detect recessive purifying selection in gene sets. Applying this to empirical gene sets produced significant enrichments for strong recessive selection in genes previously inferred to be under recessive selection in a consanguineous cohort and in genes involved in autosomal recessive monogenic disorders.

Keywords: constraint scores; genetic dominance; inference of selection; mode of inheritance; population genetics; recessive human genes; recessive selection; site frequency spectrum.

Conflict of interest statement

Declaration of interests: R.D. received grants from AstraZeneca and grants and non-financial support from Goldfinch Bio and is a scientific co-founder, equity holder, and consultant for Pensieve Health and a consultant for Variant Bio, all not related to this work. D.J.B., D.M.J., and S.S. declare no competing interests.

Figures

**Figure 1**
Enrichment for genes under selection with various per-gene metrics

(A and B) Points show enrichment for genes showing evidence of selection according to each per-gene metric, expressed as log odds ratio. Lines show 95% confidence intervals. The metrics shown include scores based on population constraint within humans (ratio of nonsynonymous to synonymous nucleotide diversity $\pi_{ns} / \pi_{s}$, number of nonsynonymous segregating sites, and constraint scores pLI, OE, RVIS, and $s_{het}$⁴⁰), scores based on conservation between species (dN/dS score, phastCons conservation score²²), and a hybrid score that includes both (McDonald-Kreitman neutrality index⁴²). For details of how these scores were processed, see material and methods.

(A) The putatively recessive ConsangBP gene set showed either a depletion for genes with evidence of selection or consistency with the genome average in all scores tested.

(B) The putatively non-recessive HI80 gene set showed either an enrichment for genes with evidence of selection or consistency with the genomic background in all scores tested.
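The log odds ratios and 95% confidence intervals in this figure can be illustrated with a small calculation. The sketch below is only an assumption about how such a quantity might be computed: it uses a Wald interval on a 2×2 table, whereas the caption does not state which interval method the authors used. The function name `log_odds_ratio_ci` and the counts are hypothetical.

```python
import math

def log_odds_ratio_ci(a, b, c, d, z=1.96):
    """Log odds ratio and Wald confidence interval for a 2x2 table.

    a: genes in the set that are flagged by the metric
    b: genes in the set that are not flagged
    c: background genes that are flagged
    d: background genes that are not flagged
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    # Haldane-Anscombe correction: add 0.5 to every cell if any cell is zero,
    # so that the logarithm and the standard error stay finite.
    if min(a, b, c, d) == 0:
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, log_or - z * se, log_or + z * se

# Hypothetical counts, not taken from the paper.
log_or, lower, upper = log_odds_ratio_ci(30, 270, 1500, 16500)
print(f"log OR = {log_or:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

A positive log odds ratio corresponds to enrichment and a negative one to depletion; an interval that spans zero is consistent with the genomic background.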

**Figure 2**
Power to correctly predict strong recessive selection in a single gene as a function of gene length

(A and B) Power to correctly identify simulated genes under strong recessive selection ($h = 0$, $s = -0.1$, red) is plotted as a function of gene length. False positive rates are shown for genes under additive selection with varying selection strengths: strong selection ($s = -0.1$, blue), weak selection ($s = -0.01$, orange and $s = -0.001$, green), and neutrality ($s = 0$, purple). Each curve shows the mean of all simulations, and the shaded area around each curve shows a 95% confidence interval calculated by bootstrap sampling. The dotted line indicates the median per-gene log target size for LOF and damaging mutations in ExAC NFE, and the dashed lines indicate the 2.5th and 97.5th percentiles such that 95% of genes lie within. For simplicity, all sites within each gene are assumed to have uniform dominance and selection coefficients, which may be a reasonable approximation within a single functional class (e.g., synonymous sites only, damaging sites only). n = 1,000 replicate simulations were performed with a demography inferred from the ExAC NFE sample.

(A) Power and false positive rates using the nested log likelihood ratio test to reject additivity of all selection strengths. Roughly 10,000 sites (corresponding to a mutational target size of $10^{-4}$) are needed to gain sufficient power. Virtually no genes in the human genome have LOF and damaging target sizes on this order (see Figure S1).

(B) Power and false positive rates using the srML test with the same likelihood function. Some power can be seen starting at roughly 300 sites (mutational target size of $3 \times 10^{-6}$). However, false positives persist at high fractions, particularly for additive variation under weak selection. Precise error rates for a set of genes depend both on the length distribution and on the distribution of diploid selection coefficients.
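Two quantities in this caption lend themselves to a short worked example: the conversion between number of sites and mutational target size, and the bootstrap confidence interval around a mean power estimate. The sketch below is illustrative only. The per-site rate of $10^{-8}$ is an assumption inferred from the caption's pairing of 10,000 sites with $10^{-4}$ (and 300 sites with $3 \times 10^{-6}$); the percentile bootstrap is one common choice and may differ from the authors' procedure. All names and the simulated outcomes are hypothetical.

```python
import random

# Assumption: per-site mutation rate implied by the caption's pairing of
# 10,000 sites with a target size of 1e-4 and 300 sites with 3e-6.
MU_PER_SITE = 1e-8

def target_size(n_sites, mu=MU_PER_SITE):
    """Mutational target size for a gene with n_sites selected sites."""
    return n_sites * mu

def bootstrap_mean_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Mean of outcomes with a percentile bootstrap confidence interval.

    outcomes: one value per replicate simulation, e.g. 1 if the gene was
    called strong recessive and 0 otherwise, so the mean is the power.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(outcomes)
    # Resample the replicates with replacement and record each resampled mean.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(outcomes) / n, lower, upper

# Hypothetical outcomes for 1,000 replicates; not results from the paper.
calls = [1] * 620 + [0] * 380
power, lower, upper = bootstrap_mean_ci(calls)
print(f"target size for 300 sites: {target_size(300):.1e}")
print(f"power = {power:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```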

**Figure 3**
Power to detect enrichment for genes under strong recessive selection

(A–C) Power of the srML test to detect the enrichment of genes under strong recessive selection is plotted as a function of gene set size and true odds ratio of the gene set for two different values of the parameter $f_{R}$, representing the fraction of the genome that is under strong recessive selection. Each curve shows the mean of all simulations, and the shaded area around each curve shows a 95% confidence interval calculated by bootstrap sampling. Qualitative features do not depend on the genome-wide prevalence of strong recessive selection, but the significance threshold is highly sensitive to this parameter. In all cases, the srML test was used to score individual genes and enrichment of genes predicted recessive was evaluated with a χ² contingency test.

(A) Power to detect enrichment of recessive selection for gene sets of various sizes for a simulated genome with 10% of genes under strong recessive selection.

(B) The comparable power plot for a simulated genome with 3% of genes under strong recessive selection.

(C) Estimated odds ratio versus simulated odds ratio for gene sets of size n = 300 with varying fractions of the genome under strong recessive selection. srML universally underestimates the true enrichment (slope $< 1$) but is a better estimate for larger recessive fractions of the background (an unknown quantity in humans). Gene set size affects the variance of this dependence but not the slope.
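The enrichment step named in this caption, a χ² contingency test on genes predicted recessive inside and outside a gene set, can be sketched in a few lines. This is a minimal illustration under stated assumptions: a 2×2 table, Pearson's statistic without continuity correction (the caption does not say whether a correction was applied), and a p value from the one-degree-of-freedom χ² distribution. The function name `chi2_2x2` and the counts are hypothetical, and the per-gene srML calls are taken as given inputs.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test for a 2x2 contingency table (no correction).

    a: genes in the set called strong recessive
    b: genes in the set not called strong recessive
    c: background genes called strong recessive
    d: background genes not called strong recessive
    Returns (chi2 statistic, p value with 1 degree of freedom).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square survival function is
    # erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts for a gene set of size 300; not results from the paper.
chi2, p_value = chi2_2x2(90, 210, 1500, 16500)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
```

Because the significance threshold is described as highly sensitive to the background fraction of recessive genes, the background row of the table matters as much as the gene set row when interpreting the result.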

**Figure 4**
Enrichment of literature-based gene sets for genes under strong recessive or strong additive natural selection

(A and B) Log odds ratio of enrichment for genes called “strong recessive” by the srML test (red) or genes called “strong additive” by the analogous test for strong additive selection (black) for various gene sets. Lines show 95% confidence intervals.

(A) ConsangBP contains genes inferred to be under recessive selection from a consanguineous British Pakistani population; HI80 contains genes with evidence of haploinsufficiency from an empirically derived haploinsufficiency score; HI20 contains genes with evidence against haploinsufficiency from the same empirically derived haploinsufficiency score; CGD and DD2GP AR and AD contain genes known to be implicated in autosomal recessive or autosomal dominant disease, respectively. CGD AR pediatric immune contains genes associated with pediatric onset immune system disease, severe allergy, and infection susceptibility.

(B) CGD ARHQ and ADHQ sets consist of the subset of CGD AR or AD genes harboring more than one variant annotated with a quality score of two stars (“multiple submitters, no conflicts”) or higher in ClinVar; CGD ARLQ and ADLQ consist of all genes in CGD AR or AD that do not meet this criterion; lethal ARHQ contains genes that are both confidently identified as causal for effectively lethal AR disease and pass the ClinVar quality filter.



Grants and funding

R01 HG010372/HG/NHGRI NIH HHS/United States
