Nature. 2024 Jan;625(7993):92-100.

doi: 10.1038/s41586-023-06045-0. Epub 2023 Dec 6.

A genomic mutational constraint map using variation in 76,156 human genomes

Siwei Chen^#^{1,2}, Laurent C Francioli^#^{3,4}, Julia K Goodrich³, Ryan L Collins^{3,5,6}, Masahiro Kanai^{3,4}, Qingbo Wang^{3,7}, Jessica Alföldi^{3,4}, Nicholas A Watts^{3,4}, Christopher Vittal^{3,4}, Laura D Gauthier⁸, Timothy Poterba^{3,4,9}, Michael W Wilson^{3,4}, Yekaterina Tarasova³, William Phu^{3,10}, Riley Grant³, Mary T Yohannes³, Zan Koenig^{4,9}, Yossi Farjoun¹¹, Eric Banks⁸, Stacey Donnelly¹², Stacey Gabriel¹³, Namrata Gupta^{3,13}, Steven Ferriera¹³, Charlotte Tolonen⁸, Sam Novod⁸, Louis Bergelson⁸, David Roazen⁸, Valentin Ruano-Rubio⁸, Miguel Covarrubias⁸, Christopher Llanwarne⁸, Nikelle Petrillo⁸, Gordon Wade⁸, Thibault Jeandet⁸, Ruchi Munshi⁸, Kathleen Tibbetts⁸; Genome Aggregation Database Consortium; Anne O'Donnell-Luria^{3,5,10}, Matthew Solomonson^{3,4}, Cotton Seed^{4,9}, Alicia R Martin^{3,4,9}, Michael E Talkowski^{3,5,9}, Heidi L Rehm^{3,5}, Mark J Daly^{3,4,14}, Grace Tiao^{3,4}, Benjamin M Neale^#^{3,4}, Daniel G MacArthur^#^{3,15,16}, Konrad J Karczewski^{17,18,19}

Collaborators

Genome Aggregation Database Consortium:
Maria Abreu, Carlos A Aguilar Salinas, Tariq Ahmad, Christine M Albert, Diego Ardissino, Irina M Armean, Elizabeth G Atkinson, Gil Atzmon, John Barnard, Samantha M Baxter, Laurent Beaugerie, Emelia J Benjamin, David Benjamin, Michael Boehnke, Lori L Bonnycastle, Erwin P Bottinger, Donald W Bowden, Matthew J Bown, Harrison Brand, Steven Brant, Ted Brookings, Sam Bryant, Sarah E Calvo, Hannia Campos, John C Chambers, Juliana C Chan, Katherine R Chao, Sinéad Chapman, Daniel I Chasman, Rex Chisholm, Judy Cho, Rajiv Chowdhury, Mina K Chung, Wendy K Chung, Kristian Cibulskis, Bruce Cohen, Kristen M Connolly, Adolfo Correa, Beryl B Cummings, Dana Dabelea, John Danesh, Dawood Darbar, Phil Darnowsky, Joshua Denny, Ravindranath Duggirala, Josée Dupuis, Patrick T Ellinor, Roberto Elosua, James Emery, Eleina England, Jeanette Erdmann, Tõnu Esko, Emily Evangelista, Diane Fatkin, Jose Florez, Andre Franke, Jack Fu, Martti Färkkilä, Kiran Garimella, Jeff Gentry, Gad Getz, David C Glahn, Benjamin Glaser, Stephen J Glatt, David Goldstein, Clicerio Gonzalez, Leif Groop, Sanna Gudmundsson, Andrea Haessly, Christopher Haiman, Ira Hall, Craig L Hanis, Matthew Harms, Mikko Hiltunen, Matti M Holi, Christina M Hultman, Chaim Jalas, Mikko Kallela, Diane Kaplan, Jaakko Kaprio, Sekar Kathiresan, Eimear E Kenny, Bong-Jo Kim, Young Jin Kim, Daniel King, George Kirov, Jaspal Kooner, Seppo Koskinen, Harlan M Krumholz, Subra Kugathasan, Soo Heon Kwak, Markku Laakso, Nicole Lake, Trevyn Langsford, Kristen M Laricchia, Terho Lehtimäki, Monkol Lek, Emily Lipscomb, Ruth J F Loos, Wenhan Lu, Steven A Lubitz, Teresa Tusie Luna, Ronald C W Ma, Gregory M Marcus, Jaume Marrugat, Kari M Mattila, Steven McCarroll, Mark I McCarthy, Jacob L McCauley, Dermot McGovern, Ruth McPherson, James B Meigs, Olle Melander, Andres Metspalu, Deborah Meyers, Eric V Minikel, Braxton D Mitchell, Vamsi K Mootha, Aliya Naheed, Saman Nazarian, Peter M Nilsson, Michael C O'Donovan, Yukinori Okada, Dost Ongur, Lorena Orozco, 
Michael J Owen, Colin Palmer, Nicholette D Palmer, Aarno Palotie, Kyong Soo Park, Carlos Pato, Ann E Pulver, Dan Rader, Nazneen Rahman, Alex Reiner, Anne M Remes, Dan Rhodes, Stephen Rich, John D Rioux, Samuli Ripatti, Dan M Roden, Jerome I Rotter, Nareh Sahakian, Danish Saleheen, Veikko Salomaa, Andrea Saltzman, Nilesh J Samani, Kaitlin E Samocha, Alba Sanchis-Juan, Jeremiah Scharf, Molly Schleicher, Heribert Schunkert, Sebastian Schönherr, Eleanor G Seaby, Svati H Shah, Megan Shand, Ted Sharpe, Moore B Shoemaker, Tai Shyong, Edwin K Silverman, Moriel Singer-Berk, Pamela Sklar, Jonathan T Smith, J Gustav Smith, Hilkka Soininen, Harry Sokol, Rachel G Son, Jose Soto, Tim Spector, Christine Stevens, Nathan O Stitziel, Patrick F Sullivan, Jaana Suvisaari, E Shyong Tai, Kent D Taylor, Yik Ying Teo, Ming Tsuang, Tiinamaija Tuomi, Dan Turner, Teresa Tusie-Luna, Erkki Vartiainen, Marquis Vawter, Lily Wang, Arcturus Wang, James S Ware, Hugh Watkins, Rinse K Weersma, Ben Weisburd, Maija Wessman, Nicola Whiffin, James G Wilson, Ramnik J Xavier

Affiliations

¹ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. siwei@broadinstitute.org.
² Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA. siwei@broadinstitute.org.
³ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁴ Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.
⁵ Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
⁶ Division of Medical Sciences, Harvard Medical School, Boston, MA, USA.
⁷ Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, Japan.
⁸ Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁹ Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
¹⁰ Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA, USA.
¹¹ Richards Lab, Lady Davis Institute, Montreal, Quebec, Canada.
¹² Broad Institute of MIT and Harvard, Cambridge, MA, USA.
¹³ Broad Genomics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
¹⁴ Institute for Molecular Medicine Finland (FIMM), Helsinki, Finland.
¹⁵ Centre for Population Genomics, Garvan Institute of Medical Research and UNSW Sydney, Sydney, New South Wales, Australia.
¹⁶ Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Victoria, Australia.
¹⁷ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. konradk@broadinstitute.org.
¹⁸ Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA. konradk@broadinstitute.org.
¹⁹ Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA. konradk@broadinstitute.org.

^# Contributed equally.

PMID: 38057664
PMCID: PMC11629659
DOI: 10.1038/s41586-023-06045-0

Siwei Chen et al. Nature. 2024 Jan.

Erratum in

Author Correction: A genomic mutational constraint map using variation in 76,156 human genomes.
Chen S, Francioli LC, Goodrich JK, Collins RL, Kanai M, Wang Q, Alföldi J, Watts NA, Vittal C, Gauthier LD, Poterba T, Wilson MW, Tarasova Y, Phu W, Grant R, Yohannes MT, Koenig Z, Farjoun Y, Banks E, Donnelly S, Gabriel S, Gupta N, Ferriera S, Tolonen C, Novod S, Bergelson L, Roazen D, Ruano-Rubio V, Covarrubias M, Llanwarne C, Petrillo N, Wade G, Jeandet T, Munshi R, Tibbetts K; Genome Aggregation Database Consortium; O'Donnell-Luria A, Solomonson M, Seed C, Martin AR, Talkowski ME, Rehm HL, Daly MJ, Tiao G, Neale BM, MacArthur DG, Karczewski KJ. Nature. 2024 Feb;626(7997):E1. doi: 10.1038/s41586-024-07050-7. PMID: 38225470. No abstract available.

Abstract

The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders^1-4, but attempts to assess constraint for non-protein-coding regions have proved more difficult. Here we aggregate, process and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome allele frequency reference dataset, and use it to build a genomic constraint map for the whole genome (genomic non-coding constraint of haploinsufficient variation (Gnocchi)). We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation. As expected, the average constraint for protein-coding sequences is stronger than that for non-coding regions. Within the non-coding genome, constrained regions are enriched for known regulatory elements and variants that are implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that non-coding constraint can aid the identification of constrained genes that are as yet unrecognized by current gene constraint metrics. We demonstrate that this genome-wide constraint map improves the identification and interpretation of functional human genetic variation.

Figures

**Extended Data Fig. 1:**
Construction of mutational model and Gnocchi score. a,b, Estimation of trinucleotide context-specific mutation rates. The proportion of possible variants observed for each substitution and context in 76,156 gnomAD genomes (y-axis) is exponentially correlated with the absolute mutation rate estimated from 1,000 downsampled genomes (x-axis). Fit lines were modeled separately for human autosomes (a) and chromosome X (b). c, Estimation of the effects of regional genomic features on mutation rates. The effects of 13 genomic features at four scales (window sizes 1kb-1Mb; x-axis) on the mutation rate of 32 trinucleotide contexts (y-axis) are shown, colored by the coefficient from regressing *de novo* mutations (DNMs) on each specific feature and window size. Red/Blue color indicates a positive/negative effect of increasing the feature value on mutation rates; grey crosses indicate significant features at the smallest possible window size after Bonferroni correction for 13×4=52 tests. Abbreviations: LCR=low-complexity region, SINE/LINE=short/long interspersed nuclear element, Dist=Distance, Recomb=Recombination, Methyl=Methylation. d,e, The distribution of Gnocchi score as a function of expected and observed variation. Each point represents the Gnocchi score of a 1kb window on the genome (N=1,984,900 on autosomes (d) and N=57,729 on chromosome X (e)), which quantifies the deviation of observed variation from expectation. A positive Gnocchi score (red) indicates depletion of variation (observed<expected) and the higher the score the stronger the depletion; the red dashed line indicates the 99^th percentile of Gnocchi scores across the autosomes (d) or chromosome X (e).
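
As a rough illustration of how a score can quantify "the deviation of observed variation from expectation" for a window, the sketch below uses a signed chi-square-style Z score, equivalent to (expected − observed)/√expected. This record does not state the formula, so the function name `depletion_z`, the formula, and the example counts are assumptions for illustration and may differ from the authors' exact definition.

```python
import math


def depletion_z(observed: int, expected: float) -> float:
    """Signed Z-like score for one window (illustrative assumption).

    Positive when observed < expected (depletion of variation),
    negative when observed > expected, zero when they match.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    chisq = (observed - expected) ** 2 / expected
    z = math.sqrt(chisq)
    return z if observed < expected else -z


# Hypothetical 1kb window: 150 variants expected, 100 observed.
score = depletion_z(100, 150.0)
print(f"score = {score:.2f}")
```

Under this form, a window with a third fewer variants than expected out of 150 would just clear the Gnocchi≥4 threshold that the captions use for "highly constrained".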

**Extended Data Fig. 2:**
Comparison of Gnocchi score between coding and non-coding regions. a, The proportion of highly constrained windows (Gnocchi≥4) as a function of the percentage of coding sequences in a window (left to right: N=1,906/49,525, 3,244/55,676, 2,240/18,461, 1,506/7,094, 969/3,519, 569/1,946, 364/1,223, 283/910, 243/724, 10,392/30,138). The intervals (x-axis) are left exclusive and right inclusive. “Exonic only” refers to the 1kb windows created by directly concatenating coding exons into 1kb sequences. Error bars indicate standard errors of the proportions. b, The exonic-only regions (N=27,875; purple) have significantly higher Gnocchi scores than regions that are exclusively non-coding (N=1,843,559; blue). Dashed lines indicate the medians. c, The proportion of highly constrained windows (Gnocchi≥4) as a function of the proportion of exonic windows being added to the dataset of non-coding windows. d, Gnocchi score percentiles of non-coding versus exonic windows. About 0.05% (100–99.95%) and 3.12% (100–96.88%) of the non-coding windows exhibit similar constraint to the 90^th and 50^th percentiles of exonic regions, respectively.

**Extended Data Fig. 3:**
Estimation of constraint for aggregated regulatory annotations. a,b, Gnocchi scores of aggregated promoter (dark purple), enhancer (light purple), microRNA (miRNA; dark blue), and long non-coding RNA (lncRNA; light blue) annotations are compared against those of exonic (a) and non-coding (b) regions at a 1kb scale. The Gnocchi score percentiles of each annotation (y-axis) are benchmarked by the score deciles of exonic or non-coding regions (10–100 percentiles; x-axis); the grey dashed vertical line indicates the median (50^th percentile).

**Extended Data Fig. 4:**
Applications of Gnocchi for characterizing non-coding regions in addition to existing functional annotations. a, Use of Gnocchi for prioritizing non-coding regions with or without a regulatory annotation (N=464,504 and 1,379,055, respectively). Constrained non-coding regions are enriched for GWAS variants, independent of the candidate cis-regulatory element (cCRE) annotation from ENCODE. Error bars indicate 95% confidence intervals of the odds ratios. b, Use of Gnocchi in statistical fine-mapping. The increase in posterior inclusion probability (PIP) when incorporating Gnocchi score as a functional prior into previous fine-mapping results (that used a uniform prior; denoted as PIP_Gnocchi and PIP_unif, respectively) is shown for 164 new likely causal associations with a PIP_Gnocchi ≥0.8 as a function of PIP_Gnocchi.

**Extended Data Fig. 5:**
Comparison of Gnocchi and other predictive metrics in prioritizing non-coding variants. a, Receiver operating characteristic (ROC) curves of Gnocchi and seven other metrics in classifying putative functional non-coding variants (“positive” variant set) – left to right: 9,229 GWAS Catalog variants, 2,191 GWAS fine-mapping variants, a subset of 140 high-confidence fine-mapped variants, and 1,026 likely pathogenic variants – against a “negative” variant set randomly drawn from the population with a similar allele frequency (AF). AF>5% and allele count (AC)=1 were applied respectively for matching the three GWAS variant sets and the likely pathogenic variant set, based on their AF distributions in TOPMed (shown in b). b, AUCs of the classification with a varying AF threshold for the negative variant set. As most GWAS variants are common and most likely pathogenic variants are very rare (not seen in the population), AF>5% and AC=1 were applied respectively in the primary analyses shown in a.

**Extended Data Fig. 6:**
Comparison of constraint scores built from different mutational models and genomic windows. Gnocchi (presented in this study) outperforms scores rebuilt from mutational models that consider only local sequence context – trinucleotide (trimer-only) or heptanucleotide (heptamer-only) – without adjusting mutation rates for regional genomic features, and its performance is robust to the artificial breaks between genomic windows when computed on 1kb windows sliding by 100bp.

**Extended Data Fig. 7:**
Pairwise correlations between different constraint/conservation metrics. The Spearman’s rank correlation between each pair of the eight metrics was computed based on the mean value of each score on 1kb windows across the genome.

**Extended Data Fig. 8:**
Power of constraint detection. a,b, The sample size required for well-powered non-coding constraint detection. The percentage of non-coding regions powered to detect constraint (Gnocchi≥4) at a 1kb (a) and 100bp (b) scale under varying levels of selection (depletion of variation) is shown as a function of log-scaled sample size. Lighter color indicates milder depletion of variation (weaker selection), which requires a larger sample size to detect constraint; the grey dashed vertical line indicates the current sample size of 76,156 genomes. Dotted curves (left to right) benchmark the 95^th, 90^th, and 50^th percentile of depletion of variation observed in coding exons of similar size. The number of samples required to obtain an 80% detection power is labeled at corresponding benchmarks. c, AUCs of Gnocchi scores computed on different window sizes in identifying putative functional non-coding variants. A window size of 1kb (used in this study) is optimal, giving high performance while maintaining reasonable resolution. d, AUCs of Gnocchi scores computed from different subsets of gnomAD in identifying putative functional non-coding variants. At an equal sample size, the downsampled dataset with diverse ancestries performs better than the Non-Finnish European (NFE)-only dataset.

**Fig. 1:**
Distribution of Gnocchi scores across the genome. a, Histograms of Gnocchi scores for 1,984,900 1kb windows across the human autosomes. Windows overlapping coding regions (N=141,341 with ≥ 1bp coding sequence; red) overall exhibit a higher Gnocchi score (stronger negative selection) than windows that are exclusively non-coding (N=1,843,559; blue); dashed lines indicate the medians. b, The correlation between Gnocchi score and the adjusted proportion of singletons (APS) score developed for structural variation (SV) constraint. A collection of 116,184 autosomal SVs were assessed using Gnocchi by assigning each SV the highest Gnocchi score among all overlapping 1kb windows, which shows a significant positive correlation with the SV constraint metric APS. Error bars indicate 100-fold bootstrapped 95% confidence intervals of the mean values.
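
The assignment rule in panel b (each SV takes the highest Gnocchi score among all overlapping 1kb windows) can be sketched as follows. This is a minimal single-chromosome illustration: the name `max_window_score`, the window-index dictionary, the half-open coordinates, and the toy scores are all assumptions, not the authors' code or data.

```python
from typing import Optional

WINDOW = 1_000  # window size in bp, matching the 1kb windows in the caption


def max_window_score(start: int, end: int, scores: dict[int, float]) -> Optional[float]:
    """Highest score among 1kb windows overlapping the half-open interval [start, end).

    `scores` maps a window index (window_start // WINDOW) to its score.
    Windows absent from the dict (for example, removed by quality filters)
    are skipped; None is returned if no scored window overlaps.
    """
    if end <= start:
        raise ValueError("interval must have positive length")
    first, last = start // WINDOW, (end - 1) // WINDOW
    hits = [scores[i] for i in range(first, last + 1) if i in scores]
    return max(hits) if hits else None


# Toy scores for windows 0, 1 and 3; window 2 is missing (filtered out).
toy_scores = {0: 0.5, 1: 4.2, 3: -1.0}
sv_score = max_window_score(900, 2_100, toy_scores)  # overlaps windows 0, 1, 2
print(sv_score)
```

Taking the maximum rather than the mean means a long SV is scored by its single most constrained window, which is worth keeping in mind when comparing SVs of very different lengths.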

**Fig. 2:**
Correlation between Gnocchi and functional non-coding annotations. a,b, Distributions of candidate regulatory elements (a) and GWAS variants (b) along the spectrum of Gnocchi in non-coding regions. Enrichment was evaluated by comparing the proportion of non-coding 1kb windows, binned by Gnocchi, that overlap with a given functional annotation to the genome-wide average. Error bars indicate 95% confidence intervals of the odds ratios. cCRE, candidate cis-regulatory element: N=34,803 with a promoter-like signature (PLS), N=141,830 with a proximal enhancer-like signature (pELS), N=667,599 with a distal enhancer-like signature (dELS), N=56,766 bound by CTCF without a regulatory signature (CTCF-only); Super enhancers: N=331,601; FANTOM enhancers: N=63,285; GWAS Catalog: N=111,308 variants with an association P ≤5.0×10⁻⁸, N=9,229 with an independent replication; GWAS fine-mapping: N=2,191 variants fine-mapped with a posterior inclusion probability of causality≥0.9. See Methods for details on data collection. c, Enrichment of fine-mapped variants in constrained non-coding regions (Gnocchi≥4). Credible set (CS)-trait pairs with a significant enrichment are shown, ordered by the lower bound of the 95% confidence interval; only lower bounds are shown for presentation purposes. d, The distribution of variants fine-mapped for coronary artery disease (CAD) in constrained regions (Gnocchi≥4) of *PLG*. Each bar shows the Gnocchi score of a 1kb window (gaps indicate windows removed by quality filters); windows containing fine-mapped variants are colored purple, and the number of variants in each window is annotated on top of the corresponding bar. Ten variants are located within *PLG* introns, four are mapped to the antisense gene of *PLG* (ENSG00000287558), and 14 reside in the downstream intergenic regions.
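
For readers unfamiliar with the error bars in panels a and b, the sketch below computes an odds ratio with a Wald 95% confidence interval from a 2×2 table of window counts. The caption does not say how the intervals were obtained, so the Wald method, the in-bin versus all-other-windows table layout, the name `odds_ratio_ci`, and the counts are assumptions for illustration only.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: windows in the score bin that overlap the annotation
    b: windows in the score bin that do not
    c: windows outside the bin that overlap the annotation
    d: windows outside the bin that do not
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the paper.
or_, lo, hi = odds_ratio_ci(30, 70, 100, 900)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

An interval whose lower bound stays above 1, as in this toy table, is what panel c uses to rank enriched credible set-trait pairs.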

**Fig. 3:**
Performance of Gnocchi and other predictive metrics in prioritizing non-coding variants. a,b, Receiver operating characteristic (ROC) curves of Gnocchi and seven other metrics in classifying putative functional non-coding variants – 2,191 GWAS fine-mapping variants (a) and 1,026 likely pathogenic variants (b) – against background variants in the population. The performance of each metric was measured and ranked by the area under curve (AUC) statistic. c,d, The relative contribution of different metrics in classifying GWAS variants (c) and likely pathogenic variants (d). The eight metrics were modeled as eight independent predictors for the classification, and the relative contribution of one predictor over another was evaluated by estimating their additional $R^{2}$ contributions across all subset models.
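
The AUC statistic used to rank the metrics can be read as the probability that a randomly chosen positive variant scores higher than a randomly chosen background variant. The sketch below computes it directly from that pairwise definition (ties count as one half); the function name `auc` and the toy scores are assumptions for illustration, and a library routine would normally be used for large variant sets.

```python
def auc(positives: list[float], negatives: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    has the higher score, counting ties as half a win.
    """
    if not positives or not negatives:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy constraint scores: three "positive" variants, four background variants.
toy_auc = auc([4.1, 2.0, 3.5], [1.0, 2.0, 0.2, 3.0])
print(toy_auc)
```

An AUC of 0.5 corresponds to a metric that ranks positives no better than chance, which is the baseline against which the curves in panels a and b are compared.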

**Fig. 4:**
Contribution of non-coding constraint in evaluating copy number variants (CNVs). a, Proportions of constrained CNVs (Gnocchi≥4) identified in individuals with developmental delay (DD cases) versus healthy controls. Constrained CNVs are more common in DD cases than controls (7,239/17,004=42.6% versus 10,403/83,526=12.5%) and are most frequent for CNVs previously implicated as pathogenic (18/19=94.7% by DD and 3,433/4,014=85.5% by ClinVar). Error bars indicate standard errors of the proportions. b, Contribution of non-coding constraint to predicting CNVs in DD cases versus controls. Non-coding constraint remains a significant predictor for the case/control status of CNVs after adjusting for gene constraint (LOEUF score), gene number, and size of CNVs (N_case=17,004, N_control=83,526; purple), as well as being tested in the subset of non-coding CNVs (N_case=8,702, N_control=66,795; blue). Error bars indicate 95% confidence intervals of the log odds ratios. c, CNVs at the *IHH* locus associated with synpolydactyly and craniosynostosis. The four implicated duplications (grey horizontal bars) span a ~102kb sequence upstream of *IHH*. Each vertical bar shows the Gnocchi score of a 1kb window within the locus, with the highest score overlapping the *IHH* gene (red) and the highest non-coding score overlapping the major *IHH* enhancers (purple); gaps indicate windows removed by quality filters. d, Non-coding CNVs with the highest Gnocchi score identified in DD cases. The highest-scored window is located within the potential “critical region” (purple vertical bars) shared by 12 DD deletions (red horizontal bars; grey indicates two deletions observed in controls). The critical region overall has a significantly higher Gnocchi score than the other regions affected by DD or control deletions, as shown in the kernel density estimate (KDE) plot on the right.
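
The proportions quoted in panel a can be reproduced from the stated counts, together with a standard error for each. The caption does not say how the standard errors were calculated, so the binomial formula √(p(1−p)/n) and the name `proportion_with_se` are assumptions in this sketch; only the counts come from the caption.

```python
import math


def proportion_with_se(k: int, n: int) -> tuple[float, float]:
    """Proportion k/n and its binomial standard error sqrt(p * (1 - p) / n)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n and n > 0")
    p = k / n
    return p, math.sqrt(p * (1 - p) / n)


# Counts from the caption: constrained CNVs among DD cases and among controls.
p_case, se_case = proportion_with_se(7_239, 17_004)
p_ctrl, se_ctrl = proportion_with_se(10_403, 83_526)
print(f"cases:    {p_case:.1%} (SE {se_case:.2%})")
print(f"controls: {p_ctrl:.1%} (SE {se_ctrl:.2%})")
```

With samples this large, the standard errors come out well under one percentage point, so the error bars on these two groups are small relative to the roughly 30-point gap between them.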

**Fig. 5:**
Correlation of constraint between non-coding regulatory elements and protein-coding genes. a, The proportion of non-coding 1kb windows overlapping with enhancers that were predicted to regulate specific genes, as a function of their Gnocchi scores. More constrained non-coding regions are more frequently linked to a gene (left to right: N=2,022/62,894, 2,743/62,653, 7,475/134,279, 20,383/252,354, 43,414/376,829, 66,343/417,743, 65,343/313,110, 38,785/152,787, 15,417/51,439, 6,663/19,471). Error bars indicate standard errors of the proportions. b, Comparison of the Gnocchi scores of enhancers linked to constrained and unconstrained genes. Enhancers of established sets of constrained genes (four blue boxes: N=189 haploinsufficient genes, N=2,454 essential genes, N=1,771 autosomal dominant disease genes, N=1,920 LOEUF-predicted constrained genes) are more constrained than enhancers of presumably less constrained genes (two grey boxes: N=356 olfactory receptor genes, N=189 LOEUF-predicted unconstrained genes). Enhancers of genes that are underpowered for gene constraint detection (“LOEUF underpowered”, N=1,117) are more constrained than those of powered yet unconstrained genes (“LOEUF unconstrained”). The box plots show the distribution of Gnocchi scores of enhancers linked to different gene sets, denoting the median, quartiles and range (excepting outliers). c, Improvement from incorporating enhancer constraint into LOEUF in prioritizing underpowered genes. ROC curves and AUCs show the performance of two logistic regression models using LOEUF (blue) and LOEUF+enhancer Gnocchi score (purple) as independent predictive variables to classify constrained and unconstrained genes, tested on a set of 77 underpowered genes. d, Contribution of enhancer constraint to predicting gene expression in specific tissue types. The x-axis shows the linear regression coefficient of tissue-specific enhancer Gnocchi score predicting the expression level of target genes in matched tissue types (N_HSC&B-cell=11,970, N_Brain=11,555, N_Heart=10,759, N_Pancreas=10,572, N_Blood&T-cell=10,403, N_Muscle=10,380, N_Adipose=9,316, N_Liver=8,838, N_Spleen=8,308, N_Ovary=7,926, N_Lung=7,499), conditioning on gene constraint (LOEUF score). Error bars indicate 95% confidence intervals of the coefficient estimates.

References

1. Short PJ et al. De novo mutations in regulatory elements in neurodevelopmental disorders. Nature 555, 611–616, doi:10.1038/nature25983 (2018).
2. Satterstrom FK et al. Large-Scale Exome Sequencing Study Implicates Both Developmental and Functional Changes in the Neurobiology of Autism. Cell 180, 568–584 e523, doi:10.1016/j.cell.2019.12.036 (2020).
3. Singh T et al. The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nat Genet 49, 1167–1173, doi:10.1038/ng.3903 (2017).
4. Ganna A et al. Quantifying the Impact of Rare and Ultra-rare Coding Variation across the Phenotypic Spectrum. Am J Hum Genet 102, 1204–1211, doi:10.1016/j.ajhg.2018.05.002 (2018).
5. Karczewski KJ et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443, doi:10.1038/s41586-020-2308-7 (2020).
