Nature. 2022 Mar;603(7899):95-102.
doi: 10.1038/s41586-022-04394-w. Epub 2022 Feb 23.

Genetic associations of protein-coding variants in human disease

Benjamin B Sun et al. Nature. 2022 Mar.

Abstract

Genome-wide association studies (GWAS) have identified thousands of genetic variants linked to the risk of human disease. However, GWAS have so far remained largely underpowered in relation to identifying associations in the rare and low-frequency allelic spectrum and have lacked the resolution to trace causal mechanisms to underlying genes [1]. Here we combined whole-exome sequencing in 392,814 UK Biobank participants with imputed genotypes from 260,405 FinnGen participants (653,219 total individuals) to conduct association meta-analyses for 744 disease endpoints across the protein-coding allelic frequency spectrum, bridging the gap between common and rare variant studies. We identified 975 associations, with more than one-third being previously unreported. We demonstrate population-level relevance for mutations previously ascribed to causing single-gene disorders, map GWAS associations to likely causal genes, explain disease mechanisms, and systematically relate disease associations to levels of 117 biomarkers and clinical-stage drug targets. Combining sequencing and genotyping in two population biobanks enabled us to benefit from increased power to detect and explain disease associations, validate findings through replication and propose medical actionability for rare genetic variants. Our study provides a compendium of protein-coding variant associations for future insights into disease biology and drug discovery.

Conflict of interest statement

B.B.S., H.R., C.-Y.C., E.M., J.W. and members of the Biogen Biobank Team are employees of Biogen. M.J.D. is a founder of Maze Therapeutics. The other authors declare no competing interests.

Figures

Fig. 1. Coding genetic associations with disease.
a, Summary of sentinel variant associations. Size of the point is proportional to effect size. −log₁₀(P) capped at −log₁₀(10⁻⁵⁰). Labels highlight pleiotropic associations (≥5 trait clusters). Colours indicate disease groups. Shapes indicate novel and known (grey circles) associations. Dotted horizontal lines indicate −log₁₀(2 × 10⁻⁹) (brown) and −log₁₀(5 × 10⁻⁸) (grey). b, Comparison of sentinel variant MAF between UKB and FG data. c, Effect size against MAF of sentinel variants. Dashed lines indicate MAF of 0.1% (left) and 1% (right). Genes for coding variant associations with absolute effect size greater than 2 or MAF less than 0.1% are labelled. d, Surface plot of effects of cohort-specific allele enrichment on inverse-variance-weighted (IVW) meta-analysis z-scores (IVW uplift) across MAFs (up to MAF 1%). Uplift is defined as the ratio of meta-analysed IVW z-score to the z-score of an individual study (details in Supplementary Information). e, Density plot of MAF for sentinel variants for known versus novel associations. Interactive Manhattan plot for novel associations and allelic enrichment surface plots are provided as Supplementary Files 1 and 2.
Fig. 2. Biomarker associations with sentinel variants.
a, Heat map of sentinel associations with biomarkers. Only significant associations (P < 10⁻⁶) are shown. Colours on the left axis indicate chromosomes, with cyan indicating the MHC region. Colours on the right axis indicate sentinel association with disease by group. Colours along the top indicate the category of biomarkers. b, Forest plot of associations (unadjusted regression effect estimates with 95% confidence intervals (CI)) between SLC34A1 deletion (rs1460573878) and haematological and biochemistry biomarkers. Associations with P < 0.05 are shown. c, Forest plot of associations (unadjusted regression effect estimates with 95% confidence intervals (CI)) between CHEK2 deletion (rs555607708) and haematological biomarkers. Unadjusted P values are shown. Disease associations n = 653,219 biologically independent samples. Specific sample sizes for biomarker associations are listed in Supplementary Table 8. IGF-1, insulin-like growth factor 1; LDL, low-density lipoprotein; SHBG, sex hormone binding globulin.
Fig. 3. Genetic and functional insights into atrial fibrillation.
Clustered Mendelian randomization plot of association of atrial fibrillation loci with pulse rate. Only variants with cluster inclusion probability greater than 0.7 are included. Top left, CWAS loci (sentinels). Top right, overlapping CWAS and atrial fibrillation GWAS loci. Bottom left, all atrial fibrillation GWAS loci from Nielsen et al. (with zoomed inset). Bottom right, all atrial fibrillation GWAS loci with permuted pulse (null, with zoomed inset).
Extended Data Fig. 1. UKB and FG study overview.
Extended Data Fig. 2. Case count comparison between UKB and FG across disease groups.
Diseases within each group are listed in Supplementary Table 2. Only cases >100 in UKB/FG are included. R: Spearman’s correlation for FG R5 (red) and R6 (blue).
Extended Data Fig. 3. Distribution of variant annotation categories.
Left: all variants tested. Right: variants with at least 1 significant association (p < 5 × 10⁻⁸). pLOF: predicted loss of function. LC: low confidence loss of function.
Extended Data Fig. 4. Inflation factors and FG-UKB effect size comparisons.
(a) Distribution of inflation factors of CWAS meta-analysis. (b) Effect size comparison between UKB and FG. Inset: zoomed in on small effect sizes. R: Spearman’s rank correlation (two-sided test), p = 4.4 × 10⁻³⁵¹.
Extended Data Fig. 5. Surface plot of effects of cohort-specific allele enrichment on inverse-variance-weighted meta-analysis z-scores (IVW uplift) across MAFs (up to MAF 1%).
Uplift is defined as the ratio of meta-analysed IVW Z-score to the Z-score of an individual study. (a) Theoretically predicted IVW uplift. (b) Observed IVW uplift. (c) Median absolute relative error (MARE, %) between simulated and theoretical IVW uplift values. For each combination of MAF and allelic enrichment, we simulated 1000 datasets for two binary variables reflecting disease status for two studies. Study sample size and disease prevalence were fixed (matching estimates from UKB and FG), genomic effects were randomly sampled from the set of positive effect sizes in UKB and FG (Supplementary Table 3), MAF was varied from 0.01% to 1% and allele enrichment (in the smaller study) ranged from 1 to 50. (d) Comparison of Z-scores for randomly subsetted UKB data meta-analysed with FinnGen (UKBxFG) against subsetted UKB meta-analysed with a sample-size-matched UKB cohort (UKBxUKB') across allele fold enrichments for sentinel associations (Supplementary Table 3) with MAF < 0.1. Y-axis (log₁₀(FG/UKB meta Z ratio)): log₁₀(Z_UKBxFG/Z_UKBxUKB'). X-axis (MAF enrichment ratio): allelic fold enrichment (FE), where the pink side denotes greater enrichment in UKB and the blue side denotes greater enrichment in FG. Each box plot presents the median, first and third quartiles, with upper and lower whiskers representing 1.5× inter-quartile range above and below the third and first quartiles respectively. N for each boxplot from left to right: 88, 99, 212, 453, 213, 71, 126, 158.
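The uplift definition in this caption can be illustrated with a minimal sketch, assuming a standard fixed-effect inverse-variance-weighted meta-analysis of per-study effect estimates and standard errors. The function names and the input numbers below are hypothetical and are not taken from the study; the authors' exact procedure is described in their Supplementary Information.

```python
import math

def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis.

    betas: per-study effect estimates; ses: per-study standard errors.
    Returns the pooled effect, its standard error, and the pooled z-score.
    """
    weights = [1.0 / se ** 2 for se in ses]   # weight = 1 / variance
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se

def ivw_uplift(betas, ses, study=0):
    """Ratio of the meta-analysed z-score to the z-score of one study."""
    _, _, z_meta = ivw_meta(betas, ses)
    z_study = betas[study] / ses[study]
    return z_meta / z_study

# Hypothetical inputs: two studies with identical effects and precision.
example_betas = [0.5, 0.5]
example_ses = [0.1, 0.1]
example_uplift = ivw_uplift(example_betas, example_ses, study=0)
print(round(example_uplift, 4))
```

With two equally precise studies reporting the same effect, the pooled standard error shrinks by a factor of the square root of 2, so the uplift relative to either study is about 1.41. When a rare allele is enriched in one cohort, that cohort's standard error is smaller and the uplift relative to the other cohort grows accordingly.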
Extended Data Fig. 6. Histogram of genes with associations, disease and biomarker associations per region.
(a) Number of genes with coding associations per region. Each disease cluster counted separately. MHC region excluded. (b) Number of associated trait clusters (p < 5 × 10⁻⁸) per region. Inset shows zoomed-in x-scale between 0 and 12 trait cluster associations per region. (c) Number of associated biomarker groups per locus (p < 1 × 10⁻⁶). MHC: Major Histocompatibility Complex.
Extended Data Fig. 7
(a) Simplified diagram of the coagulation cascade. Factors (in roman numerals, “a” represents activated) with genetic association with PE highlighted in orange. Blue line (round end) indicates inhibitory effect of APC on VIIIa and Va. (b) Schematic of potential pathway from missense variants in F5 and F10 to PE risk. Factor V Leiden variant had null associations with F5 levels (β for F5 levels = 0.21, p = 0.091). Dashed blue lines suggest effect of the variants on PE risk, which we assume under the MR framework acts through factor levels (solid blue lines). Grey box and arrows represent known pathway for Factor V Leiden mutation. GOF: Gain of function, APC: Activated protein C, MR: Mendelian randomisation, PE: Pulmonary embolism.
Extended Data Fig. 8. In vitro functional effects of the PITX2c Pro41Ser variant (rs143452464).
(a) Schematic of the location of the Pro41Ser variant in PITX2c as compared with the PITX2a splicing isoform. Numbers below each row indicate AA number from N-terminal (left) to C-terminal (right). AD1: common sequence, HD: homeodomain, ID1: transcriptional inhibitory domain 1, AD2: second common sequence, ID2: transcriptional inhibitory domain 2. Pro41Ser lies within the terminal domain (grey), near the 5-amino acid LAMAS (single amino acid code) sequence (33 to 37, red), which is important for transcriptional activity of the N-terminal of PITX2c. (b) Reporter gene assays: TM-1 cells (transformed human trabecular meshwork cells) were co-transfected with a firefly luciferase reporter plasmid containing a PITX2c binding element, a β-galactosidase control vector and an expression vector for PITX2c (wild-type (WT), Pro41Ser (P41S) or empty control vector (EV)). The activity of firefly luciferase upon activation by PITX2c (n = 3 transfections per condition) was normalized to β-galactosidase. Experiments with a truncated reporter construct ("−163/+165Δ"), containing a deletion of 8 bp within the predicted PITX2 binding site, are shown as an additional control. Data are presented as mean values ± SEM, with unadjusted p-values derived from two-sided t-tests (EVΔ vs EV: 0.122; WTΔ vs WT: 0.074; P41SΔ vs P41S: 0.018; WT vs EV: 0.015; P41S vs WT: 0.0056; WTΔ vs EVΔ: 0.002; P41SΔ vs WTΔ: 0.0005).

References

    1. Claussnitzer M, et al. A brief history of human disease genetics. Nature. 2020;577:179–189.
    2. Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209.
    3. Van Hout CV, et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature. 2020;586:749–756.
    4. Szustakowski JD, et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat. Genet. 2021;53:942–948.
    5. Wang Q, et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature. 2021;597:527–532.
