Nat Genet. 2021 Aug;53(8):1260-1269.
doi: 10.1038/s41588-021-00892-1. Epub 2021 Jul 5.

Whole-exome imputation within UK Biobank powers rare coding variant association and fine-mapping analyses

Alison R Barton et al. Nat Genet. 2021 Aug.

Abstract

Exome association studies to date have generally been underpowered to systematically evaluate the phenotypic impact of very rare coding variants. We leveraged extensive haplotype sharing between 49,960 exome-sequenced UK Biobank participants and the remainder of the cohort (total n ≈ 500,000) to impute exome-wide variants with accuracy R² > 0.5 down to minor allele frequency (MAF) ~0.00005. Association and fine-mapping analyses of 54 quantitative traits identified 1,189 significant associations (P < 5 × 10⁻⁸) involving 675 distinct rare protein-altering variants (MAF < 0.01) that passed stringent filters for likely causality. Across all traits, 49% of associations (578/1,189) occurred in genes with two or more hits; follow-up analyses of these genes identified allelic series containing up to 45 distinct 'likely-causal' variants. Our results demonstrate the utility of within-cohort imputation in population-scale genome-wide association studies, provide a catalog of likely-causal, large-effect coding variant associations and foreshadow the insights that will be revealed as genetic biobank studies continue to grow.

Figures

Figure 1. Whole-exome imputation, association, and fine-mapping identify rare coding variants likely to causally associate with 54 quantitative traits.
Imputation panel coverage (a) and imputation accuracy (b) assessed using SNP calls from the second release of UK Biobank whole exome-sequencing data (N=200,643; accuracy benchmarks excluded individuals in the initial release). Data are presented as mean values. Error bars, 95% CIs. (c) Schematic of our analytical pipeline, which combined UK Biobank whole-exome sequences with SNP-array genotypes to impute exome-wide genotypes into the full cohort. We analyzed imputed exome variants together with the genome-wide UK Biobank imputation release to find significant variant-trait associations independent of neighboring variants, and we restricted to rare (MAF<0.01) protein-altering variants with CADD ≥ 20 or SpliceAI support to form a final list of likely-causal variants. (d) Distribution of first UK Biobank genetic data set in which each association could have been detected. Roughly one-third of all likely-causal variants – and nearly all very rare likely-causal variants – were only discoverable using WES imputation. (e) WES imputation enabled identification of new rare coding variants for all but one trait (immature reticulocyte fraction) among 54 quantitative traits analyzed.
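The final filtering step described in panel (c) can be sketched as a simple predicate. This is only an illustration of the stated criteria (genome-wide significance, independence from neighboring variants, MAF < 0.01, protein-altering, and CADD ≥ 20 or SpliceAI support); the `Variant` record and all of its field names are hypothetical and are not taken from the authors' pipeline.

```python
from dataclasses import dataclass

# Thresholds quoted in the abstract and the Figure 1 caption.
GENOME_WIDE_P = 5e-8
MAF_MAX = 0.01
CADD_MIN = 20.0


@dataclass
class Variant:
    """Hypothetical per-association record; field names are assumptions."""
    variant_id: str
    maf: float                      # minor allele frequency
    p_value: float                  # association P value for one trait
    protein_altering: bool          # annotation: alters the protein product
    cadd: float                     # CADD deleteriousness score
    spliceai_support: bool          # annotation: SpliceAI predicts a splice effect
    independent_of_neighbors: bool  # survived the fine-mapping / LD-based filters


def is_likely_causal(v: Variant) -> bool:
    """Apply the caption's criteria; every condition must hold."""
    return (
        v.p_value < GENOME_WIDE_P
        and v.independent_of_neighbors
        and v.maf < MAF_MAX
        and v.protein_altering
        and (v.cadd >= CADD_MIN or v.spliceai_support)
    )
```

Note that the CADD and SpliceAI conditions are alternatives: a variant with a low CADD score still passes if it has SpliceAI support, provided the other conditions hold.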
Figure 2. Association analyses of the subsequent N=200,643 UK Biobank exome release demonstrate robustness of likely-causal variant-trait associations ascertained using genotypes imputed from N=49,960 exomes.
For each likely-causal association, we repeated the association analysis (i) restricting to the N=200,643 cohort, but still using imputed genotypes (x-axis); or (ii) restricting to the N=200,643 cohort and using genotypes directly derived from exome sequencing (y-axis). Only 613 of 1,189 likely-causal associations from the imputed N=487,409 data set reached significance (BOLT-LMM P<5 × 10⁻⁸; red line in panel b) using the N=200,643 exomes alone. Association test statistics were highly correlated (Pearson R=0.96) between these two approaches. Only 6 associations involving 5 distinct variants (1:120463017:C:T, 2:174130918:G:A, 11:48285468:G:A, 16:2287866:G:A, 20:30610469:G:T) decreased in strength by >2-fold in the direct analysis, potentially due to inaccurate imputation or inaccurate genotyping.
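The comparison in this caption amounts to two small computations: a Pearson correlation between paired test statistics, and a check for associations whose strength dropped more than 2-fold in the direct analysis. A minimal sketch follows, assuming that "strength" means the association test statistic itself (the caption does not say which quantity was used); the function names are illustrative only.

```python
import math
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_weakened(
    imputed_stats: Sequence[float],
    direct_stats: Sequence[float],
    fold: float = 2.0,
) -> List[int]:
    """Indices where the direct statistic is more than `fold` times
    smaller than the imputed statistic (assumed definition of 'decreased
    in strength by >2-fold')."""
    return [
        i
        for i, (imp, direct) in enumerate(zip(imputed_stats, direct_stats))
        if direct < imp / fold
    ]
```

With statistics from both analyses in hand, `pearson_r(imputed, direct)` would give the kind of agreement figure the caption reports, and `flag_weakened(imputed, direct)` would list the associations worth inspecting for imputation or genotyping errors.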
Figure 3. Likely-causal coding variants are rare and enriched for deleteriousness.
(a) Likely-causal variants (pink, n=675) had minor allele frequencies distributed relatively evenly across the range under consideration (MAF = 10⁻⁵ to 10⁻²), whereas variants that failed linkage disequilibrium (LD)-based filters (blue, n=898) tended to be less rare. (b) Likely-causal variants had elevated CADD scores compared to those that failed LD-based filters and compared to a randomly-sampled background distribution of rare coding variants (green, n=47,002). (c) Likely-causal variants were enriched for predicted loss-of-function mutations. Bar height represents identified fraction. Error bars estimate sampling uncertainty based on a binomial model, 95% CIs. (d) Likely-causal missense variants were enriched for higher-impact amino acid substitutions (as measured by more negative BLOSUM62 scores).
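The error bars in panel (c) are described only as 95% CIs from a binomial model. One common way to obtain such an interval for a fraction is the Wilson score interval, sketched below. The choice of Wilson rather than another binomial interval is an assumption made for illustration; the caption does not state which construction was used.

```python
import math
from typing import Tuple

Z_95 = 1.959963984540054  # two-sided 95% standard normal quantile


def wilson_interval(successes: int, n: int, z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point round-off at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For a bar whose height is the fraction of variants in some category, `wilson_interval(count_in_category, total_variants)` would give lower and upper error-bar limits under this assumption.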
Figure 4. Many genes contain long allelic series of rare coding variants with consistent effect directions.
(a-d) Allelic series of rare coding variants with statistically independent phenotype associations (reaching FDR<0.05 significance) for: (a) PCSK9 and LDL cholesterol, (b) IQGAP2 and mean platelet volume, (c) IFRD2 and high light scatter reticulocyte count, and (d) NPR2 and height. Top, protein structures with altered amino acids (modified by missense variants) color-coded by effect direction (red for trait-increasing variants and blue for trait-decreasing variants). Bottom, per-variant effect sizes (data point represents mean value; error bars, 95% CIs) and allele frequencies. Protein structures were previously determined experimentally (for PCSK9 and IQGAP2) or computationally predicted (for IFRD2 and NPR2). Functional domains of PCSK9 are shaded in different colors. IQGAP2 is represented as a homodimer in its crystal structure. (e) Distributions of effect directions for all gene-trait pairs with 10 or more variants in an allelic series.
