Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Jun;570(7759):71-76.
doi: 10.1038/s41586-019-1231-2. Epub 2019 May 22.

Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls

Jason Flannick  1   2   3   4 Josep M Mercader #  5   6   7   8   9 Christian Fuchsberger #  10   11   12 Miriam S Udler #  5   6   7   8   9 Anubha Mahajan #  13   14 Jennifer Wessel  15   16   17 Tanya M Teslovich  18 Lizz Caulkins  5   6 Ryan Koesterer  5   6 Francisco Barajas-Olmos  19 Thomas W Blackwell  10   12 Eric Boerwinkle  20   21 Jennifer A Brody  22 Federico Centeno-Cruz  19 Ling Chen  8   9 Siying Chen  10   12 Cecilia Contreras-Cubas  19 Emilio Córdova  19 Adolfo Correa  23 Maria Cortes  24 Ralph A DeFronzo  25 Lawrence Dolan  26 Kimberly L Drews  27 Amanda Elliott  5   6   8   9 James S Floyd  28 Stacey Gabriel  24 Maria Eugenia Garay-Sevilla  29   30 Humberto García-Ortiz  19 Myron Gross  31 Sohee Han  32 Nancy L Heard-Costa  33   34 Anne U Jackson  10   12 Marit E Jørgensen  35   36   37 Hyun Min Kang  10   12 Megan Kelsey  27 Bong-Jo Kim  32 Heikki A Koistinen  38   39   40 Johanna Kuusisto  41   42 Joseph B Leader  43 Allan Linneberg  44   45   46 Ching-Ti Liu  47 Jianjun Liu  48   49   50 Valeriya Lyssenko  51   52 Alisa K Manning  9   53 Anthony Marcketta  18 Juan Manuel Malacara-Hernandez  29   30 Angélica Martínez-Hernández  19 Karen Matsuo  10   12 Elizabeth Mayer-Davis  54 Elvia Mendoza-Caamal  19 Karen L Mohlke  55 Alanna C Morrison  20 Anne Ndungu  13 Maggie C Y Ng  56   57   58 Colm O'Dushlaine  18 Anthony J Payne  13 Catherine Pihoker  59 Broad Genomics PlatformWendy S Post  60 Michael Preuss  61 Bruce M Psaty  62   63   64   65   66 Ramachandran S Vasan  34   67 N William Rayner  13   14   68 Alexander P Reiner  65 Cristina Revilla-Monsalve  69 Neil R Robertson  13   14 Nicola Santoro  70 Claudia Schurmann  61 Wing Yee So  71   72   73 Xavier Soberón  19 Heather M Stringham  10   12 Tim M Strom  74   75 Claudia H T Tam  71   72   73 Farook Thameem  76 Brian Tomlinson  71 Jason M Torres  13 Russell P Tracy  77   78 Rob M van Dam  49   50   79 Marijana Vujkovic  80 Shuai Wang  47 Ryan P Welch  10   12 Daniel R Witte  81   82 Tien-Yin Wong  83   84   85 Gil Atzmon  86   87   88 Nir Barzilai  86   88 John Blangero  89   90 Lori L Bonnycastle  91 Donald W Bowden  56   57   58 John C Chambers  92   93   94 Edmund Chan  49 Ching-Yu Cheng  95 Yoon Shin Cho  96 Francis S Collins  91 Paul S de Vries  20 Ravindranath Duggirala  89   90 Benjamin Glaser  97 Clicerio Gonzalez  98 Ma Elena Gonzalez  99 Leif Groop  51   100 Jaspal Singh Kooner  101 Soo Heon Kwak  102 Markku Laakso  41   42 Donna M Lehman  25 Peter Nilsson  103 Timothy D Spector  104 E Shyong Tai  49   50   84 Tiinamaija Tuomi  100   105   106   107 Jaakko Tuomilehto  108   109   110   111 James G Wilson  112 Carlos A Aguilar-Salinas  113 Erwin Bottinger  61 Brian Burke  27 David J Carey  43 Juliana C N Chan  71   72   73 Josée Dupuis  34   47 Philippe Frossard  114 Susan R Heckbert  62   65 Mi Yeong Hwang  32 Young Jin Kim  32 H Lester Kirchner  43 Jong-Young Lee  115 Juyoung Lee  32 Ruth J F Loos  61   116 Ronald C W Ma  71   72   73 Andrew D Morris  117 Christopher J O'Donnell  118   119   120   121 Colin N A Palmer  122 James Pankow  123 Kyong Soo Park  101   124   125 Asif Rasheed  114 Danish Saleheen  80   114 Xueling Sim  50 Kerrin S Small  104 Yik Ying Teo  50   126   127 Christopher Haiman  128 Craig L Hanis  129 Brian E Henderson  128 Lorena Orozco  19 Teresa Tusié-Luna  113   130 Frederick E Dewey  18 Aris Baras  18 Christian Gieger  131   132 Thomas Meitinger  74   75   133 Konstantin Strauch  131   134 Leslie Lange  135 Niels Grarup  136 Torben Hansen  136   137 Oluf Pedersen  136 Philip Zeitler  138 Dana Dabelea  139 Goncalo Abecasis  10   12 Graeme I Bell  29   30 Nancy J Cox  140 Mark Seielstad  141   142 Rob Sladek  143   144   145 James B Meigs  24   9   146 Steve S Rich  147 Jerome I Rotter  148   149   150 DiscovEHR CollaborationCHARGELuCampProDiGYGoT2DESPSIGMA-T2DT2D-GENESAMP-T2D-GENESDavid Altshuler  5   6   8   9   151   152   153 Noël P Burtt  5   6 Laura J Scott  10   12 Andrew P Morris  13   154 Jose C Florez  5   6   7   8   9 Mark I McCarthy  13   14   155 Michael Boehnke  10   12
Affiliations

Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls

Jason Flannick et al. Nature. 2019 Jun.

Abstract

Protein-coding genetic variants that strongly affect disease risk can yield relevant clues to disease pathogenesis. Here we report exome-sequencing analyses of 20,791 individuals with type 2 diabetes (T2D) and 24,440 non-diabetic control participants from 5 ancestries. We identify gene-level associations of rare variants (with minor allele frequencies of less than 0.5%) in 4 genes at exome-wide significance, including a series of more than 30 SLC30A8 alleles that conveys protection against T2D, and in 12 gene sets, including those corresponding to T2D drug targets (P = 6.1 × 10-3) and candidate genes from knockout mice (P = 5.2 × 10-3). Within our study, the strongest T2D gene-level signals for rare variants explain at most 25% of the heritability of the strongest common single-variant signals, and the gene-level effect sizes of the rare variants that we observed in established T2D drug targets will require 75,000-185,000 sequenced cases to achieve exome-wide significance. We propose a method to interpret these modest rare-variant associations and to incorporate these associations into future target or gene prioritization efforts.

PubMed Disclaimer

Conflict of interest statement

P.Z. is a consultant for Merck, Daichii-Sankyo, Boerhinger-Ingelheim and Janssen; B.M.P. serves on the DSMB of a clinical trial funded by Zoll LifeCor and on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson.

Figures

Fig. 1
Fig. 1. Exome-wide association analysis.
a, Single-variant associations were calculated using the two-sided EMMAX test. Red line, P = 4.3 × 10−7. b, Gene-level P values, corrected for four analyses performed using the two-sided minimum P-value test. The most-significant genes are labelled. Red line, P = 6.5 × 10−7. c, SLC30A8 gene-level P values (left y axis, black line), calculated using a two-sided burden test after removing variants (in order of increasing single-variant P value) from the 1/5 1% mask (the strongest signal for SLC30A8). Dashed line, P = 0.05. Right y axis (blue line), estimated effect size (log10(odds ratio)). Blue shading, 95% confidence interval. Dotted line, effect size = 0. Single-variant n = 45,231 individuals. Gene-level n = 43,074 unrelated individuals.
Fig. 2
Fig. 2. Gene set analysis.
ae, Rank percentiles (1 = highest) for gene-level associations (compared to matched genes) within 11 monogenic diabetes genes (548 matched genes) (a), 8 T2D drug targets (400 matched genes) (b), 31 genes linked to NIDD in mice (1,499 matched genes) (c), 323 genes linked to impaired glucose tolerance in mice (10,043 matched genes) (d) and 11 genes with common likely causal coding variants (537 matched genes) (e). P values are from a one-sided Wilcoxon rank-sum test between each gene set and comparison set. Labels indicate minimum, 25th percentile, median, 75th percentile and maximum. f, Estimated odds ratios, from the weighted burden test of the 5/5 mask, for 8 T2D drug targets (red, agonists; blue, inhibitors). Data are mean ± s.e. of log(odds ratio) from the burden test. n = 43,071 unrelated individuals.
Fig. 3
Fig. 3. Comparison of exome-sequencing to array-based GWAS.
a, Single-variant associations, calculated by two-sided Firth logistic regression, from an array-based imputed GWAS of the subset (n = 34,529) of samples in the exome-sequencing analysis for which array data were available. Labels and axes as described in Fig. 1a. b, The observed liability variance explained (LVE) by the top 19 exome-sequencing gene-level associations (red) and the top 19 imputed GWAS single-variant associations (maximum of 1 per 250 kb; blue) and their ratio (black). Signals ranked by liability variance explained. c, Gene rank percentiles from exome-sequencing gene-level analysis (x axis) and a previous multi-ancestry T2D GWAS (y axis). Genes are from the mouse NIDD gene set (Fig. 2c).
Fig. 4
Fig. 4. Decision support from exome-sequencing data.
a, Estimated power, as a function of future sample size (x axis), to detect T2D gene-level associations (two-sided test at P = 6.25 × 10−7) with aggregate frequency and odds ratios equal to those estimated from our analysis of 8 T2D drug targets (Fig. 2f). b, A proposed workflow for using exome-sequencing data in gene characterization. Depending on the prior belief in the disease relevance of a gene, the cost of experimental characterization and the benefit of gene validation, a decision to conduct the experiment could be informed by the posterior probability of the disease relevance of the gene, as estimated from exome-sequencing association statistics (available through http://www.type2diabetesgenetics.org/).
Extended Data Fig. 1
Extended Data Fig. 1. Power analysis.
The power to detect associations (using a two-sided test) at P < 5 × 10−8 for variants (or collections of variants) with a given minor allele frequency (x axis) and odds ratio (y axis) measured as the average across all ancestries. a, Cells are shaded according to the power of the current study of 20,791 T2D cases and 24,440 controls, with white indicating high power and red indicating low power. The 20%, 50%, 80% and 99% contour lines are labelled. b, Cells are shaded according to the difference in power between the current study and a previously published study of 12,940 individuals, with yellow–white indicating a large increase in power and red indicating a small increase in power. The 20%, 40%, 60% and 80% contour lines are labelled.
Extended Data Fig. 2
Extended Data Fig. 2. Data quality control workflow.
A schematic of the steps involved in sample and variant quality control is shown. Quality control was conducted as described in the Methods to construct a set of samples and variants included in the association analysis. Each step is depicted as an arrow, with the number of samples or variants excluded by the step shown at the end of the arrow. The final set of samples and variants analysed are represented by the ‘Analysis’ dataset; we further excluded samples of high relatedness to other samples in the dataset from some but not all analyses. After each step that removed samples, we also removed newly monomorphic variants (hence the decrease in variants between the ‘Clean’ and ‘Analysis’ datasets).
Extended Data Fig. 3
Extended Data Fig. 3. Single-variant association analysis workflow.
A schematic of the steps involved in single-variant exome-sequencing association analysis is shown, as described in the Methods. We began analysis with a division of samples in the ‘Analysis’ dataset (leftmost column) into 25 different subgroups (second column from the left) based on cohort, ancestry and sequencing technology. We then filtered variants according to metrics computed separately for each subgroup; we applied the filters listed in the ‘Basic filters’ box to all subgroups and for some subgroups we applied additional (more stringent) filters as indicated by boxes in the third column from the left. The resulting number of variants and samples advanced for analysis in each subgroup are indicated in the fourth column from the left. We analysed each subgroup with both the EMMAX test (to measure association strength) and the Firth test (to measure allelic odds ratios), each of which are two-sided; the number of principal components included as covariates in the Firth test is shown in the fifth column from the left. Finally, we combined each of the EMMAX and Firth subgroup-level results using a 25-group meta-analysis to produce the final P values and odds ratios reported for each variant. Multi, variant is multiallelic; CR, call rate; P, variant subgroup-level P value; P(Fisher), variant subgroup-level P value from Fisher’s exact test; P(miss), P value for subgroup-level variant differential missingness between T2D cases and controls; P(HWE), P value for deviation from subgroup-level Hardy–Weinberg equilibrium; Alt GQ, mean genotype quality of non-reference genotypes (across all samples); X Chrom, variant is on X chromosome.
Extended Data Fig. 4
Extended Data Fig. 4. Calibration of single-variant analysis.
To assess whether our single-variant association statistics (two-sided, calculated by the EMMAX test) were well-calibrated, we computed quantile–quantile plots of associations across all samples (Overall) and within each ancestry (total n = 45,231 individuals). To avoid deflation of the quantile–quantile plot from rare variants (for which the expected P values are discrete rather than uniformly distributed), only variants with minor allele counts of 20 or greater (either overall or within the relevant ancestry) are shown. Variants were also LD-pruned before plotting, to avoid induced variance from correlated P values of these variants, using the ‘clump’ method implemented in PLINK. The λ values indicate genomic control, as measured by the ratio in observed median χ2 statistic to that expected under the null hypothesis. Red line, expectation of P values under the null distribution. Blue lines (and grey region), 95% confidence interval of expectations under the null distribution.
Extended Data Fig. 5
Extended Data Fig. 5. Gene-level association analysis workflow.
A schematic of the steps involved in gene-level exome-sequencing association analysis, as described in Methods, is shown. We began analysis with subgroup-level genotype filtering (second column from the left) of unrelated samples in the ‘Analysis’ dataset (leftmost column); we then applied genotype filters for each subgroup (filtering genotypes for either all or no samples in each subgroup), similar to those used in subgroup-level single-variant analyses. We then annotated each non-reference variant allele with 16 different bioinformatics algorithms to assess allele deleteriousness, and we grouped alleles into one of seven nested masks (third column from the left; the number of variants and weights shown correspond to alleles absent from ‘higher’, or more stringent, nested masks). We computed burden and SKAT analyses (both of which are two-sided) using one of two approaches to combine alleles across masks (Methods): first, by analysing all alleles at once with weights assigned according to the most stringent mask containing the allele (weighted test); and second, by analysing each mask independently and then calculating the lowest P value corrected for the effective number of tests (minimum P-value test). Multi, variant is multiallelic; CR, call rate; P(miss), P value for subgroup-level variant differential missingness between T2D cases and controls; P(HWE), P value for deviation from subgroup-level Hardy–Weinberg equilibrium; Alt GQ, mean genotype quality of non-reference genotypes (across all samples).
Extended Data Fig. 6
Extended Data Fig. 6. Calibration of gene-level association analyses.
For both the burden and SKAT tests, we tested for gene-level association within seven different allelic masks. As this produced seven P values for each test, we developed two means to consolidate these results (Methods). a, b, The quantile–quantile plots of associations are shown for the minimum P-value burden test (a) and the weighted burden test (b). Each test is two-sided and consists of n = 43,071 unrelated individuals. Only genes with combined minor allele count of 20 or greater are shown in the quantile–quantile plots, to avoid deflation from genes with too few variants to produce P values asymptotically uniform under the null distribution. The λ values indicate genomic control, as measured by the ratio in observed median χ2 statistic to that expected under the null hypothesis. The three genes with exome-wide significant associations are labelled. Red line, expectation of P values under the null distribution. Blue lines (and grey region), 95% confidence interval of expectations under the null distribution.
Extended Data Fig. 7
Extended Data Fig. 7. PPA calculation workflow.
a, We estimated the PPAs for nonsynonymous variants in our sequence analysis based on concordance with independent exome array data and previously published,, estimates of the fraction of causal coding associations (Methods). b, PPA estimates for nonsynonymous variants within T2D GWAS loci are shown as a function of P value (right y axis, black line; 95% confidence interval, grey shading) together with the total number of such variants (left y axis, red line). c, For variants outside of T2D GWAS loci, we developed a method to further compute Bayes factors, which measure the odds of true and causal association, as a function of P value, using a model of the prior odds of true and causal association for variants in GWAS loci (Methods). d, These Bayes factors can be combined with a subjective prior belief in the T2D relevance of a gene (y axis) to produce the estimated posterior probability of true and causal association for any nonsynonymous variant in the exome-sequence dataset based on its observed log10(P) (x axis). Posterior estimates are shaded proportional to value (red, low; white, high). Values are shown for the default modelling assumptions of 33% of missense variants that caused gene inactivation and 30% of true missense associations that represented the causal variant.
Extended Data Fig. 8
Extended Data Fig. 8. Estimated posterior probability of associations for different prior hypotheses.
We estimated the posterior probability of association for nonsynonymous variants that met various single-variant P-value thresholds (two-sided EMMAX test, n = 45,231 individuals) in our analysis, as described in the Methods and shown in Extended Data Fig. 7. To perform the needed calculations, we assumed that, on average, 1.1 genes that are found within each T2D GWAS locus are relevant to T2D and 33% of missense mutations within these genes cause gene loss-of-function. af, To assess the sensitivity of our analysis to these assumptions, we repeated the calculations with different assumptions of 0.5 (a), 2.0 (b), 0.25 (c) and 0.1 (d) T2D-relevant genes within each GWAS locus, as well as 25% (e) and 40% (f) of missense variants leading to loss-of-function. All analyses assume the default modelling parameters that 30% of true nonsynonymous associations are causal associations; different values for this parameter would scale posterior probability estimates linearly.

Comment in

References

    1. Altshuler D, Daly MJ, Lander ES. Genetic mapping in human disease. Science. 2008;322:881–888. doi: 10.1126/science.1156409. - DOI - PMC - PubMed
    1. Welter D, et al. The NHGRI GWAS Catalog, a curated resource of SNP–trait associations. Nucleic Acids Res. 2014;42:D1001–D1006. doi: 10.1093/nar/gkt1229. - DOI - PMC - PubMed
    1. Boyle EA, Li YI, Pritchard JK. An expanded view of complex traits: from polygenic to omnigenic. Cell. 2017;169:1177–1186. doi: 10.1016/j.cell.2017.05.038. - DOI - PMC - PubMed
    1. Grotz AK, Gloyn AL, Thomsen SK. Prioritising causal genes at type 2 diabetes risk loci. Curr. Diab. Rep. 2017;17:76. doi: 10.1007/s11892-017-0907-y. - DOI - PMC - PubMed
    1. Cirulli ET, Goldstein DB. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat. Rev. Genet. 2010;11:415–425. doi: 10.1038/nrg2779. - DOI - PubMed

Grants and funding