Nat Genet. 2024 Sep;56(9):1811-1820.
doi: 10.1038/s41588-024-01894-5. Epub 2024 Aug 29.

Rare coding variant analysis for human diseases across biobanks and ancestries

Sean J Jurgens et al. Nat Genet. 2024 Sep.

Abstract

Large-scale sequencing has enabled unparalleled opportunities to investigate the role of rare coding variation in human phenotypic variability. Here, we present a pan-ancestry analysis of sequencing data from three large biobanks, including the All of Us research program. Using mixed-effects models, we performed gene-based rare variant testing for 601 diseases across 748,879 individuals, including 155,236 with ancestry dissimilar to European. We identified 363 significant associations, which highlighted core genes for the human disease phenome and identified potential novel associations, including UBR3 for cardiometabolic disease and YLPM1 for psychiatric disease. Pan-ancestry burden testing represented an inclusive and useful approach for discovery in diverse datasets, although we also highlight the importance of ancestry-specific sensitivity analyses in this setting. Finally, we found that effect sizes for rare protein-disrupting variants were concordant between samples similar to European ancestry and other genetic ancestries (βDeming = 0.7-1.0). Our results have implications for multi-ancestry and cross-biobank approaches in sequencing association studies for human disease.

Conflict of interest statement

Competing Interests

P.T.E. has received sponsored research support from IBM Health, Bayer AG, Bristol Myers Squibb, and Pfizer; he has consulted for Bayer AG. S.A.L. is a full-time employee of Novartis Institutes of BioMedical Research as of July 18, 2022. S.A.L. previously received sponsored research support from Bristol Myers Squibb, Pfizer, Boehringer Ingelheim, Fitbit, Medtronic, Premier, and IBM, and has consulted for Bristol Myers Squibb, Pfizer, Blackstone Life Sciences, and Invitae. P.N. has received sponsored research support from Amgen, Apple, Boston Scientific, Novartis, and AstraZeneca, and personal fees from Apple, AstraZeneca, Genentech / Roche, Novartis, Allelica, Foresite Labs, Blackstone Life Sciences, and HeartFlow; is a scientific advisory board member of Esperion Therapeutics, geneXwell, and TenSixteen Bio; is a scientific co-founder of TenSixteen Bio; and reports spousal employment at Vertex, all unrelated to the present work. B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. The remaining authors declare no competing interests.

Figures

Figure 1. Study overview for rare variant discovery across human disease.
Three studies were included in the analysis: All of Us (AoU) with whole-genome sequencing data, UK Biobank (UKB) with exome sequencing data, and Mass General Brigham Biobank (MGB) with exome sequencing data. Over 600 disease Phecodes were identified using a hierarchical clustering algorithm. Disease Phecodes were analyzed using exome-wide gene-based testing of rare genetic variants with three masks (LOF, LOF+missense, and ultra-rare missense), after which the mask P-values were combined into a single P-value per gene-disease pair using the Cauchy distribution.
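The caption above describes combining the per-mask P-values into one omnibus P-value using the Cauchy distribution. A minimal sketch of a Cauchy combination under equal weights is given below. The function name `cauchy_combine`, the equal default weights, and the example P-values are assumptions made for illustration and are not taken from the paper.

```python
import math

def cauchy_combine(p_values, weights=None):
    """Combine P-values with a Cauchy combination test.

    Assumption: equal weights unless others are supplied. Each P-value
    must lie strictly between 0 and 1.
    """
    if weights is None:
        weights = [1.0] * len(p_values)
    total = sum(weights)
    # Map each P-value to a standard Cauchy variate; a small P-value
    # becomes a large positive value. Take the weighted average.
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, p_values)) / total
    # Upper-tail probability of the standard Cauchy distribution at t.
    return 0.5 - math.atan(t) / math.pi

# Hypothetical P-values for three masks of one gene-disease pair.
mask_p = [1e-6, 0.5, 0.8]
omnibus_p = cauchy_combine(mask_p)
```

With a single input the function returns that P-value unchanged, and one very small P-value dominates the combined result, which is the behavior that makes this combination convenient for correlated masks.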
Figure 2. Multi-ancestry meta-analysis of rare genetic variation across three sequenced biobanks in over 750,000 individuals identifies 363 rare variant associations.
Panel a shows a stacked bar chart with the proportion of each continental ancestry on the y-axis and dataset on the x-axis. Ancestral diversity was largest in AoU. Panel b is a violin plot with overlaid boxplot showing the prevalence of Phecodes on the y-axis and each dataset on the x-axis. Plotted Phecodes were those included in the analysis with at least 50 cases in each dataset (N=546). Panel c is a stacked bar chart showing the number of identified disease associations (Cauchy Q<0.01) on the y-axis and each dataset, as well as the meta-analysis results, on the x-axis. Bars are stacked by the class of mask that yielded the lowest P-value (from LOF masks, LOF+missense masks, and ultra-rare missense masks). Panel d is a multi-trait gene-based Manhattan plot highlighting results from the overall meta-analysis, each dot representing one gene-trait test, with the -log10 of the Cauchy P-value on the y-axis and different disease categories on the x-axis. For disease categories with strong associations, the top three non-redundant genes are annotated with the gene names. Panel e is a violin plot with overlaid boxplot showing the distribution of inflation factors by phenotype (λ estimated at the 95th percentile) on the y-axis and different rare variant masks on the x-axis, as well as the distribution for the Cauchy combination results (on the far right). Dotted lines show the 0.75 and 1.25 cutoffs for inflation factor on the y-axis. The number of phenotypes is 601 in all violins. Similarly, panel f shows the distribution of inflation factors by gene across the different masks and for the Cauchy combination results (on the far right), where the number of genes equals 14,388, 15,529, 17,809, 16,742, 15,462, 18,238, and 18,456.
Note: Cauchy P-values represent the omnibus P-value of all masks for a gene-phecode pair (unadjusted for multiple testing) after combining them using the Cauchy distribution. The Cauchy Q-values represent the Benjamini-Hochberg FDR adjustments of these Cauchy P-values. P-values for mask-phecode pairs (prior to the Cauchy combination) were derived from Z-score-based meta-analyses of score tests from logistic mixed-effects models with saddle-point-approximation. All statistical tests and P-values are two-sided. All boxplots show median (center), 25th percentile (bottom of box), 75th percentile (top of box), smallest/largest value within 1.5*interquartile range from hinge (bottom/top whiskers, respectively), and data points outside of this range (dots). Abbreviations: EUR, European ancestry; AFR, African ancestry; AMR, Admixed American ancestry; SAS, South-Asian ancestry; EAS, East-Asian ancestry; UND, undefined ancestry; LOF, loss-of-function.
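Panels e and f report inflation factors estimated at the 95th percentile. One common way to compute such a factor is to divide the observed 95th-percentile chi-square statistic (1 degree of freedom) by its expected value under the null. The sketch below assumes that definition; the caption does not spell out the exact estimator, and the function name and the linear-interpolation quantile are illustrative choices.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def inflation_factor(p_values, quantile=0.95):
    """Ratio of the observed to the expected chi-square (1 df) quantile.

    Assumption: two-sided P-values strictly greater than 0 and at most 1.
    """
    # A two-sided P-value corresponds to chi-square = z**2 with z = Phi^-1(p / 2).
    chi2 = sorted(_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values)
    # Empirical quantile with linear interpolation between order statistics.
    pos = quantile * (len(chi2) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(chi2) - 1)
    observed = chi2[lo] + (pos - lo) * (chi2[hi] - chi2[lo])
    # Expected chi-square (1 df) quantile under the null hypothesis.
    expected = _STD_NORMAL.inv_cdf((1.0 + quantile) / 2.0) ** 2
    return observed / expected

# Evenly spaced P-values mimic a well-calibrated test, so lambda is close to 1.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lam = inflation_factor(uniform_p)
```

Values well above 1 would suggest inflated test statistics and values well below 1 deflated ones, which is what the 0.75 and 1.25 cutoffs in the panels are meant to flag.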
Figure 3. Assessment of bias from inclusion of non-European samples among the significant associations.
Panel a shows a scatter plot with each dot representing a gene-phecode pair that reached test-wide significance in our primary analysis (Q<0.01), with -log10(PCauchy) from the primary analysis on the x-axis, and the -log10(PCauchy) derived from a European ancestry sensitivity analysis on the y-axis (both log-transformed for clarity). Specific cutoffs on the y-axis are highlighted using dotted lines. Any strong deviation of P-values could indicate bias in our multi-ancestry approach, or alternatively indicate markedly lower power among European samples. No associations were abolished when restricting to European samples. There were 6 additional strongly attenuated genes (0.05>PEUR>0.0005). Among these, several represent known gene-disease links (Supplementary Note). Panel b shows a scatter plot with the effect sizes for significant associations from the primary analysis on the x-axis, with the effect sizes from European-only sensitivity analyses on the y-axis. The effect size for the most significant mask is plotted for each gene-phecode pair, restricting to masks that had adequate allele counts in both the primary analysis and in the sensitivity analysis (cMAC≥20). Any large deviations from the dotted line (x=y) indicate bias from our multi-ancestry approach. Strikingly, no strong deviations of effect sizes were observed in this sensitivity analysis. For 8 associations there were insufficient alleles among European ancestry samples to compute an effect size, although these represented well-known gene-disease links (Supplementary Note). Taken together, these results show that the bias from inclusion of non-European samples was not substantial.
Note: Bias is defined here as the spurious change in effect sizes / test statistics that is caused by inclusion of multiple ancestries but is not caused by true biological differences. Cauchy P-values represent the omnibus P-value of all masks for a gene-phecode pair (unadjusted for multiple testing) after combining them using the Cauchy distribution. The Cauchy Q-values represent the Benjamini-Hochberg FDR adjustments of these P-values. P-values for mask-phecode pairs (prior to the Cauchy combination) were derived from Z-score-based meta-analyses of score tests from logistic mixed-effects models with saddle-point-approximation. All statistical tests and P-values are two-sided. ORs were estimated using inverse-variance-weighted meta-analysis of two-sided Firth’s logistic regression results. Abbreviations: ALL, all ancestry individuals; EUR, European ancestry individuals; OR, odds ratio.
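The Q-values referred to in this caption are Benjamini-Hochberg FDR adjustments of the Cauchy P-values. The following is a minimal sketch of the standard Benjamini-Hochberg step-up adjustment; the function name and the example P-values are hypothetical and are not results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted values (Q-values) in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that Q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Hypothetical omnibus P-values for four gene-phecode pairs.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)  # approximately [0.02, 0.04, 0.04, 0.02]
```

A pair would then count as significant at the threshold used in the figure when its Q-value falls below 0.01.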
Figure 4. Large genetic effect sizes and pleiotropic associations identify core genes for the human disease phenome.
Panel a shows stacked bar charts for all genes from the meta-analysis that showed at least 3 associations, with the number of associations on the y-axis and gene on the x-axis. Bars are stacked by the class of the best mask (LOF, LOF+missense masks, or ultra-rare missense masks) for each gene-trait association. Panel b represents grouped boxplots showing rare variant effect size distributions per Phecode category, with log-scaled ORs on the y-axis and categories on the x-axis. The figure is restricted to gene-trait associations reaching Cauchy Q<0.01 and to rare variant masks with P<2.6×10⁻⁶. Per category, only masks with at least 7 associations are shown (and therefore some categories do not show all masks and not all categories are plotted). In box plots, the number of contributing associations from left to right equals 53, 44, 23, 15, 9, 16, 37, 30, 9, 15, 8, 22, 32, 54, 58, 18, 9, 9, and 8. All boxplots show median (center), 25th percentile (bottom of box), 75th percentile (top of box), smallest/largest value within 1.5*interquartile range from hinge (bottom/top whiskers, respectively), and data points outside of this range (dots). Panel c represents a multiple jittered lollipop chart showing rare variant effect sizes for each Phecode category. The x-axis shows the log-scaled OR with each dot representing an association (restricting to gene-trait associations with Cauchy Q<0.01 and rare variant masks with P<2.6×10⁻⁶), and Phecode categories on the y-axis. Horizontal lines start at 1 and end at the largest estimated effect size within the category. Dots are colored by class of rare variant mask. Select genes are annotated within each category to highlight large-effect size genes for the respective category.
Note: In all panels, ORs were estimated using inverse-variance-weighted meta-analysis of two-sided Firth’s logistic regression results, while mask-phecode P-values were estimated from Z-score-based meta-analysis of score tests from logistic mixed-effects models with saddle-point-approximation. Cauchy P-values represent the omnibus P-value of all masks for a gene-phecode pair after combining them using the Cauchy distribution (unadjusted for multiple testing), while the Cauchy Q-values represent the Benjamini-Hochberg FDR adjustments of these P-values. All statistical tests and P-values are two-sided. Abbreviations: LOF, loss-of-function; OR, odds ratio.
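The note above distinguishes two meta-analysis schemes: inverse-variance weighting for effect sizes and a Z-score-based combination for P-values. Both can be sketched in a few lines. The function names are hypothetical, and weighting the Z-scores by the square root of each study's sample size is an assumption, since the caption does not state which weights were used.

```python
import math

def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted estimate of a log odds ratio."""
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se

def z_score_meta(z_scores, sample_sizes):
    """Weighted Z-score meta-analysis with a two-sided P-value.

    Assumption: weights are the square roots of the per-study sample sizes.
    """
    weights = [math.sqrt(n) for n in sample_sizes]
    z = sum(w * zi for w, zi in zip(weights, z_scores)) / math.sqrt(
        sum(w ** 2 for w in weights))
    # Two-sided normal tail probability: P = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Hypothetical per-study inputs for one mask-phecode pair.
beta, se = ivw_meta([0.0, 1.0], [1.0, 0.5])      # beta = 0.8, se = sqrt(0.2)
odds_ratio = math.exp(beta)
z, p = z_score_meta([2.0, 2.0], [100, 100])      # z = 2 * sqrt(2)
```

The inverse-variance estimate needs an effect size and standard error from every study, whereas the Z-score approach only needs a signed test statistic, which is why score tests are often combined this way.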
Figure 5. Effect sizes of rare coding variants for disease correlate between genetic European and other genetic ancestries.
Figure 5 shows scatter plots with the effect sizes from European-ancestry analyses on the x-axis and the respective effect sizes estimated among individuals dissimilar to European ancestry on the y-axis. In each figure panel, a three-sample design was applied: significant mask-disease pairs were identified from a EUR meta-analysis of UKB and MGB (significance determined at P<2.6×10⁻⁶), after which those mask-disease pairs were assessed within different ancestry groups from the AoU dataset. Each panel shows effect sizes (i.e., log[OR]) for EUR analysis on the x-axis and effect sizes from other ancestries on the y-axis; the left panels show EUR versus all non-European samples, the middle panels show EUR versus African ancestry samples, and the right panels show EUR versus Admixed-American samples. Part a shows results for rare LOF variant masks with at least 20 carriers in both ancestry assignments, while part b shows results for ultra-rare missense0.5 variant masks with at least 20 carriers in both ancestry assignments. Linear trend lines from error-in-variable total-least-squares Deming regression are added to the plots. Statistics from Deming regression, including estimated β [95%CI] and P-values, are added in text in the top left corners. A regression coefficient (βsens) and 95%CI is also provided in the bottom right corners, showing results from a combined sensitivity analysis where genes associated with age or leukemic outcomes are removed, and where analyses are adjusted for quantiles of effective sample size (Supplementary Table 14). Note: All ORs were estimated using Firth’s logistic regression models among unrelated participants. Deming regression was run using beta coefficients and their standard errors, making the analysis comparable to York regression with the assumption of uncorrelated errors. Standard errors were computed using Jackknife estimators. All statistical tests and P-values are two-sided.
Abbreviations: LOF, loss-of-function; OR, odds ratio; EUR, European ancestry; AFR, African ancestry; AMR, Admixed American ancestry; nonEUR, defined ancestry other than European; sens, sensitivity analysis.
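The Deming regression used for the trend lines accounts for estimation error in both the x-axis and y-axis effect sizes. The sketch below shows the classical closed-form Deming slope under a simplifying assumption: a single known ratio `delta` of the y-error variance to the x-error variance. It does not reproduce the per-estimate standard-error weighting or the jackknife standard errors described in the caption, and the function name and example values are hypothetical.

```python
import math

def deming_fit(xs, ys, delta=1.0):
    """Closed-form Deming regression; returns (slope, intercept).

    Assumption: delta is the ratio of the y-error variance to the x-error
    variance, and delta = 1 gives orthogonal (total least squares) regression.
    The covariance between xs and ys must be non-zero.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    s_yy = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    diff = s_yy - delta * s_xx
    slope = (diff + math.sqrt(diff ** 2 + 4.0 * delta * s_xy ** 2)) / (2.0 * s_xy)
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical log(OR) pairs lying exactly on y = 2x + 1.
slope, intercept = deming_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

A slope near 1 would indicate that effect sizes in the two ancestry groups agree in magnitude, while ordinary least squares would tend to pull the slope toward 0 when the x-axis estimates are noisy.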
