Genome Biol. 2025 Mar 10;26(1):50.
doi: 10.1186/s13059-025-03518-5.

Trans-ancestral rare variant association study with machine learning-based phenotyping for metabolic dysfunction-associated steatotic liver disease

Robert Chen et al. Genome Biol.

Abstract

Background: Genome-wide association studies (GWAS) have identified common variants associated with metabolic dysfunction-associated steatotic liver disease (MASLD). However, rare coding variant studies have been limited by phenotyping challenges and small sample sizes. We tested associations of rare and ultra-rare coding variants with proton density fat fraction (PDFF) and MASLD case-control status in 736,010 participants of diverse ancestries from the UK Biobank, All of Us, and BioMe and performed a trans-ancestral meta-analysis. We then developed models to accurately predict PDFF and MASLD status in the UK Biobank and tested associations with these predicted phenotypes to increase statistical power.

Results: The trans-ancestral meta-analysis with PDFF and MASLD case-control status identifies two single variants and two gene-level associations in APOB, CDH5, MYCBP2, and XAB2. Association testing with predicted phenotypes, which replicates more known genetic variants from GWAS than true phenotypes, identifies 16 single variants and 11 gene-level associations implicating 23 additional genes. Two variants were polymorphic only among African ancestry participants and several associations showed significant heterogeneity in ancestry and sex-stratified analyses. In total, we identified 27 genes, of which 3 are monogenic causes of steatosis (APOB, G6PC1, PPARG), 4 were previously associated with MASLD (APOB, APOC3, INSR, PPARG), and 23 had supporting clinical, experimental, and/or genetic evidence.

Conclusions: Our results suggest that trans-ancestral association analyses can identify ancestry-specific rare and ultra-rare coding variants in MASLD pathogenesis. Furthermore, we demonstrate the utility of machine learning in genetic investigations of difficult-to-phenotype diseases in trans-ancestral biobanks.

Keywords: Genetic association studies; Machine learning; Metabolic dysfunction-associated steatotic liver disease.
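
The abstract does not say how the per-biobank results were combined. As an illustration only, the sketch below shows a fixed-effect inverse-variance meta-analysis, one common way to pool effect estimates across cohorts. The function name, inputs, and example numbers are hypothetical and are not taken from the study.

```python
import math


def inverse_variance_meta(betas, ses):
    """Pool per-cohort effect estimates with fixed-effect inverse-variance weights.

    betas: per-cohort effect estimates (hypothetical inputs)
    ses:   matching standard errors, all strictly positive
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and of equal length")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")

    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se ** 2) for se in ses]
    total_weight = sum(weights)

    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se

    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


if __name__ == "__main__":
    # Hypothetical estimates for three cohorts (illustrative values only).
    beta, se, z, p = inverse_variance_meta([0.30, 0.25, 0.40], [0.10, 0.15, 0.20])
    print(f"beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.2e}")
```

A fixed-effect model assumes a single shared effect across cohorts. The ancestry and sex heterogeneity reported in the Results is the kind of signal that would motivate stratified or random-effects analyses instead.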

Conflict of interest statement

Declarations. Ethics approval and consent to participate: In all three biobanks accessed in this study (UK Biobank, All of Us, BioMe), participants voluntarily enrolled and gave informed electronic consent. We accessed UK Biobank data under application ID 16218 and All of Us data (Controlled Tier version 7) under workspace aou-rw-75979bcb. The Institutional Review Board at the Icahn School of Medicine at Mount Sinai approved BioMe access (GCO no. 07–0529; STUDY-11–01139). Competing interests: R.D. reported being a scientific co-founder, consultant and equity holder for Pensieve Health (pending) and being a consultant for Variant Bio and Character Bio. M.B. receives grant support from Pfizer and Histoindex and serves as a consultant for Madrigal, Intercept, Fibronostics, NOVONordisk, GSK, and The Kinetix Group. All other authors have no competing interests to disclose.

Figures

Fig. 1
Study workflow. Abbreviations: PDFF, proton density fat fraction; MRI, magnetic resonance imaging; PTV, protein-truncating variant
Fig. 2
Variants and genes identified by true phenotypes across the UK Biobank, All of Us, and BioMe. A, B Manhattan plots showing associations from single variant testing (A) and gene-level testing (B). The red dashed lines represent thresholds for exome-wide significance (A; 4.3 × 10⁻⁷) and Bonferroni significance (B; 2.7 × 10⁻⁶). C, D Single variant associations, labeled as chromosome:position:reference allele:effect allele (gene) (effect direction), and gene-level associations, labeled as gene (effect direction), with MASLD-related measurements (C) as well as outcomes and risk factors (D). Only associations with p < 0.05 are shown. For most measurements and all outcomes and risk factors, associations represent a meta-analysis of results from UK Biobank, All of Us, and BioMe. The color of each box indicates the effect size and direction (Z score); for display purposes, Z scores are capped at 10 and −10. Complete data are available in Additional file 1: Tables S9–S10. Abbreviations: CRP, C-reactive protein; ALP, alkaline phosphatase; ALT, alanine aminotransferase; AST, aspartate aminotransferase; GGT, gamma-glutamyltransferase; SHBG, sex hormone binding globulin
Fig. 3
Construction of machine learning-predicted phenotypes in the UK Biobank. A 2D histogram comparing predicted log(PDFF) at the imaging visit with true log(PDFF) at the imaging visit. B 2D histogram comparing predicted log(PDFF) at the baseline visit with true log(PDFF) at the imaging visit. C Top 20 features for the PDFF prediction model as determined by SHapley Additive exPlanations (SHAP) analysis. The color of each bar reflects Spearman’s ρ between feature values and SHAP values. D Genetic correlations between true and predicted phenotypes in the UK Biobank. “Predicted PDFF (subset)” represents testing of predicted PDFF only among participants with true PDFF measurements. Abbreviations: SHBG, sex hormone binding globulin
Fig. 4
Variants and genes identified by predicted phenotypes in the UK Biobank. A, B Manhattan plots showing associations from single variant testing (A) and gene-level testing (B). The red dashed lines represent thresholds for exome-wide significance (A; 4.3 × 10⁻⁷) and Bonferroni significance (B; 2.7 × 10⁻⁶). Genes and variants marked with an asterisk did not pass post hoc filtering (i.e., nominal association with a true phenotype or with both liver enzymes and metabolic dysfunction markers) and were not further analyzed
Fig. 5
MASLD-related trait associations for predicted phenotype variants and genes. A–D Single variant associations, labeled as chromosome:position:reference allele:effect allele (gene) (effect direction), and gene-level associations, labeled as gene (effect direction), with MASLD-related measurements (A, B) as well as outcomes and risk factors (C, D). Only associations with p < 0.05 are shown. For most phenotypes, associations represent a meta-analysis of results from UK Biobank, All of Us, and BioMe. The color of each box indicates the effect size and direction (Z score). Complete data are available in Additional file 1: Tables S9–S10. A, C Single variants and genes positively associated with predicted MASLD/PDFF. B, D Single variants and genes negatively associated with predicted MASLD/PDFF. Abbreviations: CRP, C-reactive protein; ALP, alkaline phosphatase; ALT, alanine aminotransferase; AST, aspartate aminotransferase; GGT, gamma-glutamyltransferase; SHBG, sex hormone binding globulin
Fig. 6
Supporting evidence for MASLD-associated genes. Clinical, experimental, and genetic evidence for genes identified in this study. Genes in tiers 1, 2, 3, and 4 have ≥ 4, 3, 2, or ≤ 1 sources of evidence, respectively. For clinical phenotypes, “MASLD” includes monogenic causes of steatosis (e.g., abetalipoproteinemia, familial partial lipodystrophy, glycogen storage disease 1a). For clinical phenotypes, animal models, and genetic associations, “MASLD-related phenotype” includes metabolic dysfunction phenotypes (e.g., metabolic syndrome, type 2 diabetes). Complete data are available in Additional file 1: Table S24
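
Two rules stated in the figure captions are simple enough to write down directly: the evidence tiers in Fig. 6 (≥ 4, 3, 2, or ≤ 1 sources of evidence) and the display capping of Z scores at 10 and −10 in Fig. 2. The following is a minimal sketch of both. The function names and the example gene labels and counts are hypothetical, and the study's own code may differ.

```python
def evidence_tier(n_sources):
    """Map a count of evidence sources to a tier, per the Fig. 6 caption.

    Tier 1: >= 4 sources, tier 2: 3 sources, tier 3: 2 sources, tier 4: <= 1 source.
    """
    if n_sources < 0:
        raise ValueError("number of evidence sources cannot be negative")
    if n_sources >= 4:
        return 1
    if n_sources == 3:
        return 2
    if n_sources == 2:
        return 3
    return 4


def cap_z(z, limit=10.0):
    """Clamp a Z score to [-limit, limit] for display, as described for Fig. 2."""
    return max(-limit, min(limit, z))


if __name__ == "__main__":
    # Hypothetical evidence counts; these are not values from Table S24.
    example_counts = {"GENE_A": 5, "GENE_B": 3, "GENE_C": 1}
    for gene, count in example_counts.items():
        print(gene, "tier", evidence_tier(count))

    # Capping changes only the plotted color scale, not the underlying statistic.
    print([cap_z(z) for z in (-14.2, -3.0, 0.5, 22.7)])
```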
