Clin Transl Sci. 2025 Jan;18(1):e70105.
doi: 10.1111/cts.70105.

Rapid identification and phenotyping of nonalcoholic fatty liver disease patients using a machine-based approach in diverse healthcare systems

Anna O Basile et al. Clin Transl Sci. 2025 Jan.

Abstract

Nonalcoholic fatty liver disease (NAFLD) is the most common global cause of chronic liver disease and remains under-recognized within healthcare systems. Therapeutic interventions are rapidly advancing for its inflammatory phenotype, nonalcoholic steatohepatitis (NASH), at all stages of disease. Diagnosis codes alone fail to recognize and stratify at-risk patients accurately. Our work aims to rapidly identify NAFLD patients within large electronic health record (EHR) databases for automated stratification and targeted intervention based on clinically relevant phenotypes. We present a rule-based phenotyping algorithm for efficient identification of NAFLD patients, developed using EHRs from 6.4 million patients at Columbia University Irving Medical Center (CUIMC) and validated at two independent healthcare centers. The algorithm uses the Observational Medical Outcomes Partnership (OMOP) Common Data Model and queries structured and unstructured data elements, including diagnosis codes, laboratory measurements, and radiology and pathology modalities. Our approach identified 16,006 CUIMC NAFLD patients, of whom 10,753 (67%) were previously unidentifiable by NAFLD diagnosis codes. Fibrosis scoring on patients without histology identified 943 subjects with scores indicative of advanced fibrosis (FIB-4, APRI, NAFLD-FS). The algorithm was validated at two independent healthcare systems, University of Pennsylvania Health System (UPHS) and Vanderbilt Medical Center (VUMC), where 20,779 and 19,575 NAFLD patients were identified, respectively. Clinical chart review identified a high positive predictive value (PPV) across all healthcare systems (91% at CUIMC, 75% at UPHS, and 85% at VUMC) and a sensitivity of 79.6%. Our rule-based algorithm provides an accurate, automated approach for rapidly identifying, stratifying, and sub-phenotyping NAFLD patients within a large EHR system.
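The headline figures in the abstract are simple proportions. As a minimal sketch (not the authors' code), the snippet below recomputes the share of CUIMC patients missed by diagnosis codes from the two counts given in the abstract, and shows how a PPV would be derived from chart review. The abstract does not report the chart review sample sizes, so the counts passed to `positive_predictive_value` in the usage lines are hypothetical.

```python
def positive_predictive_value(confirmed: int, reviewed: int) -> float:
    """PPV = chart-confirmed cases / algorithm-flagged cases that were reviewed."""
    if reviewed <= 0:
        raise ValueError("at least one reviewed chart is required")
    if not 0 <= confirmed <= reviewed:
        raise ValueError("confirmed must lie between 0 and reviewed")
    return confirmed / reviewed


# Counts taken from the abstract (CUIMC cohort).
identified_by_algorithm = 16_006
missed_by_diagnosis_codes = 10_753
share_missed = missed_by_diagnosis_codes / identified_by_algorithm

# Hypothetical chart review: 100 flagged charts, 91 confirmed as NAFLD.
example_ppv = positive_predictive_value(confirmed=91, reviewed=100)

print(f"share missed by codes: {share_missed:.0%}")
print(f"example PPV: {example_ppv:.0%}")
```

PPV depends on the prevalence of true NAFLD among flagged patients at each site, which may be one reason the reported values differ between CUIMC, UPHS, and VUMC.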

Conflict of interest statement

Patent for algorithm to Columbia University Trustees; © 2021 The Trustees of Columbia University in the City of New York. The owner has no objection to reproduction of the work for academic non‐commercial purposes, but otherwise reserves all copyright rights whatsoever. J.W., N.P.T. and A.O.B. are co‐inventors. J.W. has received research support from Janssen, Galectin, Intercept, Genfit, Shire, Conatus, Zydus, AMRA, and is on advisory boards for GlaxoSmithKline and AstraZeneca. R.M.C. has received research support from Intercept Pharmaceuticals and Merck, Inc., and consulting fees from Intercept and AstraZeneca. M.S. is funded by NIDDK R01DK132138, R01DK131547 and has an unrestricted grant from Grifols, SA. A.K. is a speaker and proctor for Intuitive, a reviewer for surgical videos for Crowd Sourced Assessment of Technical Skills (CSATs), and a consultant for Johnson and Johnson and Surgical Specialties Corporation. G.R. retired from Janssen Pharma R&D as Scientific Fellow and Head of Computational Sciences and is currently a Venture Partner in Samsara BioCapital, Palo Alto, CA. All other authors declared no competing interests for this work.

Figures

FIGURE 1
Illustration of the NAFLD algorithm development and validation process. The algorithm was developed at Columbia University Irving Medical Center (CUIMC) by clinical and informatics teams. Clinical criteria for the algorithm, provided by medical experts, were used by the bioinformatics team to design queries that produced a set of potential NAFLD patients. The charts of these patients were reviewed by the clinical team to determine true NAFLD status. Clinical criteria, as represented in the EHR system, were adjusted based on chart review results, and the queries were refined. This process was repeated across each step of the algorithm until high accuracy was achieved. Once high accuracy was achieved, the queries were used to code the algorithm. Algorithmic validation was performed at the University of Pennsylvania Health System (UPHS) and Vanderbilt Medical Center (VUMC), where the iterative process described above was repeated by clinical and informatics experts at each site. The final output of the algorithm is a list of NAFLD patients along with clinical characteristics (subset depicted above). A1c, glycated hemoglobin; DOB, date of birth; EHR, electronic health record; N, no; T2D, type 2 diabetes; Y, yes. Stock images for this figure are from BioRender.com.
FIGURE 2
Three main steps of the NAFLD algorithm. Step 1a lists the NAFLD risk indicator categories that were used for patient identification and lists a few examples of selection criteria. A complete list of codes for selection or exclusion criteria can be found in the following Supplementary Tables: Step 1a (Table S1), Step 1b (Table S2), Step 2 (Table S3), and Step 3 (Tables S4 and S5). dx = diagnosis code (International Classification of Diseases, Ninth and Tenth Revision (ICD9/10)).
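The three steps can be pictured as successive filters over a patient list. The sketch below is an illustration under stated assumptions, not the published implementation: the `Patient` class, the placeholder code names, and the boolean `steatosis_confirmed` flag are all hypothetical, and the real selection and exclusion codes are the ones listed in Supplementary Tables S1 to S5.

```python
from dataclasses import dataclass

# Placeholder code sets (hypothetical); the actual codes are in Tables S1-S5.
INCLUSION_CODES = frozenset({"EXAMPLE_RISK_DX"})
EXCLUSION_CODES = frozenset({"EXAMPLE_EXCLUSION_DX"})


@dataclass(frozen=True)
class Patient:
    patient_id: str
    codes: frozenset = frozenset()
    # Stand-in for evidence of hepatic steatosis (e.g., from radiology or pathology).
    steatosis_confirmed: bool = False


def run_pipeline(patients):
    """Apply inclusion, exclusion, and verification filters in order.

    Returns the per-stage patient counts and the final verified list.
    """
    included = [p for p in patients if p.codes & INCLUSION_CODES]
    retained = [p for p in included if not (p.codes & EXCLUSION_CODES)]
    verified = [p for p in retained if p.steatosis_confirmed]
    counts = {
        "inclusion": len(included),
        "exclusion": len(retained),
        "verification": len(verified),
    }
    return counts, verified


cohort = [
    Patient("a", frozenset({"EXAMPLE_RISK_DX"}), True),
    Patient("b", frozenset({"EXAMPLE_RISK_DX", "EXAMPLE_EXCLUSION_DX"}), True),
    Patient("c", frozenset({"EXAMPLE_RISK_DX"}), False),
    Patient("d"),
]
stage_counts, final_patients = run_pipeline(cohort)
print(stage_counts)
```

Keeping a count after every filter mirrors the per-stage reporting used in the figures and makes it easy to see where patients drop out.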
FIGURE 3
Counts of patients at each stage of the algorithm. Data from Columbia University Irving Medical Center (CUIMC) are in blue, data from University of Pennsylvania Health System (UPHS) are in pink/purple, and data from Vanderbilt Medical Center (VUMC) are in orange.
FIGURE 4
Proportion of patients with select risk indicator diagnoses among those with one (n = 2245), two (n = 943), or three (n = 204) elevated fibrosis scores (APRI, NAFLD‐FS, FIB‐4).
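For readers unfamiliar with the three scores, the sketch below implements their standard published formulas and counts how many are elevated for one patient. It is not taken from the paper: the cutoffs in `ASSUMED_CUTOFFS` are commonly cited values and the default AST upper limit of normal (40 U/L) is a typical laboratory value, both assumptions here, since this page does not state which thresholds the authors used. The patient values in the usage lines are hypothetical.

```python
import math

# Assumed cutoffs for "elevated" (commonly cited; not stated in the abstract).
ASSUMED_CUTOFFS = {"fib4": 2.67, "apri": 1.0, "nafld_fs": 0.676}


def fib4(age_years, ast_u_l, alt_u_l, platelets_10e9_l):
    """FIB-4 = (age x AST) / (platelets x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_10e9_l * math.sqrt(alt_u_l))


def apri(ast_u_l, platelets_10e9_l, ast_upper_limit_u_l=40.0):
    """APRI = (AST / AST upper limit of normal) / platelets x 100."""
    return (ast_u_l / ast_upper_limit_u_l) / platelets_10e9_l * 100.0


def nafld_fs(age_years, bmi, has_ifg_or_diabetes, ast_u_l, alt_u_l,
             platelets_10e9_l, albumin_g_dl):
    """NAFLD fibrosis score (albumin in g/dL, platelets in 10^9/L)."""
    return (-1.675
            + 0.037 * age_years
            + 0.094 * bmi
            + 1.13 * int(has_ifg_or_diabetes)
            + 0.99 * (ast_u_l / alt_u_l)
            - 0.013 * platelets_10e9_l
            - 0.66 * albumin_g_dl)


def count_elevated(scores, cutoffs=ASSUMED_CUTOFFS):
    """Number of scores strictly above their (assumed) cutoff."""
    return sum(1 for name, value in scores.items() if value > cutoffs[name])


# Hypothetical patient values, for illustration only.
scores = {
    "fib4": fib4(60, 80, 64, 150),
    "apri": apri(80, 150),
    "nafld_fs": nafld_fs(60, 30.0, True, 80, 64, 150, 4.0),
}
n_elevated = count_elevated(scores)
print(scores, n_elevated)
```

Grouping patients by `n_elevated` (one, two, or three) would give the kind of strata compared in the figure.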
FIGURE 5
Sensitivity at Columbia University Irving Medical Center (CUIMC) after each stage of the algorithm. Sensitivity was assessed across three categories: (1) using all patients within the NAFLD registry maintained by CUIMC hepatology (blue, n = 167), (2) restricting to patients with complete data within the clinical data warehouse (red, n = 147), and (3) restricting to patients with physician‐validated exclusion codes (orange, n = 137). “Inclusion” refers to the identification of NAFLD patients. “Exclusion” is the removal of patients meeting exclusion criteria. “Verification” refers to the verification of hepatic steatosis, the final step of our algorithm (see Figure 2).
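Per-stage sensitivity of this kind can be computed by comparing a reference registry against the set of patients the algorithm still retains after each stage. The sketch below assumes that setup; the function name, the use of plain ID sets, and the example IDs are hypothetical and are not the authors' implementation.

```python
def sensitivity_by_stage(registry_ids, retained_by_stage):
    """Fraction of registry patients still captured after each stage.

    registry_ids: set of known NAFLD patient IDs (the reference standard).
    retained_by_stage: mapping of stage name -> set of IDs retained by the
        algorithm after that stage.
    """
    if not registry_ids:
        raise ValueError("registry must contain at least one patient")
    return {
        stage: len(registry_ids & ids) / len(registry_ids)
        for stage, ids in retained_by_stage.items()
    }


# Hypothetical IDs, for illustration only.
registry = {1, 2, 3, 4}
retained = {
    "inclusion": {1, 2, 3, 9},
    "exclusion": {1, 2, 3},
    "verification": {1, 2},
}
stage_sensitivity = sensitivity_by_stage(registry, retained)
print(stage_sensitivity)
```

Restricting `registry` to a subset, such as patients with complete warehouse data, and rerunning the function would reproduce the three comparison categories described in the caption.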
