Rapid identification and phenotyping of nonalcoholic fatty liver disease patients using a machine-based approach in diverse healthcare systems

Anna O Basile¹,², Anurag Verma³, Leigh Anne Tang⁴, Marina Serper⁵, Andrew Scanga⁶, Ava Farrell⁷, Brittney Destin⁸, Rotonya M Carr⁹, Anuli Anyanwu-Ofili¹⁰, Gunaretnam Rajagopal¹⁰,¹¹, Abraham Krikhely⁸, Marc Bessler⁸, Muredach P Reilly¹²,¹³, Marylyn D Ritchie¹⁴, Nicholas P Tatonetti¹,¹⁵,¹⁶, Julia Wattacheril¹⁷

Affiliations

¹ Department of Biomedical Informatics, Columbia University, New York, New York, USA.
² Department of Computational Biology, New York Genome Center, New York, New York, USA.
³ Division of Translational Medicine and Human Genetics, Department of Medicine, Institute for Biomedical Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA.
⁴ Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
⁵ Division of Gastroenterology and Hepatology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA.
⁶ Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
⁷ Division of Critical Care, Department of Pediatrics, New York Presbyterian Morgan Stanley Children's Hospital of New York, New York, New York, USA.
⁸ Division of Metabolic and Bariatric Surgery, Department of Surgery, Columbia University Irving Medical Center, New York, New York, USA.
⁹ Division of Gastroenterology, Department of Medicine, University of Washington, Seattle, Washington, USA.
¹⁰ Johnson & Johnson Innovative Medicine, Spring House, Pennsylvania, USA.
¹¹ Samsara BioCapital, Palo Alto, California, USA.
¹² Irving Institute for Clinical and Translational Research, Columbia University, New York, New York, USA.
¹³ Division of Cardiology, Department of Medicine, Columbia University Irving Medical Center, New York, New York, USA.
¹⁴ Department of Genetics, Institute for Biomedical Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA.
¹⁵ Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, California, USA.
¹⁶ Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, California, USA.
¹⁷ Division of Digestive and Liver Diseases, Department of Medicine, Center for Liver Disease and Transplantation, Columbia University Irving Medical Center, New York, New York, USA.

PMID: 39739635
PMCID: PMC11686338
DOI: 10.1111/cts.70105

Anna O Basile et al. Clin Transl Sci. 2025 Jan;18(1):e70105. doi: 10.1111/cts.70105.

Abstract

Nonalcoholic fatty liver disease (NAFLD) is the most common cause of chronic liver disease worldwide and remains under-recognized within healthcare systems. Therapeutic interventions are rapidly advancing for its inflammatory phenotype, nonalcoholic steatohepatitis (NASH), at all stages of disease. Diagnosis codes alone fail to recognize and stratify at-risk patients accurately. Our work aims to rapidly identify NAFLD patients within large electronic health record (EHR) databases for automated stratification and targeted intervention based on clinically relevant phenotypes. We present a rule-based phenotyping algorithm for efficient identification of NAFLD patients, developed using EHRs from 6.4 million patients at Columbia University Irving Medical Center (CUIMC) and validated at two independent healthcare centers. The algorithm uses the Observational Medical Outcomes Partnership (OMOP) Common Data Model and queries structured and unstructured data elements, including diagnosis codes, laboratory measurements, and radiology and pathology modalities. Our approach identified 16,006 CUIMC NAFLD patients, of whom 10,753 (67%) were previously unidentifiable by NAFLD diagnosis codes. Fibrosis scoring (FIB-4, APRI, NAFLD-FS) of patients without histology identified 943 subjects with scores indicative of advanced fibrosis. The algorithm was validated at two independent healthcare systems, University of Pennsylvania Health System (UPHS) and Vanderbilt University Medical Center (VUMC), where 20,779 and 19,575 NAFLD patients were identified, respectively. Clinical chart review showed a high positive predictive value (PPV) at all three healthcare systems (91% at CUIMC, 75% at UPHS, and 85% at VUMC) and a sensitivity of 79.6%. Our rule-based algorithm provides an accurate, automated approach for rapidly identifying, stratifying, and sub-phenotyping NAFLD patients within a large EHR system.
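The three fibrosis scores named in the abstract can be illustrated with a minimal Python sketch. The formulas are the commonly published ones for FIB-4, APRI, and the NAFLD fibrosis score; the function names, argument names, and the cutoffs in `ASSUMED_CUTOFFS` are assumptions for illustration and are not confirmed as the values the authors used.

```python
import math

# Commonly cited cutoffs for advanced fibrosis. ASSUMPTION: the abstract does
# not state which thresholds the authors applied.
ASSUMED_CUTOFFS = {"fib4": 2.67, "apri": 1.0, "nafld_fs": 0.676}


def fib4(age_years, ast_u_l, alt_u_l, platelets_10e9_l):
    """FIB-4 = (age x AST) / (platelets x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_10e9_l * math.sqrt(alt_u_l))


def apri(ast_u_l, ast_upper_limit_u_l, platelets_10e9_l):
    """APRI = (AST / upper limit of normal AST) x 100 / platelets."""
    return (ast_u_l / ast_upper_limit_u_l) * 100.0 / platelets_10e9_l


def nafld_fs(age_years, bmi, has_ifg_or_diabetes, ast_u_l, alt_u_l,
             platelets_10e9_l, albumin_g_dl):
    """NAFLD fibrosis score (published linear formula)."""
    return (-1.675
            + 0.037 * age_years
            + 0.094 * bmi
            + 1.13 * int(has_ifg_or_diabetes)
            + 0.99 * (ast_u_l / alt_u_l)
            - 0.013 * platelets_10e9_l
            - 0.66 * albumin_g_dl)


def count_elevated(scores, cutoffs=ASSUMED_CUTOFFS):
    """Return how many of the supplied scores exceed their cutoff."""
    return sum(1 for name, value in scores.items() if value > cutoffs[name])


# Hypothetical patient values, not taken from the study.
example_scores = {
    "fib4": fib4(60, 80, 64, 100),
    "apri": apri(80, 40, 100),
    "nafld_fs": nafld_fs(60, 30, True, 80, 64, 100, 4.0),
}
example_elevated = count_elevated(example_scores)
```

Counting how many scores are elevated per patient mirrors the grouping by 1, 2, or 3 elevated scores used in Figure 4.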

Conflict of interest statement

Patent for algorithm to Columbia University Trustees; © 2021 The Trustees of Columbia University in the City of New York. The owner has no objection to reproduction of the work for academic non‐commercial purposes, but otherwise reserves all copyright rights whatsoever. J.W., N.P.T. and A.O.B. are co‐inventors. J.W. has received research support from Janssen, Galectin, Intercept, Genfit, Shire, Conatus, Zydus, AMRA, and is on advisory boards for GlaxoSmithKline and AstraZeneca. R.M.C. has received research support from Intercept Pharmaceuticals and Merck, Inc., and consulting fees from Intercept and AstraZeneca. M.S. is funded by NIDDK R01DK132138, R01DK131547 and has an unrestricted grant from Grifols, SA. A.K. is a speaker and proctor for Intuitive, a reviewer for surgical videos for Crowd Sourced Assessment of Technical Skills (CSATs), and a consultant for Johnson and Johnson and Surgical Specialties Corporation. G.R. retired from Janssen Pharma R&D as Scientific Fellow and Head of Computational Sciences and is currently a Venture Partner in Samsara BioCapital, Palo Alto, CA. All other authors declared no competing interests for this work.

Figures

**FIGURE 1**
Illustration of the NAFLD algorithm development and validation process. The algorithm was developed at Columbia University Irving Medical Center (CUIMC) by clinical and informatics teams. Clinical criteria for the algorithm, provided by medical experts, were used by the bioinformatics team to design queries that produced a set of potential NAFLD patients. The charts of these patients were reviewed by the clinical team to determine true NAFLD status. Clinical criteria, as represented in the EHR system, were adjusted based on chart review results, and the queries were refined. This process was repeated across each step of the algorithm until high accuracy was achieved. Once achieved, the queries were used to code the algorithm. Algorithmic validation was performed at the University of Pennsylvania Health System (UPHS) and Vanderbilt University Medical Center (VUMC), where the iterative process described above was repeated by clinical and informatics experts at each site. The final output of the algorithm is a list of NAFLD patients along with clinical characteristics (subset depicted above). A1c, glycated hemoglobin; DOB, date of birth; EHR, electronic health record; N, no; T2D, type 2 diabetes; Y, yes. Stock images for this figure are from BioRender.com.

**FIGURE 2**
Three main steps of the NAFLD algorithm. Step 1a lists the NAFLD risk indicator categories used for patient identification, with a few example selection criteria. A complete list of codes for selection or exclusion criteria can be found in the following Supplementary Tables: Step 1a (Table S1), Step 1b (Table S2), Step 2 (Table S3), and Step 3 (Tables S4 and S5). dx = diagnosis code (International Classification of Diseases, Ninth and Tenth Revision (ICD9/10)).
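The staged, rule-based structure of the algorithm (inclusion by risk indicators, removal of patients meeting exclusion criteria, and verification of hepatic steatosis) can be sketched as a sequence of filters. This is an illustrative sketch only: the `Patient` fields, the example codes, and `run_pipeline` are hypothetical and do not reproduce the authors' OMOP queries or the code lists in the Supplementary Tables.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    # All fields are hypothetical stand-ins for data pulled from an EHR.
    patient_id: int
    risk_indicators: set = field(default_factory=set)     # e.g. dx codes, labs
    exclusions: set = field(default_factory=set)          # exclusion criteria met
    steatosis_evidence: set = field(default_factory=set)  # imaging or pathology


def run_pipeline(patients):
    """Apply inclusion, exclusion, and verification filters in order.

    Returns the per-stage patient counts and the final verified list.
    """
    included = [p for p in patients if p.risk_indicators]
    retained = [p for p in included if not p.exclusions]
    verified = [p for p in retained if p.steatosis_evidence]
    counts = {
        "inclusion": len(included),
        "exclusion": len(retained),
        "verification": len(verified),
    }
    return counts, verified


# Hypothetical cohort, not study data.
cohort = [
    Patient(1, {"dx:fatty_liver"}, set(), {"imaging:steatosis"}),
    Patient(2, {"lab:elevated_alt"}, {"dx:viral_hepatitis"}, {"imaging:steatosis"}),
    Patient(3, {"dx:t2d"}, set(), set()),
    Patient(4),
]
stage_counts, final_patients = run_pipeline(cohort)
```

Reporting a count after every filter, as `stage_counts` does here, is what allows per-stage patient counts and per-stage sensitivity to be tabulated.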

**FIGURE 3**
Counts of patients at each stage of the algorithm. Data from Columbia University Irving Medical Center (CUIMC) are in blue, those from University of Pennsylvania Health System (UPHS) are in pink/purple, and those from Vanderbilt University Medical Center (VUMC) are in orange.

**FIGURE 4**
Proportion of patients with select risk indicator diagnoses among those with 1 (n = 2245), 2 (n = 943), or 3 (n = 204) elevated fibrosis scores (APRI, NAFLD FS, FIB‐4).

**FIGURE 5**
Sensitivity at Columbia University Irving Medical Center (CUIMC) after each stage of the algorithm. Sensitivity was assessed across three categories: (1) using all patients within the NAFLD registry maintained by CUIMC hepatology (blue, n = 167), (2) restricting to patients with complete data within the clinical data warehouse (red, n = 147), and (3) restricting to patients with physician‐validated exclusion codes (orange, n = 137). “Inclusion” refers to the identification of NAFLD patients. “Exclusion” is the removal of patients meeting exclusion criteria. “Verification” refers to the verification of hepatic steatosis, the final step of our algorithm (see Figure 2).
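As a minimal sketch of the two accuracy measures reported, sensitivity can be computed as the fraction of a reference registry recovered by the algorithm, and PPV as the fraction of chart-reviewed algorithm hits confirmed as true cases. The function names and the patient IDs below are hypothetical and are not study data.

```python
def sensitivity(registry_ids, flagged_ids):
    """Fraction of known (registry) cases that the algorithm also flagged."""
    registry_ids = set(registry_ids)
    return len(registry_ids & set(flagged_ids)) / len(registry_ids)


def positive_predictive_value(chart_review):
    """Fraction of reviewed, algorithm-flagged patients confirmed by chart review.

    chart_review maps patient ID -> True if the reviewer confirmed the case.
    """
    return sum(chart_review.values()) / len(chart_review)


# Hypothetical IDs: a five-patient registry and six algorithm-flagged patients.
registry = {1, 2, 3, 4, 5}
flagged = {2, 3, 4, 5, 9, 10}
example_sensitivity = sensitivity(registry, flagged)

# Hypothetical chart review of four flagged patients.
review = {2: True, 3: True, 9: False, 10: True}
example_ppv = positive_predictive_value(review)
```

Restricting `registry` to a subset (for example, only patients with complete data) before calling `sensitivity` corresponds to the three assessment categories described in the caption.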
