Sci Rep. 2025 Jul 9;15(1):24607.
doi: 10.1038/s41598-025-05838-9.

A computational framework for defining and validating reproducible phenotyping algorithms of 313 diseases in the UK Biobank

Ana Torralbo et al. Sci Rep. 2025.

Abstract

Accurate and reproducible phenotyping is essential for large-scale biomedical research. However, developing robust phenotype definitions in biobanks is challenging due to diverse data sources and varying medical ontologies. As a result, the current phenotyping landscape is fragmented. We developed a computational framework to harmonize electronic health record (EHR) data, participant questionnaires, and clinical registry information, defining 313 disease phenotypes among 502,356 UK Biobank (UKB) participants. Our method integrated four medical ontologies (Read v2, CTV3, ICD-10, OPCS-4) across seven data sources, including primary care, hospital admissions, cancer and death registries, and self-reported data on diseases, procedures, and medication. Phenotypes underwent multi-layered validation, assessing data source concordance, age-sex incidence and prevalence patterns, external comparison to a representative UK EHR dataset, modifiable risk factor associations, and genetic correlations with external genome-wide association studies (GWAS). Results indicated consistent disease distributions by age and sex, high correlation with prevalence estimates from non-selected general population data, confirmed risk factor associations, and significant genetic correlations with external GWAS for nine of ten evaluated diseases. Our approach establishes comprehensive disease validation profiles, improving phenotype generalizability despite inherent UKB demographic biases. The modular, reproducible framework can be extended to additional diseases and populations, supporting federated analyses across diverse biobanks and facilitating research in underrepresented populations.
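
As a rough illustration of what harmonizing several ontologies across data sources can involve, the sketch below matches records against a per-ontology code list and keeps the earliest matching date and the contributing sources for each participant. It is a minimal sketch under stated assumptions, not the authors' implementation: the record layout, the source labels, the `PHENOTYPE` dictionary, the function name `ascertain_cases`, and the prefix-matching rule are all hypothetical, and the codes are illustrative placeholders rather than the paper's code lists.

```python
from datetime import date

# Hypothetical phenotype definition: ontology name -> tuple of code prefixes.
# Codes are illustrative placeholders, not the paper's code lists.
PHENOTYPE = {
    "ICD-10": ("E11",),
    "Read v2": ("C10F",),
    "CTV3": ("X40J5",),
}


def ascertain_cases(records, definition):
    """Return {participant_id: {"first_date": date, "sources": set of str}}.

    Each record is assumed to be a tuple of
    (participant_id, source, ontology, code, event_date).
    """
    cases = {}
    for pid, source, ontology, code, event_date in records:
        prefixes = definition.get(ontology, ())
        # str.startswith accepts a tuple; an empty tuple never matches.
        if not code.startswith(prefixes):
            continue
        entry = cases.setdefault(pid, {"first_date": event_date, "sources": set()})
        entry["first_date"] = min(entry["first_date"], event_date)
        entry["sources"].add(source)
    return cases


# Toy records (fabricated for illustration only).
records = [
    (1, "hospital", "ICD-10", "E119", date(2012, 5, 1)),
    (1, "primary_care", "Read v2", "C10F.", date(2010, 3, 15)),
    (2, "primary_care", "CTV3", "X40J5", date(2015, 1, 2)),
    (3, "hospital", "ICD-10", "I10", date(2011, 7, 7)),
]

cases = ascertain_cases(records, PHENOTYPE)
```

Keeping the set of contributing sources per participant is what would make a source-concordance check possible later, since a participant can appear in more than one source.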

Conflict of interest statement

Declarations. Competing interests: MGE, JCW, DCCC, ASC, TGR and JMD are employees and stakeholders of GlaxoSmithKline (GSK). JCW holds shares in GSK and was an employee of GSK until the end of 2022. He has done extensive consultancy for pharma/biotech companies, with active agreements with Novo Nordisk, Relation Therapeutics Limited and Silence Therapeutics Plc. CT was employed at LifeArc after manuscript preparation. MB receives personal fees from GRAIL for membership of an Independent Data Monitoring Committee unrelated to this study. GF carried out work for this manuscript prior to starting as an employee at Novo Nordisk. The remaining authors have no conflict of interest.

Figures

Fig. 1
Overview of the phenotyping framework implementation in the UK Biobank, including (a) data sources (left), (b) Extract, Transform and Load (ETL) processes to obtain records based on the phenotype definition (middle), (c) phenotype validation layers.
Fig. 2
Proportion of patients (%) per phenotype identified in each source including example phenotypes with high case ascertainment in primary care (patients can appear in multiple sources). Cells with no digits denote sources not available in the phenotype definition.
Fig. 3
Baseline prevalence, incidence rate (per 10,000 person years) stratified by sex and age at UKB entry for exemplar diseases (error bars are 95% CI).
Fig. 4
Log10 sex-standardised period prevalence for CALIBER and UKB in EHR sources (A), CALIBER and UKB in any source (EHR sources and diseases self-reported in the baseline questionnaires) (B), and UKB in EHR and UKB in any source (EHR including self-reported diseases) (C) for age group 40–49 at UKB baseline, and including example diseases. Colour denotes disease groups.
Fig. 5
Baseline (UKB recruitment) risk factor associations with disease onset (all P < 0.0002) of current smoking (A), hypertension (B) and obese BMI (C).
Fig. 6
Validation profile for psoriasis.
Fig. 7
Validation profile for type 2 diabetes.
