Identification of clinically meaningful, overlapping obstructive respiratory disease subtypes via data-driven approaches in a primary care population
- PMID: 41137006
- PMCID: PMC12553182
- DOI: 10.1186/s12890-025-03953-x
Abstract
Background: Obstructive respiratory conditions, including asthma, bronchiectasis, and chronic obstructive pulmonary disease (COPD), are increasingly recognised as heterogeneous syndromes with significant overlap. Multiple disease pathways contribute to phenotypes that do not always align with textbook definitions, limiting the effectiveness of a one-size-fits-all approach. This study aims to identify, validate, and characterise clinically meaningful airway disease subtypes using electronic healthcare records (EHR) and unsupervised machine learning clustering techniques.
Methods: We applied k-means clustering to 626,651 patients with a diagnosis of asthma, bronchiectasis, or COPD, using linked national structured EHRs in England. Twenty-one clinical features, including risk factors and comorbidities, were analysed, with dimensionality reduction via principal component and multiple correspondence analyses. Associations between cluster membership and exacerbations, as well as respiratory and cardiovascular mortality, were assessed. Over 3,696,962 person-years of follow-up, 102,522 deaths were recorded. Cluster stability was evaluated after five years, and genome-wide association studies (GWAS) were conducted to explore genetic associations with cluster membership.
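The clustering step named in the Methods can be illustrated with a minimal sketch of k-means (Lloyd's algorithm). This is not the authors' pipeline: the abstract does not state which software was used, the dimensionality-reduction step is omitted here, and the toy two-dimensional points merely stand in for reduced component scores. The function name `kmeans`, the `seed` parameter, and the data are all hypothetical.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm). Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct data points (illustrative choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
    return centroids, labels


# Hypothetical toy data: two well-separated groups of "patients".
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(points, k=2)
```

On real data the number of clusters and the initialisation strongly affect the result, so a study of this kind would normally compare several values of k and repeated starts; the abstract reports only the final seven-cluster solution.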
Results: Seven clusters were identified, each encompassing patients across traditional diagnostic labels. Distinct clinical patterns emerged as follows: (1) High BMI female-predominant, (2) Older male-predominant with diabetes and cardiovascular disease, (3) Eosinophilic atopic, (4) Older non-comorbid, (5) Non-comorbid low BMI, (6) Neutrophilic smoker, (7) Anxious/depressed female-predominant. The cluster with cardiovascular comorbidities showed the highest rates of hospital admissions for exacerbations. Neutrophilic cluster 6 is a potential novel subtype marked by persistent neutrophilia and poor outcomes. Cluster stability over five years ranged from 38% to 78%. GWAS revealed significant genetic loci in a cluster enriched for allergic disease and eosinophilia, suggesting shared genetic mechanisms.
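One way to read the stability figures is as the share of patients in each baseline cluster who are assigned to the same cluster at follow-up. The abstract does not give the exact definition used, so the sketch below is an assumption, as are the function name `cluster_stability` and the toy labels. It also assumes cluster labels are comparable between the two time points.

```python
from collections import Counter


def cluster_stability(baseline, followup):
    """Per-cluster fraction of patients whose label is unchanged at follow-up."""
    if len(baseline) != len(followup):
        raise ValueError("baseline and followup must have the same length")
    totals = Counter(baseline)
    # Count patients whose baseline and follow-up labels agree.
    retained = Counter(b for b, f in zip(baseline, followup) if b == f)
    return {cluster: retained[cluster] / totals[cluster] for cluster in totals}


# Hypothetical labels for nine patients at baseline and five years later.
baseline = [1, 1, 1, 1, 2, 2, 2, 2, 2]
followup = [1, 1, 1, 2, 2, 2, 1, 3, 2]
stability = cluster_stability(baseline, followup)
```

In this toy example three of four patients stay in cluster 1 and three of five stay in cluster 2.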
Conclusions: This study provides a data-driven dissection of the heterogeneity underlying obstructive airway diseases in a large, real-world population. Unsupervised machine learning applied to national-scale EHR data revealed distinct and partially stable subtypes that transcend conventional diagnostic boundaries. These findings highlight the complexity and overlap of airway disease phenotypes and demonstrate the value of clustering approaches for uncovering clinically and biologically meaningful subgroups. This work lays the foundation for further exploration into mechanisms and prognosis within and across airway disease phenotypes.
Keywords: Asthma; Bronchiectasis; CALIBER; Chronic Obstructive Pulmonary Disease; Cluster Analysis; Electronic Health Records; Genome-wide association studies; Machine Learning; UKBiobank.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: This research was conducted in accordance with the Declaration of Helsinki. A protocol for this research was approved by the Independent Scientific Advisory Committee (ISAC) for MHRA Database Research (protocol number 22_001747). Generic ethical approval for observational research using CPRD with approval from ISAC has been granted by a Health Research Authority (HRA) Research Ethics Committee (East Midlands–Derby, REC reference number 05/MRE04/87). RB approval was obtained by the UK Biobank (reference 37126) and University College London Research Ethics Committee (application 14629/001). In all CPRD studies, consent is not given on an individual patient level. Selected general practices consent to this process at a practice level, with individual patients having the right to opt out. All participants provided informed consent at the time of recruitment to the UKB. Competing interests: The authors declare no competing interests.