BMC Pulm Med. 2025 Oct 24;25(1):487.
doi: 10.1186/s12890-025-03953-x.

Identification of clinically meaningful, overlapping obstructive respiratory disease subtypes via data-driven approaches in a primary care population

Maria Pikoula et al. BMC Pulm Med. 2025.

Abstract

Background: Obstructive respiratory conditions, including asthma, bronchiectasis, and chronic obstructive pulmonary disease (COPD), are increasingly recognised as heterogeneous syndromes with significant overlap. Multiple disease pathways contribute to phenotypes that do not always align with textbook definitions, limiting the effectiveness of a one-size-fits-all approach. This study aims to identify, validate, and characterise clinically meaningful airway disease subtypes using electronic healthcare records (EHR) and unsupervised machine learning clustering techniques.

Methods: We applied k-means clustering to 626,651 patients with a diagnosis of asthma, bronchiectasis, or COPD, using linked national structured EHRs in England. Twenty-one clinical features, including risk factors and comorbidities, were analysed, with dimensionality reduction via principal component and multiple correspondence analyses. Associations between cluster membership and exacerbations, as well as respiratory and cardiovascular mortality, were assessed. Over 3,696,962 person-years of follow-up, 102,522 deaths were recorded. Cluster stability was evaluated after five years, and genome-wide association studies (GWAS) were conducted to explore genetic associations with cluster membership.
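As an illustration of the clustering step only, the following is a minimal sketch of k-means (Lloyd's algorithm) on standardised features, written with the Python standard library. It is not the authors' pipeline: the dimensionality reduction (principal component and multiple correspondence analyses) is omitted, and the two features (age, BMI), the six toy patients, and the function names `standardise` and `kmeans` are all hypothetical.

```python
import random
from statistics import mean, pstdev


def standardise(rows):
    """Scale each column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]  # random initial centroids
    labels = None
    for _ in range(n_iter):
        # Assignment step: each patient joins the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [mean(c) for c in zip(*members)]
    return labels, centroids


# Hypothetical toy data: [age in years, BMI in kg/m^2], not from the study.
toy_patients = [[72, 24], [75, 23], [70, 25], [35, 38], [33, 40], [38, 37]]
toy_labels, toy_centroids = kmeans(standardise(toy_patients), k=2)
print(toy_labels)
```

Standardising first matters because k-means relies on Euclidean distance, so a feature measured on a larger scale would otherwise dominate the assignment. A study with 21 mixed continuous and categorical features would need an encoding and reduction step before this point, as the Methods describe.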

Results: Seven clusters were identified, each encompassing patients across traditional diagnostic labels. Distinct clinical patterns emerged as follows: (1) High BMI female-predominant, (2) Older male-predominant with diabetes and cardiovascular disease, (3) Eosinophilic atopic, (4) Older non-comorbid, (5) Non-comorbid low BMI, (6) Neutrophilic smoker, (7) Anxious/depressed female-predominant. The cluster with cardiovascular comorbidities showed the highest rates of hospital admissions for exacerbations. Neutrophilic cluster 6 is a potential novel subtype marked by persistent neutrophilia and poor outcomes. Cluster stability over five years ranged from 38% to 78%. GWAS revealed significant genetic loci in a cluster enriched for allergic disease and eosinophilia, suggesting shared genetic mechanisms.
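One way to read a per-cluster stability figure is as the share of a cluster's baseline members who carry the same label at follow-up. The sketch below computes that share under stated assumptions: the abstract does not give the authors' exact definition, cluster labels are assumed to be already aligned between the two time points, patients without a follow-up assignment are excluded from the denominator, and the patient IDs and the function name `cluster_stability` are hypothetical.

```python
from collections import Counter


def cluster_stability(baseline, follow_up):
    """Return {cluster: share of baseline members with the same label at follow-up}.

    Both arguments map patient ID -> cluster label. Patients with no
    follow-up label are skipped (an assumption, not the paper's stated rule).
    """
    totals, stayed = Counter(), Counter()
    for patient_id, cluster in baseline.items():
        if patient_id not in follow_up:
            continue  # no follow-up assignment, e.g. died or left the practice
        totals[cluster] += 1
        if follow_up[patient_id] == cluster:
            stayed[cluster] += 1
    return {cluster: stayed[cluster] / n for cluster, n in totals.items()}


# Hypothetical toy assignments, not study data.
baseline = {"p1": "k1", "p2": "k1", "p3": "k1", "p4": "k1",
            "p5": "k6", "p6": "k6", "p7": "k6"}
follow_up = {"p1": "k1", "p2": "k1", "p3": "k1", "p4": "k7",
             "p5": "k6", "p6": "k2"}  # p7 has no follow-up label

stability = cluster_stability(baseline, follow_up)
print(stability)
```

Counting patients without follow-up as "not stable" instead would lower every share, so the choice of denominator should be stated alongside any reported range.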

Conclusions: This study provides a data-driven dissection of the heterogeneity underlying obstructive airway diseases in a large, real-world population. Unsupervised machine learning applied to national-scale EHR data revealed distinct and partially stable subtypes that transcend conventional diagnostic boundaries. These findings highlight the complexity and overlap of airway disease phenotypes and demonstrate the value of clustering approaches for uncovering clinically and biologically meaningful subgroups. This work lays the foundation for further exploration into mechanisms and prognosis within and across airway disease phenotypes.

Keywords: Asthma; Bronchiectasis; CALIBER; Chronic Obstructive Pulmonary Disease; Cluster Analysis; Electronic Health Records; Genome-wide association studies; Machine Learning; UKBiobank.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: This research was conducted in accordance with the Declaration of Helsinki. A protocol for this research was approved by the Independent Scientific Advisory Committee (ISAC) for MHRA Database Research (protocol number 22_001747). Generic ethical approval for observational research using CPRD with approval from ISAC has been granted by a Health Research Authority (HRA) Research Ethics Committee (East Midlands–Derby, REC reference number 05/MRE04/87). RB approval was obtained by the UK Biobank (reference 37126) and University College London Research Ethics Committee (application 14629/001). In all CPRD studies, consent is not given on an individual patient level. Selected general practices consent to this process at a practice level, with individual patients having the right to opt out. All participants provided informed consent at the time of recruitment to the UKB. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Patient flow diagram. Individuals were eligible for inclusion if (a) they were (or turned) 18 years of age or older during the study period, (b) they had been registered for at least one year in a primary care practice, and (c) they had a diagnosis of asthma, bronchiectasis, or COPD, with the additional restriction that a COPD diagnosis had to be made at age 35 or older
Fig. 2
Schematic illustration of the analysis pipeline
Fig. 3
Radar plot illustrating the average feature importance for some of the key cluster characteristics. Ranges and units for each feature are given in brackets
Fig. 4
Airways disease diagnosis broken down by cluster, sorted by asthma prevalence. k1: High BMI female-predominant, k2: Older male-predominant with Diabetes and CVD comorbidities, k3: Eosinophilic atopic, k4: Older non-comorbid, k5: Non-comorbid low BMI, k6: Neutrophilic smoker, k7: Anxious/depressed female-predominant
Fig. 5
Cumulative average number of exacerbations over a 5-year period: (a) in primary care and (b) in hospital
Fig. 6
Cumulative average number of Emergency Department attendances over a 5-year period: (a) respiratory presenting complaint and (b) medical complaint (infectious disease, sepsis, cardiac conditions, cerebrovascular, respiratory, diabetes and endocrine)
Fig. 7
Alluvial diagram showing the crossover and stability of patients in each cluster over 5 years
