Sci Rep. 2021 Mar 8;11(1):5405.

doi: 10.1038/s41598-021-84860-z.

Characterisation, identification, clustering, and classification of disease

A J Webster¹, K Gaitskell²,³, I Turnbull², B J Cairns²,⁴, R Clarke²

Affiliations

¹ Nuffield Department of Population Health, University of Oxford, Oxford, UK. anthony.webster@ndph.ox.ac.uk.
² Nuffield Department of Population Health, University of Oxford, Oxford, UK.
³ Nuffield Division of Clinical Laboratory Sciences, Radcliffe Department of Medicine, University of Oxford, Oxford, UK.
⁴ MRC Population Health Research Unit, Nuffield Department of Population Health, University of Oxford, Oxford, UK.

PMID: 33686097
PMCID: PMC7940639
DOI: 10.1038/s41598-021-84860-z

A J Webster et al. Sci Rep. 2021.

Abstract

The importance of quantifying the distribution and determinants of multimorbidity has prompted novel data-driven classifications of disease. Applications have included improved statistical power and refined prognoses for a range of respiratory, infectious, autoimmune, and neurological diseases, with studies using molecular information, age of disease incidence, and sequences of disease onset ("disease trajectories") to classify disease clusters. Here we consider whether easily measured risk factors such as height and BMI can effectively characterise diseases in UK Biobank data, combining established statistical methods in new but rigorous ways to provide clinically relevant comparisons and clusters of disease. Over 400 common diseases were selected for analysis using clinical and epidemiological criteria, and conventional proportional hazards models were used to estimate associations with 12 established risk factors. Several diseases had strongly sex-dependent associations of disease risk with BMI. Importantly, a large proportion of diseases affecting both sexes could be identified by their risk factors, and equivalent diseases tended to cluster adjacently. These included 10 diseases presently classified as "Symptoms, signs, and abnormal clinical and laboratory findings, not elsewhere classified". Many clusters are associated with a shared, known pathogenesis; others suggest likely but presently unconfirmed causes. The specificity of associations and shared pathogenesis of many clustered diseases provide a new perspective on the interactions between biological pathways, risk factors, and patterns of disease such as multimorbidity.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1**
The “elbow” in the weighted sum of squares of differences in the fitted parameters in each cluster (Eq. 1), at approximately 24 clusters, qualitatively indicates how many clusters to keep. With 63 or more clusters there are no statistically significant differences at the 0.05 level between fitted parameters in each cluster (inset).

**Figure 2**
689 diseases of men or women were categorised as acute, chronic, infectious, injuries, or symptoms of unknown cause (separate plots), and grouped by the number of cases (horizontal axes). Colours show whether associations were statistically significant at the 0.05 level after a Bonferroni multiple-testing adjustment: not significant (orange), or significant and either passing (green) or failing (yellow) the proportional hazards test. The median number of cases was 214. The vertical axis for acute diseases has a different scale.

**Figure 3**
The proportion of diseases whose equivalent disease in the opposite sex has the smallest Bhattacharyya distance is plotted in green. The proportion of diseases with statistically significant differences between men and women is plotted in red. The differences are mainly due to different associations with BMI (inset).

**Figure 4**
Disease pairs with statistically significant differences in their associations with risk factors at the 0.05 level after an FDR multiple-testing adjustment. With all associations (left), and without BMI (right). Red indicates an association with higher risk for women than men, white a lower risk, and orange neutral. Without BMI as a risk factor, only two diseases continue to have statistically significant differences. The figures were produced with R and the “gplots” package.

**Figure 5**
The estimated fitting parameters and their covariance matrices were used to calculate the Bhattacharyya distances between diseases, which were then clustered hierarchically using the Ward.D2 algorithm. Diseases in men and women tend to cluster adjacently. Labels are coloured by their first ICD-10 digit, and the dendrogram is coloured with the top 24 groups in the cluster (see Fig. 1). Associations with potential risk factors are indicated by the heat map, with red indicating an association with higher risk, white an association with lower risk, and orange neutral. The figure was produced with R using packages “dendextend” and “gplots”.
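
The distance-and-clustering step described in the Figure 5 caption can be sketched roughly as follows. This is an illustration only, not the authors' code (the caption states that the figure was produced with R). It assumes each disease's fitted parameters can be treated as approximately multivariate normal, with the estimates as the mean and their covariance matrix as the covariance, and it uses SciPy's `ward` linkage as a stand-in for R's Ward.D2. The function names and the toy numbers are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two multivariate normals."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov = 0.5 * (cov1 + cov2)  # averaged covariance
    diff = mu1 - mu2
    # Term driven by the difference in means.
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Term driven by the difference in covariances (log-determinants for stability).
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return float(term_mean + term_cov)


def cluster_diseases(means, covs, n_clusters):
    """Pairwise distances, then Ward hierarchical clustering cut into n_clusters."""
    n = len(means)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = bhattacharyya_distance(means[i], covs[i], means[j], covs[j])
            dist[i, j] = dist[j, i] = d
    # linkage expects a condensed (upper-triangle) distance vector.
    tree = linkage(squareform(dist, checks=False), method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return dist, labels


# Hypothetical toy data: four "diseases", each with two fitted coefficients.
means = [[0.50, 0.10], [0.52, 0.12], [-0.40, 0.30], [-0.38, 0.28]]
covs = [0.01 * np.eye(2) for _ in means]

dist, labels = cluster_diseases(means, covs, n_clusters=2)
print(labels)
```

In this toy input the first two and last two parameter vectors are nearly identical, so cutting the tree at two clusters groups them pairwise, analogous to the way equivalent diseases in men and women are reported to cluster adjacently.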
