Consensus clustering with missing labels (ccml): a consensus clustering tool for multi-omics integrative prediction in cohorts with unequal sample coverage

Brief Bioinform. 2023 Nov 22;25(1):bbad501. doi: 10.1093/bib/bbad501.

Chuan-Xing Li¹, Hongyan Chen², Nazanin Zounemat-Kermani³,⁴, Ian M Adcock³,⁴, C Magnus Sköld¹,⁵, Meng Zhou², Åsa M Wheelock¹,⁵; U-BIOPRED study group

Affiliations

¹ Respiratory Medicine Unit, Department of Medicine Solna & Centre for Molecular Medicine, Karolinska Institutet.
² School of Biomedical Engineering, Wenzhou Medical University, Wenzhou, China.
³ National Heart and Lung Institute, Faculty of Medicine, Imperial College London, London, United Kingdom.
⁴ Data Science Institute, Imperial College London, London, United Kingdom.
⁵ Department of Respiratory Medicine and Allergy, Karolinska University Hospital Solna, Stockholm, Sweden.

PMID: 38205966
PMCID: PMC10782800
DOI: 10.1093/bib/bbad501

Chuan-Xing Li et al. Brief Bioinform. 2023.

Abstract

Multi-omics data integration is a complex and challenging task in biomedical research. Consensus clustering, also known as meta-clustering or cluster ensembles, has become an increasingly popular downstream tool for phenotyping and endotyping using multiple omics and clinical data. However, current consensus clustering methods typically rely on ensembling clustering outputs with similar sample coverages (mathematical replicates), which may not reflect real-world data with varying sample coverages (biological replicates). To address this issue, we propose a new strategy, consensus clustering with missing labels (ccml), implemented as an R protocol for two-step consensus clustering that can handle unequal missing labels (i.e. multiple predictive labels with different sample coverages). First, the regular consensus weights are adjusted (normalized) by sample coverage; then a regular consensus clustering is performed to predict the optimal final cluster. We applied the ccml method to predict molecularly distinct groups based on 9-omics integration in the Karolinska COSMIC cohort, which investigates chronic obstructive pulmonary disease, and to 24-omics handprint integrative subgrouping of adult asthma patients in the U-BIOPRED cohort. We propose ccml as a downstream toolkit for multi-omics integration analysis algorithms such as Similarity Network Fusion and robust clustering of clinical data, to overcome the limitations posed by missing data, which is inevitable in human cohorts consisting of multiple data modalities. The ccml tool is available in the R language (https://CRAN.R-project.org/package=ccml, https://github.com/pulmonomics-lab/ccml, or https://github.com/ZhoulabCPH/ccml).

Keywords: consensus clustering; missing labels; multi-omics integration; predictive labels; unequal sample coverage.
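
The two-step idea described in the abstract can be illustrated with a minimal Python sketch. ccml itself is distributed as an R package, and nothing below reproduces its code or API. The sketch assumes one simple reading of "adjusted by sample coverage": a pair's consensus weight is divided by the number of clustering results that cover both subjects, rather than by the total number of results. The function name `consensus_weights` and the toy label matrix are hypothetical.

```python
from itertools import combinations


def consensus_weights(labels, adjust_for_coverage=True):
    """Pairwise consensus weights from a subjects x results label matrix.

    labels: list of rows (one per subject); each row holds one cluster label
    per clustering result, with None where the subject was not covered.
    """
    n = len(labels)
    n_results = len(labels[0])
    weights = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        # Keep only the clustering results in which both subjects are present.
        shared = [(a, b) for a, b in zip(labels[i], labels[j])
                  if a is not None and b is not None]
        together = sum(1 for a, b in shared if a == b)
        # Assumption: coverage adjustment = divide by co-covered results.
        denom = len(shared) if adjust_for_coverage else n_results
        w = together / denom if denom else 0.0
        weights[i][j] = weights[j][i] = w
    return weights


# Toy cohort: subjects 1 and 3 are missing from some clustering results.
labels = [
    [1, 1, 1, 1],
    [1, 1, None, None],
    [2, 2, 2, 2],
    [2, None, 2, None],
]
raw = consensus_weights(labels, adjust_for_coverage=False)
adj = consensus_weights(labels, adjust_for_coverage=True)
print("unadjusted weight for subjects 0 and 1:", raw[0][1])
print("coverage-adjusted weight for subjects 0 and 1:", adj[0][1])
```

In this toy case subjects 0 and 1 agree in every result that covers both of them, yet the unadjusted weight is pulled down by the results where subject 1 is absent. The adjusted matrix could then be passed to any standard consensus clustering step.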

Figures

**Figure 1**
The input of the normalized consensus weights (NCW) algorithm is the same matrix as for the original consensus weights (CW) algorithm, where each row is a subject in the cohort and each column is a clustering result from one of the possible data modality combinations. For the exemplification using the COSMIC cohort, the input would consist of the 607 possible networks generated from all possible combinations of the 9 available omics datasets. In the original CW, a sample-by-sample matrix is generated in which each cell reports the fraction of the total number of clustering iterations where the sample pair is clustered together: the pairwise CW matrix, with range 0–1 (upper panel). The input cluster assignment matrices are permutated column-wise, while keeping the missing data points (lower panel). The CW calculated as described above is then compared against the permutated pairwise consensus distribution, and the probability (P) of the CW coming from the permutated distribution is calculated. NCW is calculated as 1 − P.
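
The permutation scheme in this caption can be sketched in Python as follows. This is a hedged illustration, not the ccml implementation. It assumes that P is estimated empirically as the fraction of permuted consensus weights that are at least as large as the observed one, and that each column is shuffled only among the subjects present in it. The helper names (`pair_cw`, `permute_columns`, `ncw`) are hypothetical, and `nperm` simply mirrors the parameter name mentioned in Figure 2.

```python
import random


def pair_cw(columns, i, j):
    """Fraction of clustering results in which subjects i and j share a label,
    counted over the results where both subjects are present."""
    shared = [(c[i], c[j]) for c in columns
              if c[i] is not None and c[j] is not None]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)


def permute_columns(columns, rng):
    """Shuffle labels within each column, leaving missing entries in place."""
    permuted = []
    for col in columns:
        present = [k for k, v in enumerate(col) if v is not None]
        values = [col[k] for k in present]
        rng.shuffle(values)
        new_col = list(col)
        for k, v in zip(present, values):
            new_col[k] = v
        permuted.append(new_col)
    return permuted


def ncw(columns, i, j, nperm=200, seed=0):
    """NCW = 1 - P, with P taken as an empirical upper-tail probability
    (an assumption; the caption does not specify how P is computed)."""
    rng = random.Random(seed)
    observed = pair_cw(columns, i, j)
    null = [pair_cw(permute_columns(columns, rng), i, j) for _ in range(nperm)]
    p = sum(v >= observed for v in null) / nperm
    return 1.0 - p


# Toy input: 8 clustering results (columns) over 6 subjects, some missing.
cols = [[1, 1, 1, 2, 2, 2] for _ in range(6)]
cols += [[1, 1, None, 2, 2, None], [1, 1, 1, 2, None, 2]]
print("NCW for a pair that always co-clusters:", ncw(cols, 0, 1))
print("NCW for a pair that never co-clusters:", ncw(cols, 0, 3))
```

A pair that co-clusters far more often than the permuted background receives an NCW close to 1, while a pair that never co-clusters receives 0.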

**Figure 2**
Example applications of multi-omics integrative clustering by means of ccml in two distinct cohorts: the Karolinska COSMIC cohort investigating chronic obstructive pulmonary disease (COPD) (A–E) and the U-BIOPRED cohort investigating severe asthma (F). (A) Correlation of the NCW (y-axis) and original CW (x-axis) strategies for omics combinations with existing sample rates of ≥40%, and networks consisting of five or more omics platforms. NCW provides a more relevant indication of the similarity in the clustering of subjects between iterations, as the weights are based on the significance level compared to a permutated background level of pairwise clustering (see Figure 1). Colors indicate the n-omics platforms included, with red representing 10% or less of the platforms included, and green representing 90% or more of the platforms included. (B) Stability of permutation numbers (*nperm*) for the NCW algorithm, using sample pairs with ≥40% existing sample rates for the respective omics combination, and five or more omics data modalities available. (C) Boxplot displaying the accuracy of group prediction, estimated as normalized mutual information (NMI) between the true label of the COSMIC cohort and predictive clusters generated by (i) two different clustering methods, spectral clustering (SC) and hierarchical clustering (HC), (ii) two weights, NCW or CW, and (iii) the two omics integration strategies, EQUAL or LARGER. Data are shown as median with interquartile range. (D) The accuracy of group prediction, estimated as NMI compared to the defined cohort subgroups, as a function of the n-omics platforms integrated. SC was performed in combination with NCW or CW, and EQUAL or LARGER strategies, respectively. Whereas all strategies merge at 5-tuple omics integration, NCW with the LARGER strategy provides more robust NMI at 3–4-tuple integration. (E) Heatmap illustrating the predictive accuracy (NMI; dark red = 1 and dark blue = 0) in relation to the number of omics platforms included (x-axis) and the threshold for missingness (existing sample rate, y-axis) when using SC combined with NCW and the LARGER strategy. For the dataset at hand, a higher number of omics datasets with a more permissive threshold for missing data provides better predictive power than fewer data platforms with a lower threshold for missing data. (F) Boxplot (median with interquartile range) of NMI between the final ensembled predicted labels output by ccml and the predictive clusters obtained by SC for each multi-omics combination in the EQUAL or LARGER strategies, respectively, using the U-BIOPRED cohort. The overall discrepancy in NMI seen between panels (A) and (F) is a reflection of the drastically different inclusion strategies and data missingness of the two cohorts.
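
The accuracy measure used throughout this figure, NMI, can be computed with the short Python sketch below. The caption does not state which normalization was used, so the arithmetic-mean form, 2·I(U;V) / (H(U) + H(V)), is an assumption here. Dropping subjects without a predicted label is likewise an illustrative choice, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, predicted_labels):
    """Normalized mutual information, 2*I(U;V) / (H(U) + H(V)).

    Subjects with a missing (None) label on either side are dropped first.
    """
    pairs = [(t, p) for t, p in zip(true_labels, predicted_labels)
             if t is not None and p is not None]
    if not pairs:
        return 0.0
    n = len(pairs)
    true = [t for t, _ in pairs]
    pred = [p for _, p in pairs]
    count_true, count_pred = Counter(true), Counter(pred)
    joint = Counter(pairs)
    mutual_info = sum(
        (c / n) * math.log((c * n) / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = entropy(true) + entropy(pred)
    # Both partitions consist of a single cluster: treat as perfect agreement.
    return 2.0 * mutual_info / denom if denom > 0 else 1.0


# Identical partitions (up to label names) versus independent partitions.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

NMI depends only on how subjects are grouped, not on the label names, which is why it suits comparisons between predicted clusters and predefined cohort subgroups.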

Cited by

Shin H, Oh S. An effective heuristic for developing hybrid feature selection in high dimensional and low sample size datasets. BMC Bioinformatics. 2024 Dec 26;25(1):390. doi: 10.1186/s12859-024-06017-9. PMID: 39722052.
Cheng L. Attention mechanism models for precision medicine. Brief Bioinform. 2024 May 23;25(4):bbae156. doi: 10.1093/bib/bbae156. PMID: 38811359.
