Brief Bioinform. 2023 Nov 22;25(1):bbad501.
doi: 10.1093/bib/bbad501.

Consensus clustering with missing labels (ccml): a consensus clustering tool for multi-omics integrative prediction in cohorts with unequal sample coverage

Chuan-Xing Li et al. Brief Bioinform. 2023.

Abstract

Multi-omics data integration is a complex and challenging task in biomedical research. Consensus clustering, also known as meta-clustering or cluster ensembles, has become an increasingly popular downstream tool for phenotyping and endotyping using multiple omics and clinical data. However, current consensus clustering methods typically rely on ensembling clustering outputs with similar sample coverages (mathematical replicates), which may not reflect real-world data with varying sample coverages (biological replicates). To address this issue, we propose consensus clustering with missing labels (ccml), an R protocol for two-step consensus clustering that can handle unequal missing labels (i.e. multiple predictive labels with different sample coverages). First, the regular consensus weights are adjusted (normalized) by sample coverage; then a regular consensus clustering is performed to predict the optimal final clusters. We applied the ccml method to predict molecularly distinct groups based on 9-omics integration in the Karolinska COSMIC cohort, which investigates chronic obstructive pulmonary disease (COPD), and to 24-omics handprint integrative subgrouping of adult asthma patients in the U-BIOPRED cohort. We propose ccml as a downstream toolkit for multi-omics integration analysis algorithms, such as Similarity Network Fusion, and for robust clustering of clinical data, to overcome the limitations posed by missing data, which are inevitable in human cohorts consisting of multiple data modalities. The ccml tool is available in the R language (https://CRAN.R-project.org/package=ccml, https://github.com/pulmonomics-lab/ccml, or https://github.com/ZhoulabCPH/ccml).

Keywords: consensus clustering; missing labels; multi-omics integration; predictive labels; unequal sample coverage.

Figures

Figure 1
The input to the normalized consensus weight (NCW) algorithm is the same matrix as for the original consensus weight (CW) algorithm: each row is a subject in the cohort and each column is a clustering result from one of the possible data modality combinations. For the COSMIC cohort example, the input would consist of the 607 possible networks generated from all possible combinations of the 9 available omics datasets. In the original CW algorithm, a pairwise CW matrix is generated in which each cell reports the fraction of the total number of clustering iterations in which the sample pair is clustered together, with range 0–1 (upper panel). The input cluster assignment matrices are then permuted column-wise while keeping the missing data points (lower panel). The CW calculated as described above is compared with the permuted pairwise consensus distribution, and the probability (P) of the CW coming from the permuted distribution is calculated. NCW is calculated as 1 − P.
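
The two panels of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the ccml implementation (ccml itself is an R package). It rests on several assumptions that are not stated in the caption: missing labels are encoded as `None`, the CW of a pair is computed only over columns where both subjects have a label, permutation shuffles the observed labels within each column while leaving missing entries in place, and P is taken as the empirical fraction of permutations whose CW is at least as large as the observed CW. The function names and the toy `labels` matrix are hypothetical; `nperm` follows the name used in Figure 2B.

```python
import random
from itertools import combinations


def consensus_weights(labels):
    """Pairwise CW: fraction of co-observed columns in which a pair shares a cluster.

    labels: list of rows (subjects); each row lists one cluster label per
    clustering result, with None marking a missing label (assumed encoding).
    """
    n_cols = len(labels[0])
    cw = {}
    for i, j in combinations(range(len(labels)), 2):
        together = observed = 0
        for k in range(n_cols):
            a, b = labels[i][k], labels[j][k]
            if a is None or b is None:
                continue  # pair not co-observed in this clustering result
            observed += 1
            together += a == b
        if observed:
            cw[(i, j)] = together / observed
    return cw


def permute_columns(labels, rng):
    """Shuffle observed labels within each column; None entries stay in place."""
    out = [row[:] for row in labels]
    for k in range(len(labels[0])):
        idx = [i for i, row in enumerate(labels) if row[k] is not None]
        vals = [labels[i][k] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            out[i][k] = v
    return out


def normalized_consensus_weights(labels, nperm=200, seed=0):
    """NCW = 1 - P, with P estimated empirically (an assumption of this sketch)."""
    rng = random.Random(seed)
    observed = consensus_weights(labels)
    exceed = dict.fromkeys(observed, 0)
    for _ in range(nperm):
        perm_cw = consensus_weights(permute_columns(labels, rng))
        for pair, w in observed.items():
            exceed[pair] += perm_cw[pair] >= w
    return {pair: 1 - exceed[pair] / nperm for pair in observed}


# Toy input: 5 subjects (rows) x 4 clustering results (columns).
labels = [
    [1, 1, 1, None],
    [1, 1, 1, 1],
    [2, 2, None, 1],
    [2, 2, 2, 2],
    [2, None, 2, 2],
]
cw = consensus_weights(labels)
ncw = normalized_consensus_weights(labels, nperm=200, seed=0)
for pair in sorted(cw):
    print(pair, round(cw[pair], 3), round(ncw[pair], 3))
```

In this toy matrix, subjects 0 and 1 co-cluster in all three columns they share, so their CW is 1.0, while subjects 0 and 2 never co-cluster, so their CW is 0.0. Because the permutation keeps the missingness pattern fixed, every pair is compared against a background with the same coverage it had in the observed data.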
Figure 2
Example applications of multi-omics integrative clustering by means of ccml in two distinct cohorts: the Karolinska COSMIC cohort investigating COPD (A–E) and the U-BIOPRED cohort investigating severe asthma (F). (A) Correlation of the NCW (y-axis) and original CW (x-axis) strategies for omics combinations with existing sample rates of ≥40% and networks consisting of five or more omics platforms. NCW provides a more relevant indication of the similarity in the clustering of subjects between iterations, as the weights are based on the significance level compared to a permuted background level of pairwise clustering (see Figure 1). Colors indicate the n-omics platforms included, with red representing 10% or less of the platforms included and green representing 90% or more of the platforms included. (B) Stability of permutation numbers (nperm) for the NCW algorithm, using sample pairs with ≥40% existing sample rates for the respective omics combination and five or more omics data modalities available. (C) Boxplot displaying the accuracy of group prediction, estimated as normalized mutual information (NMI) between the true label of the COSMIC cohort and predictive clusters generated by (i) two different clustering methods, SC and hierarchical clustering (HC), (ii) two weights, NCW or CW, and (iii) the two omics integration strategies, EQUAL or LARGER. Data are shown as median with interquartile range. (D) The accuracy of group prediction, estimated as NMI compared to the defined cohort subgroups, as a function of the n-omics platforms integrated. SC was performed in combination with NCW or CW, and with the EQUAL or LARGER strategies, respectively. Whereas all strategies converge at 5-tuple omics integration, NCW with the LARGER strategy provides more robust NMI at 3–4-tuple integration. (E) Heatmap illustrating the predictive accuracy (NMI; dark red = 1 and dark blue = 0) in relation to the number of omics platforms included (x-axis) and the threshold for missingness (existing sample rate, y-axis) when using SC combined with NCW and the LARGER strategy. For the dataset at hand, a higher number of omics datasets with a more permissive threshold for missing data provides better predictive power than fewer data platforms with a lower threshold for missing data. (F) Boxplot (median with interquartile range) of NMI accuracy between the final ensembled predicted label output from ccml and the predictive clusters obtained by SC for each multi-omics combination in the EQUAL or LARGER strategies, respectively, using the U-BIOPRED cohort. The overall discrepancy in NMI seen between panels (A) and (F) is a reflection of the drastically different inclusion strategies and data missingness of the two cohorts.
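
Panels C to F report accuracy as NMI between a reference labeling and a predicted clustering. The following is a minimal Python sketch of how such a score can be computed. The caption does not say which NMI normalization was used; this sketch assumes the arithmetic mean of the two label entropies, and it assumes that subjects missing from either labeling (encoded as `None`) are dropped before scoring. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """Normalized mutual information between two labelings.

    Assumptions of this sketch: subjects with a None label in either input
    are ignored, and MI is normalized by the arithmetic mean of the entropies.
    """
    pairs = [(t, p) for t, p in zip(true_labels, pred_labels)
             if t is not None and p is not None]
    if not pairs:
        return 0.0
    t, p = zip(*pairs)
    n = len(pairs)
    count_t, count_p, joint = Counter(t), Counter(p), Counter(pairs)
    # MI = sum over label pairs of p(a, b) * log(p(a, b) / (p(a) * p(b)))
    mi = sum((c / n) * math.log(c * n / (count_t[a] * count_p[b]))
             for (a, b), c in joint.items())
    denom = (entropy(t) + entropy(p)) / 2
    # Both labelings constant: treat them as identical partitions.
    return mi / denom if denom > 0 else 1.0


same_partition = nmi([1, 1, 2, 2], [2, 2, 1, 1])   # identical grouping, renamed labels
independent = nmi([1, 1, 2, 2], [1, 2, 1, 2])      # groupings share no information
with_missing = nmi([1, 1, 2, 2, None], [5, 5, 7, 7, 7])
print(same_partition, independent, with_missing)
```

NMI depends only on how subjects are grouped, not on the label names, so the first example scores 1 even though the cluster labels are swapped, and the second scores 0 because the two groupings are statistically independent.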
