2021 Apr 13;35(2):108975.
doi: 10.1016/j.celrep.2021.108975.

Merged Affinity Network Association Clustering: Joint multi-omic/clinical clustering to identify disease endotypes

Scott R Tyler et al. Cell Rep.

Abstract

Although clinical and laboratory data have long been used to guide medical practice, this information is rarely integrated with multi-omic data to identify endotypes. We present Merged Affinity Network Association Clustering (MANAclust), a coding-free, automated pipeline enabling integration of categorical and numeric data spanning clinical and multi-omic profiles for unsupervised clustering to identify disease subsets. Using simulations and real-world data from The Cancer Genome Atlas, we demonstrate that MANAclust's feature selection algorithms are accurate and outperform competitors. We also apply MANAclust to a clinically and multi-omically phenotyped asthma cohort. MANAclust identifies clinically and molecularly distinct clusters, including heterogeneous groups of "healthy controls" and viral and allergy-driven subsets of asthmatic subjects. We also find that subjects with similar clinical presentations have disparate molecular profiles, highlighting the need for additional testing to uncover asthma endotypes. This work facilitates data-driven personalized medicine through integration of clinical parameters with multi-omics. MANAclust is freely available at https://bitbucket.org/scottyler892/manaclust/src/master/.

Keywords: bioinformatics; categorical clustering; clinical data; clustering; data integration; endotypes; feature selection; multi-omics; personalized medicine; systems biology.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Figure 1. MANAclust pipeline
(A) Clinical data hold key information for defining diseases, yet they are often ignored by traditional (multi-)omics studies. MANAclust performs feature selection and merged analysis of categorical and continuous clinical data alongside traditional multi-omics, enabling discovery of disease subtypes. (B) Categorical and numeric datasets are fed into the program, after which feature selection is performed (see STAR Methods for details on algorithms). Normalized affinity matrices are calculated across all omes; these affinity matrices are then merged by taking their missing value compatible average. We combined the strengths of Louvain modularity and affinity propagation into a new clustering algorithm that is then used for final cluster (FC) assignment on the combined affinity matrix. FCs are then examined within each input dataset to determine whether they differ within that dataset, which identifies the consensus groups. Post-clustering analyses are also performed to identify the significant differences across each FC and consensus group for all datasets. Analyses are then collated and displayed in a webpage format.
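The merging step in (B) can be illustrated with a short sketch. This is a minimal example of a missing-value-compatible average, assuming each ome's affinity matrix is a square NumPy array over the same subjects with NaN marking subject pairs that the ome did not measure. The function name `merge_affinities` is hypothetical and this is not MANAclust's own code.

```python
import numpy as np

def merge_affinities(matrices):
    """Element-wise mean of affinity matrices, skipping NaN (missing) entries.

    Assumption: all matrices are square, share the same subject order, and
    use NaN where a subject pair was not measured in that ome.
    """
    stack = np.stack([np.asarray(m, dtype=float) for m in matrices])
    observed = ~np.isnan(stack)
    counts = observed.sum(axis=0)                      # omes contributing per pair
    totals = np.where(observed, stack, 0.0).sum(axis=0)
    merged = np.full(counts.shape, np.nan)             # stays NaN if no ome has the pair
    np.divide(totals, counts, out=merged, where=counts > 0)
    return merged

# Two omes; the first lacks the affinity between subjects 0 and 1.
ome_a = np.array([[1.0, np.nan], [np.nan, 1.0]])
ome_b = np.array([[1.0, 0.5], [0.5, 1.0]])
merged = merge_affinities([ome_a, ome_b])
```

Averaging only over the omes that observed a pair lets subjects with incomplete profiles stay in the combined matrix instead of being dropped.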
Figure 2. Categorical feature selection
(A) The observed dataset, which will often consist of both structured, meaningful, real features and randomly distributed features, is shuffled for 10 iterations to generate a background null distribution for feature selection. (B) To select real features, each feature is compared with all other features, creating a contingency table of all observations for each feature pair. (C) The mutual information for each feature pair is then calculated. This process is performed both with the observed dataset and its iteratively shuffled versions to create an accurate background for comparison. (D) We then calculate a background-corrected stack of information difference matrices by subtracting the null backgrounds from the observed cross-feature mutual information. (E) This difference matrix stack is then flattened to a single two-dimensional matrix of feature-pairwise mutual information difference by taking the minimum for each feature pair. This is essentially the worst-case scenario, in which there was the least amount of difference between the original dataset and one of the shuffled datasets. (F) To select meaningful features, we calculate the row-wise maximum from the minimum difference matrix. This results in a one-dimensional vector corresponding to the maximum amount of relative information contained in all pairwise comparisons for each individual feature. A cutoff is then applied to separate the high- and low-information features, retaining the high-information features for downstream log-loss and affinity matrix calculations. See Figure S1 for a summary of the mathematical approach to feature selection on numeric datasets. See also Figure S2 for a summary of the accuracy of these feature selection methods.
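Steps (A) through (F) can be sketched in a few lines of NumPy. This is an illustrative reading of the caption, not the published implementation: it assumes that shuffling permutes each feature column independently, that mutual information is measured in nats, and that the diagonal is left at zero. The names `mutual_information`, `pairwise_mi`, and `feature_scores` are hypothetical.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (nats) between two categorical vectors."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1)                 # (B) contingency table
    joint = table / table.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def pairwise_mi(data):
    """(C) Symmetric matrix of feature-pair mutual information (diagonal 0)."""
    n = data.shape[1]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_information(data[:, i], data[:, j])
    return mi

def feature_scores(data, n_shuffles=10, seed=0):
    """Per-feature score: row-wise max of the minimum observed-minus-null MI."""
    rng = np.random.default_rng(seed)
    observed = pairwise_mi(data)
    diffs = []
    for _ in range(n_shuffles):
        # (A) null dataset: assumption, each column is permuted independently
        shuffled = np.column_stack([rng.permutation(col) for col in data.T])
        diffs.append(observed - pairwise_mi(shuffled))  # (D) background-corrected
    min_diff = np.min(diffs, axis=0)                     # (E) worst case per pair
    return min_diff.max(axis=1)                          # (F) best pair per feature

# Toy data: features 0 and 1 share structure, features 2 and 3 are random.
rng = np.random.default_rng(1)
groups = rng.integers(0, 3, size=200)
data = np.column_stack([groups, groups,
                        rng.integers(0, 3, size=200),
                        rng.integers(0, 3, size=200)])
scores = feature_scores(data)
```

Features would then be retained with something like `scores > cutoff`; the caption does not state how the cutoff is chosen, so any threshold used with this sketch is an assumption.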
Figure 3. Comparison of MANAclust with other multi-omic clustering algorithms on TCGA data and simulated categorical data
We compared MANAclust’s ability to identify clinically distinct clusters with six other existing algorithms: two non-multi-omics-specialized algorithms, K-means and spectral clustering on concatenated datasets; and four specialized multi-omics algorithms, PINS (Nguyen et al., 2017), PINSPlus (Nguyen et al., 2019), Similarity Network Fusion (SNF) (Wang et al., 2014), and sparse multiple canonical correlation analysis (MCCA) (Witten and Tibshirani, 2009). (A) MANAclust finds the most significant differences in identified FCs in 10 different cancer types taken in aggregate. The clusters identified by each method were compared with each other for significant differences in survival rate and group segregation based on clinical attributes. Displayed are the sum of the −log10(P values) for each cancer type and the sum of the number of statistically significant differences in clinical attributes for each method. See Table S1 for all tabulated results. (B) All algorithms were also benchmarked using single-omes rather than in a multi-omic manner. In the top row of plots, each ome type (gene expression, DNA methylation, miRNA expression) denotes the sum across all cancer types for the −log10(P values) for significance in survival differences across groups (x axis) and the sum of the total number of enriched clinical parameters (y axis) for all cancer types combined. Note that MCCA works only with multi-omics data and is therefore not included in the single-omics comparison. (C) To assess the ability of MANAclust to accurately incorporate categorical data, we performed a synthetic dataset benchmark comparing MANAclust’s categorical clustering (orange) with KModes clustering using the elbow rule on within-group sum of Hamming distances. MANAclust was slightly more accurate than the KModes elbow rule in selecting the appropriate number of clusters. However, MANAclust significantly outperformed KModes in clustering purity (F = 445, P = 1.38e–22, one-way ANOVA) and relative mutual information (F = 53.7, P = 8.75e–9, one-way ANOVA). Categorical simulations included 5, 10, 15, and 20 groups with 1,000 subjects per simulation; each scenario was simulated five times. See Figure S3 for benchmarking of MANAclust’s feature selection algorithms.
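For readers unfamiliar with the purity metric in (C), the sketch below shows the standard definition: each predicted cluster is credited with its most common true label, and the credited counts are divided by the number of subjects. The caption does not spell out the exact formula used in the benchmark, so treating it as this textbook definition is an assumption, and the function name `purity` is illustrative.

```python
from collections import Counter

def purity(true_labels, cluster_labels):
    """Fraction of subjects that carry the majority true label of their cluster."""
    per_cluster = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        per_cluster.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(true_labels)

# Cluster 0 holds two "a" and one "b"; cluster 1 holds one "b".
example = purity(["a", "a", "b", "b"], [0, 0, 0, 1])  # (2 + 1) / 4 = 0.75
```

Purity rewards over-splitting (one cluster per subject scores 1.0), which is presumably why the caption reports it alongside cluster-number accuracy and relative mutual information.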
Figure 4. Clinical, methylation, transcriptome, and microbiome consensus groups
(A–D) For each input dataset, the large-scale take-home message from the top enriched pathways or variables that characterized the given consensus groups is shown. Each consensus group has a unique color that maps consistently throughout the figure. A thick border is drawn around consensus groups that uniquely map to a single consensus group of all other datasets. In other words, if one can determine this given ome, one can universally determine the subject’s group membership across other omes. (A) Pathway analysis was performed for the transcriptome-level consensus groups using PyMINEr; selected pathways were prioritized using PyMINEr’s individual class importance metric (Table S2C) (Reimand et al., 2007; Tyler et al., 2019). (B) For methylome pathway analyses, methylation loci were filtered only for those with promoter annotations, using the given promoter as genes for analysis by PyMINEr (Table S2D) (Reimand et al., 2007; Tyler et al., 2019). (C) Relative abundance (by percentage) for the four most prevalent bacterial genera in each of the microbiome consensus groups. A note on the consensus group alpha diversity is also made beneath genus level quantifications; alpha diversity measures are shown directly in Figure S4D. (D) Each consensus group shows the percent abundance of asthma, proportion of allergen-specific IgE (a marker of sensitization to specific allergens), and serum total IgE, normalized to mean total IgE of the highest group. See STAR Methods for details on quantification of the allergen-specific IgE calculation. Bars indicate means; error bars indicate standard error. (E) A circular graph network showing each data type’s consensus groups and their connections to other consensus groups in different data types. All the subjects from within that consensus group of a given data type are examined, and if at least one subject within the given consensus group maps to a consensus group from a different data type, those consensus groups are connected. The edges are weighted by the Bayesian probability of consensus group membership for the target given the source of the arrow. See Figure S4 for more details. (F) The FCs are characterized by their membership in their given consensus groups for each data type. Colors within the table are consistent with those in (A)–(E). For each FC, the relative percentage with asthma is shown (gray bars) along with asthma control test (ACT) scores (blue bars), a measure of asthma severity. Bars indicate mean ± standard error. (G) The significant subset of (E); only edges that showed significant concordance between the two consensus groups are shown. Edge thickness corresponds to the probability of membership in the target consensus group, given that a subject is a member of the source consensus group of a different dataset. Consensus group nodes are colored based on their data type only. (H) A boxplot showing the maximum level of significance (with FCs or consensus groups) using all data, clinical data only, all numeric data, or each numeric data type on its own. Using all data resulted in the greatest average significance of held-out variables. (I) This finding was consistent, including when subtracting each variable’s instance of least significance to normalize for the different baselines within a variable.
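The edge weighting described in (E) and (G) can be illustrated as a conditional probability: among subjects in a source consensus group, the fraction that fall in each target consensus group of another data type. The sketch below assumes this simple empirical estimate; the caption does not give the exact estimator, and the function name and group labels are hypothetical.

```python
from collections import Counter

def membership_edges(source_groups, target_groups):
    """Map (source, target) -> P(target group | source group).

    Inputs are per-subject group labels in the same subject order. Only pairs
    shared by at least one subject receive an edge, matching the connection
    rule described in (E).
    """
    pair_counts = Counter(zip(source_groups, target_groups))
    source_counts = Counter(source_groups)
    return {(s, t): n / source_counts[s] for (s, t), n in pair_counts.items()}

# Four subjects with transcriptome (T*) and methylome (M*) consensus groups.
edges = membership_edges(["T1", "T1", "T1", "T2"], ["M1", "M1", "M2", "M2"])
```

The weights are directional: P(M2 | T2) is 1.0 here, while the reverse edge P(T2 | M2) would be 0.5, which is why the figure draws arrows rather than undirected edges.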
