Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Multicenter Study
. 2022 Feb;7(2):238-250.
doi: 10.1038/s41564-021-01030-7. Epub 2022 Jan 27.

Multi-kingdom microbiota analyses identify bacterial-fungal interactions and biomarkers of colorectal cancer across cohorts

Affiliations
Multicenter Study

Multi-kingdom microbiota analyses identify bacterial-fungal interactions and biomarkers of colorectal cancer across cohorts

Ning-Ning Liu et al. Nat Microbiol. 2022 Feb.

Abstract

Despite recent progress in our understanding of the association between the gut microbiome and colorectal cancer (CRC), multi-kingdom gut microbiome dysbiosis in CRC across cohorts is unexplored. We investigated four-kingdom microbiota alterations using CRC metagenomic datasets of 1,368 samples from 8 distinct geographical cohorts. Integrated analysis identified 20 archaeal, 27 bacterial, 20 fungal and 21 viral species for each single-kingdom diagnostic model. However, our data revealed superior diagnostic accuracy for models constructed with multi-kingdom markers, in particular the addition of fungal species. Specifically, 16 multi-kingdom markers including 11 bacterial, 4 fungal and 1 archaeal feature, achieved good performance in diagnosing patients with CRC (area under the receiver operating characteristic curve (AUROC) = 0.83) and maintained accuracy across 3 independent cohorts. Coabundance analysis of the ecological network revealed associations between bacterial and fungal species, such as Talaromyces islandicus and Clostridium saccharobutylicum. Using metagenome shotgun sequencing data, the predictive power of the microbial functional potential was explored and elevated D-amino acid metabolism and butanoate metabolism were observed in CRC. Interestingly, the diagnostic model based on functional EggNOG genes achieved high accuracy (AUROC = 0.86). Collectively, our findings uncovered CRC-associated microbiota common across cohorts and demonstrate the applicability of multi-kingdom and functional markers as CRC diagnostic tools and, potentially, as therapeutic targets for the treatment of CRC.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of the patient populations with CRC included in this study and their associated gut microbiome compositions.
a, Global map representing a total of 1,368 samples from 8 patient populations with faecal shotgun metagenomic data. The discovery data populations included Austria (AUS, PRJEB7774), France (FRA, PRJEB6070), Germany (GER, PRJEB27928), China (CHN_HK, PRJEB10878) and Japan (JPN, PRJDB4176). The validation data populations included the United States (USA, PRJEB12449), Italy (ITA, SRP136711) and China (CHN_SH, in-house). The numbers in brackets represents sample size. Details are shown in Supplementary Data 1. b, Alpha diversity measured by Shannon index of patients with CRC (red, n = 491) and control individuals (blue, n = 494). Adjusted P value (FDR = 2.579 × 10−4, two-sided test) was calculated by MMUPHin. Data are shown via the interquartile ranges (IQRs) with the median as a black horizontal line and the whiskers extending up to the most extreme points within 1.5× the IQR; outliers are represented as dots. c, Principal coordinate analysis (PCoA) of samples from all five cohorts based on Bray–Curtis distance, which shows that microbial composition was different between groups (P = 0.001) and cohorts (P = 0.001). P values of beta diversity based on Bray–Curtis distance were calculated with PERMANOVA by 999 permutations (two-sided test). The group is colour-coded and the cohort is indicated by different shapes. d, UpSet plot showing the number of differential bacterial species identified via MaAsLin2 in each population and shared by combinations of datasets. The number above each column represents the size of differential species. The set size on the right represents the number of differential species in each cohort and the connected dots represent the common differential species across connected cohorts. e, UpSet plot showing the number of differential fungal species identified via MaAsLin2 in each population and shared by combinations of datasets. The number above each column represents the size of differential species. The set size on the right represents the number of differential species in each cohort and the connected dots represent the common differential species across connected cohorts. Source data
Fig. 2
Fig. 2. Differential species across populations and prediction performances of models constructed with each single-kingdom features.
a, Phylogenetic tree showing the union of differential fungal species (231 in total), grouped by the phyla Ascomycota, Basidiomycota, Chytridiomycota, Cryptomycota, Microsporidia, Mucoromycota and Zoopagomycotina. The outer circles are marked for significant differential species (P < 0.05, two-sided test) in each cohort; the meta-analysis results were identified via MMUPHin (meta-ring) with orange for increased species and green for decreased species. Species marked with purple stars represent features selected in the classification model. The bar plots show the abundance fold change normalized by the log of marker features in each population. The number represents the marker number, which is marked with a star. The colour represents the population and the red bars are the fold change of all individuals in the CRC and control groups. b, The heatmap shows the AUROC values of models built with single-kingdom features in each cohort. The values refer to the average value of the 20× fivefold cross-validation. The asterisk represents the significance of models assessed with 1,000 permutations (two-sided test). *P = 0.001. c, Box plots showing the cohort-to-cohort AUROC values of the models using the features of each kingdom. Data are shown via the IQRs, with the median as a black horizontal line and the whiskers extending up to the most extreme points within the 1.5× the IQR (n = 4). Source data
Fig. 3
Fig. 3. Performance of predictive models constructed with combined multi-kingdom features and the integrated importance of these essential features in each geographical cohort.
a, Heatmap showing the AUROC values of the models built with multi-kingdom features in each cohort. The values refer to an average value of 20× repeated fivefold cross-validation. The asterisk represents the significance of models assessed with 1,000 permutations (two-sided test). *P = 0.001. A, Archaea; B, Bacteria; F, Fungi; V, Virus. b, Box plots showing the AUROC values of cohort-to-cohort transfer validation for the models using multi-kingdom features. Data are shown via the IQRs with the median as a black horizontal line and the whiskers extending up to the most extreme points within 1.5× the IQR (n = 4). c, Importance of each listed feature (belonging to the four-kingdom model) by the cross-validation of predictive performance for each population dataset as estimated using the internal random forest ‘Gini importance’ method. The ‘aggregation’ column shows the integrated ranks (using a rank aggregation algorithm) of listed markers within each cohort along with changes in abundance (differentials), with red indicating a species increase and blue indicating a species decrease in patients with CRC compared to controls. d, AUROC matrix of models built with the panel of 16 multi-kingdom features for CRC detection. Values on the diagonal refer to the average AUROC of 20× repeated fivefold stratified cross-validations. Values off the diagonal refer to the AUROCs obtained by training the model on the population of the corresponding row and applying it to the population of the corresponding column. The LOCO row refers to the performances obtained by training the model on the 16 microbial features using all but the population dataset of the corresponding column and applying it to the dataset of the corresponding column. The asterisk represents the significance of models assessed with 1,000 permutations (two-sided test). *P = 0.001. e, AUROC matrix of models built with the panel of 16 multi-kingdom features for CRC early detection. Values on the diagonal refer to the average AUROC of 20× repeated fivefold stratified cross-validations. Values off the diagonal refer to the AUROCs obtained by training the model on the population of the corresponding row and applying it to the population of the corresponding column. The LOCO row refers to the performances obtained by training the model on the 16 microbial features using all but the population dataset of the corresponding column and applying it to the dataset of the corresponding column. The asterisk represents the significance of models assessed with 1,000 permutations (two-sided test). *P = 0.001. Source data
Fig. 4
Fig. 4. Coabundance correlations among multi-kingdom species in patients with CRC and controls.
a, Coabundance networks involving combined differential species from all four kingdoms in the CRC and control samples. The colours of nodes indicate species from bacteria (green), fungi (orange), archaea (blue) and viruses (purple). Only significant (FDR < 0.00001, two-sided tests of 1,000 permutations) absolute correlations above 0.3 are shown, which are considered as fair correlations. The purple lines indicate positive species interactions; the grey lines indicate negative interactions. b, Moderate coabundance networks in controls and patients with CRC, with absolute correlations above 0.6 and with a significance cut-off of FDR < 0.00001 (two-sided tests of 1,000 permutations). The edges are coloured according to the magnitude of the association in the moderate networks as shown by the colour bar. Source data
Fig. 5
Fig. 5. CRC-associated functional alterations and performance of models constructed with KO genes.
a, The box plots (left) show the relative abundance of the pathway of controls (blue bar) and patients with CRC (red bar) in each cohort. The number of samples was AUS (patients with CRC = 46, controls = 63), FRA (patients with CRC = 53, controls = 61), GER (patients with CRC = 60, controls = 65), CHN (patients with CRC = 80, controls = 86), JPN (patients with CRC = 258, controls = 251), respectively. All box plots represent the 25th–75th percentile of the distribution; the median is shown as a thick line in the middle of the box; the whiskers extend up to the most extreme points within a 1.5× the IQR and outliers are represented as dots. The heatmap (centre) shows the integrated meta-analysis that identified significantly changed KO gene expression in each metabolic pathway examined across five geographical populations. The cell colour and intensity represent the generalized abundance fold change of KO genes. The significant differential KO gene (P < 0.05, two-sided test) was identified via MMUPHin. P values are shown in the cells. b, Normalized log abundance for the functional genes bdhA/B (K00100), oraE (K17898) and oraS (K17899) is compared between controls (n = 494) and patients with CRC (n = 491). Statistical significance was determined via MMUPHin with treating age, BMI and sex as covariates (two-sided test). c,d, Expression of bdhA and bdhB in the butanoate metabolism pathway (c) and oraE and oraS in the D-arginine and D-ornithine metabolism pathway (d) were upregulated in patients with CRC (n = 24) than controls (n = 24) determined via qPCR with gDNA. Data are presented as the mean ± s.d. of three biological replicates. P values were calculated using a two-sided Wilcoxon signed-rank test and were Bonferroni-adjusted. The box plots show the IQRs as boxes, with the median as a black horizontal line and the whiskers extending up to the most extreme points within the 1.5× the IQR. e, AUROC matrix of models built with the 175 important EggNOG genes. Values on the diagonal refer to the average AUROC of 20× repeated fivefold stratified cross-validations. Values off the diagonal refer to the AUROCs obtained by training the model on the population of the corresponding row and applying it to the population of the corresponding column. The LOCO row refers to the performances obtained by training the model using all but the cohort dataset of the corresponding column and applying it to the dataset of the corresponding column. The asterisk represents the significance of models assessed with 1,000 permutations (two-sided test). *P = 0.001. Source data
Extended Data Fig. 1
Extended Data Fig. 1. Overview of microbial composition in four kingdoms across populations.
Microbial composition of four kingdoms in CRC and control (CTR) group, respectively. Composition of bacteria, fungi and archaea was shown at phylum level and composition of virus was shown at family level. Only the abundant phyla are shown in the pie chart and the rare family are summed into others. Source data
Extended Data Fig. 2
Extended Data Fig. 2. Differential species of archaea and virus among populations.
a, UpSet plot showing the number of differential archaea species identified via MaAsLin2 in each population and shared by combinations of datasets. The number on top of each column represents the size of differential species. The set size on the right represents the number of differential species in each cohort and connected dots represent the common differential species across connected cohorts. b, UpSet plot showing the number of differential viral species identified via MaAsLin2 in each population and shared by combinations of datasets. The number on top of each column represents the size of differential species. The set size on the right represents the number of differential species in each cohort and connected dots represent the common differential species across connected cohorts. Source data
Extended Data Fig. 3
Extended Data Fig. 3. Differential bacterial species across populations and prediction performances of models constructed with each single-kingdom features.
Phylogenetic tree showing the union of different bacterial species (88 in total), grouped by the phyla grouped in the phyla Actinobacteria, Basidiomycota, Proteobacteria, Bacteroidetes, Firmicutes and so on. The outer circles are marked for significant differential species (p < 0.05, two-sided test) in each population and the meta-analysis results identified via MMUPHin with orange for increased species and green for decreased species. Species marked with purple stars were features selected in the classification model. Bar plots show the abundances fold change (FC) normalized by log of marker features in each population. The number represents the marker number marked with star. Color represents population and red bars are FC of all subjects in CRC and CTR. Source data
Extended Data Fig. 4
Extended Data Fig. 4. Differential archaeal species across populations and prediction performances of models constructed with each single-kingdom features.
a, Phylogenetic tree showing the union of differential archaeal species (38 in total), grouped by the phyla grouped in the phyla Crenarchaeota, Euryarchaeota, Thaumarchaeota and Candidatus Korarchaeota. The outer circles are marked for significant differential species (p < 0.05, two-sided test) in each population and the meta-analysis results identified via MMUPHin with orange for increased species and green for decreased species. Species marked with purple stars were features selected in the classification model. Bar plots show the abundances fold change (FC) normalized by log of marker features in each population. The number represents the marker number marked with star. Color represents population and red bars are FC of all subjects in CRC and CTR. b, Bar plots show the abundances fold change (FC) normalized by log of viral marker features in each population. Color represents population and red bars are FC of all subjects in CRC and CTR. Source data
Extended Data Fig. 5
Extended Data Fig. 5. Performance of LOCO analysis in single- and multi-kingdom models.
The AUROC values of LOCO analysis in single-kingdom (a) and multi-kingdom (b) models. The asterisk represents the significance of models assessed with 1000 permutations (two-sided test). *: p = 0.001. A: Archaea; B: Bacteria; F: Fungi and V: Virus. Source data
Extended Data Fig. 6
Extended Data Fig. 6. The abundance changes of the best panel of 16 multi-kingdom features changes in CTR (n = 494), early- stage CRC (n = 318) and advanced CRC (n = 173).
The p values were calculated via MMUPHin (two-sided test). Data were showed via the interquartile ranges (IQRs) with the median as a black horizontal line and the whiskers extending up to the most extreme points within 1.5-fold IQR, and outliers are represented as dots. Source data
Extended Data Fig. 7
Extended Data Fig. 7. Validation of the multi-kingdom markers panel in independent cohorts and Specificity for CRC.
The diagnostic accuracy based on the multi-kingdom markers for independent cohorts one from China, two from Italy and one from USA, is indicated by the (a) cross-validation, (b) cohort-to-cohort and (c) LOCO analysis, which was trained with cohorts in discovery datasets. Bar height for cross-validation corresponds to the average of 20 time repeats (the error bars indicate the s.d., n = 20), and the bar height for cohort-to-cohort analysis corresponds to the average of five models (the error bars indicate the mean ± SD, n = 5). The diagnostic accuracy based on the multi-kingdom markers for CRC in independent cohorts and non-CRC disease, including IBD, T2D and PD. The AUROC values of CHN_SH (d) and ITA (e) were significantly higher than that of non-CRC disease. The AUROC values of USA (f) were significantly higher than T2D and PD but with no difference with IBD. Bar height for analysis corresponds to the average of 20 times for five-fold cross-validation models (the error bars indicate the mean ± SD, n = 20). All the p value was adjusted by ‘Bonferroni’ via two-sided dunn test after Kruskal-Wallis test. Source data
Extended Data Fig. 8
Extended Data Fig. 8. Associations between differential species from four-kingdom and differential functional pathways.
Heatmap shows Spearman correlations between bacterial or fungal species and different metabolic pathways as identified with HAlla. Source data
Extended Data Fig. 9
Extended Data Fig. 9. Associations between multi-kingdom markers and differential functional pathways.
Heatmap shows Spearman correlations between bacterial or fungal species and different metabolic pathways as identified via HAlla with default parameters. The p value was estimated from two-sided Benjamini-Hochberg-Yekutieli and exact p-value was shown in cells. Source data
Extended Data Fig. 10
Extended Data Fig. 10
AUROC matrix of models built with the (a) 47 important KO genes and (b) 20 KEGG pathways. Values on the diagonal refer to the average AUROC of 20-times repeated five-fold stratified cross-validations. Off-diagonal values refer to the AUROCs obtained by training the model on the population of the corresponding row and applying it to the population of the corresponding column. The LOCO row refers to the performances obtained by training the model using all but the cohort dataset of the corresponding column and applying it to the dataset of the corresponding column. Source data

References

    1. Keum N, Giovannucci E. Global burden of colorectal cancer: emerging trends, risk factors and prevention strategies. Nat. Rev. Gastroenterol. Hepatol. 2019;16:713–732. doi: 10.1038/s41575-019-0189-8. - DOI - PubMed
    1. Janney A, Powrie F, Mann EH. Host–microbiota maladaptation in colorectal cancer. Nature. 2020;585:509–517. doi: 10.1038/s41586-020-2729-3. - DOI - PubMed
    1. Yu T, et al. Fusobacterium nucleatum promotes chemoresistance to colorectal cancer by modulating autophagy. Cell. 2017;170:548–563.e16. doi: 10.1016/j.cell.2017.07.008. - DOI - PMC - PubMed
    1. Mariotto AB, Yabroff KR, Shao Y, Feuer EJ, Brown ML. Projections of the cost of cancer care in the United States: 2010–2020. J. Natl Cancer Inst. 2011;103:117–128. doi: 10.1093/jnci/djq495. - DOI - PMC - PubMed
    1. Dekker E, Tanis PJ, Vleugels JLA, Kasi PM, Wallace MB. Colorectal cancer. Lancet. 2019;394:1467–1480. doi: 10.1016/S0140-6736(19)32319-0. - DOI - PubMed

Publication types

MeSH terms