Genome Res. 2024 Jul 23;34(6):967-978.
doi: 10.1101/gr.278637.123.

Global compositional and functional states of the human gut microbiome in health and disease

Sunjae Lee et al. Genome Res. 2024.

Abstract

The human gut microbiota is of increasing interest, with metagenomics a key tool for analyzing bacterial diversity and functionality in health and disease. Despite growing efforts to expand microbial gene catalogs and a rising number of metagenome-assembled genomes, there have been few pan-metagenomic association studies and in-depth functional analyses across different geographies and diseases. Here, we explored 6014 human gut metagenome samples across 19 countries and 23 diseases by performing compositional, functional cluster, and integrative analyses. Using interpreted machine learning classification models and statistical methods, we identified Fusobacterium nucleatum and Anaerostipes hadrus as the species most frequently enriched and depleted, respectively, across different disease cohorts. Distinct functional distributions were observed in the gut microbiomes of westernized and nonwesternized populations. These compositional and functional analyses are presented in the open-access Human Gut Microbiome Atlas, allowing for the exploration of the richness, disease, and regional signatures of the gut microbiota across different cohorts.

Figures

Figure 1.
Characterization of the global gut microbiome in health and disease. Pan-metagenomics association studies of health and disease. Corresponding data sets were publicly shared as a resource: the Human Gut Microbiome Atlas (HGMA). (A) The geographical distribution of the data sets used in this study (the number of samples is shown in parentheses). (B) Disease data sets of shotgun metagenomics used in this study. (C) The workflow of the metagenomic species pan-genome (MSP) quantification together with functional characterization. We first constructed 1989 MSPs for the gut microbiome with MSPminer based on co-abundant gene profiles, which help identify gene cluster markers likely belonging to the same species. Next, all short reads were aligned to the IGC2 catalog, and gene abundances were subsequently profiled, downsized, and normalized. Based on the co-abundant marker genes of a given MSP, mean signals were used to estimate species abundance profiles. In total, 6014 shotgun metagenome samples were aligned against the gene catalog of the human gut microbiome and quantified at the level of MSP. (D) Heatmap showing the top 20 significantly overrepresented MSPs between western and nonwestern cohorts, colored by mean species Z-score for each country against all countries. (E) Monocle ordination of the gut microbiome. Individual samples from nonwestern and western countries were colored blue and orange, respectively. (F) Difference in gene content between western-enriched and nonwestern-enriched species. The gene content of these species was annotated for CAZymes, antimicrobial-resistance (AMR) genes, and virulence factors (PATRIC database) and summed across all species. The total number of genes in each category was normalized and plotted as a stacked bar plot to show regional overrepresentation (Methods).
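The abundance estimate described in panel C (mean signal over an MSP's co-abundant marker genes) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `msp_abundance`, the gene and MSP identifiers, and the choice to treat missing genes as zero signal are all assumptions made for illustration, and the input is assumed to be already downsized and normalized.

```python
from statistics import mean


def msp_abundance(gene_abundance, msp_markers):
    """Estimate per-MSP abundance in one sample as the mean marker-gene signal.

    gene_abundance: dict mapping gene id -> normalized abundance (hypothetical ids).
    msp_markers: dict mapping MSP id -> list of marker gene ids (hypothetical ids).
    """
    profile = {}
    for msp, genes in msp_markers.items():
        # Assumption: a marker gene absent from the sample contributes zero signal.
        signals = [gene_abundance.get(gene, 0.0) for gene in genes]
        profile[msp] = mean(signals) if signals else 0.0
    return profile


# Toy sample with three detected genes and two MSPs.
sample = {"gene_a": 4.0, "gene_b": 2.0, "gene_c": 9.0}
markers = {"msp_0001": ["gene_a", "gene_b"], "msp_0002": ["gene_c", "gene_d"]}
profile = msp_abundance(sample, markers)
print(profile)
```

Averaging over several marker genes rather than relying on a single gene makes the species estimate less sensitive to noise in any one gene's read count.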
Figure 2.
Pan-metagenomics association studies (Pan-MGAS) of 43 cohorts from 23 different diseases and 14 countries (n = 2185). (A) We identified significantly enriched (top) and depleted (bottom) species of cohorts based on the effect sizes (ESs) of Wilcoxon rank-sum one-sided tests (ES ≥ 0.3); the complete list of values is provided in Supplemental Table S5. The blue dotted line indicates ES = 0.3; the red dotted line indicates ES = 0.5; and each dot in the plot represents the ES of one MSP within one disease cohort. (B) Scatter plots of the frequency of the significantly enriched/depleted cohorts of all MSPs (ES > 0.3): Each point represents an MSP; all values in the plot are integers; and jitter was added to separate overlapping points. The y-axis displays the total frequency of enriched/depleted cohorts (number of enriched cohorts + number of depleted cohorts), and the x-axis displays the subtracted frequency between enriched cohorts and depleted cohorts (number of enriched cohorts − number of depleted cohorts). Point coloring is based on the number of different diseases for which an MSP had an ES above 0.3. Commonly enriched/depleted species among cohorts were identified when total frequency ≥ 3 and absolute subtracted frequency ≥ 2. (C) Species found depleted (Anaerostipes hadrus) and enriched (Fusobacterium nucleatum subsp. animalis) in most disease cohorts. The blue dotted line indicates ES = 0.3; the red dotted line indicates ES = 0.5; and each dot in the plot represents one MSP within one disease cohort.
Acronyms are as follows: (ACVD) acute coronary cardiovascular disease, (Ob) obesity, (CRC) colorectal cancer, (NSCLC) non-small-cell lung cancer, (RCC) renal cell carcinoma, (GDM) gestational diabetes mellitus, (T1D) type 1 diabetes, (T2D) type 2 diabetes, (LC) liver cirrhosis, (NAFLD) nonalcoholic fatty liver disease, (UC) ulcerative colitis, (CD) Crohn's disease, (BD) Behçet's disease, (RA) rheumatoid arthritis, (SPA) ankylosing spondylitis, (ME/CFS) myalgic encephalomyelitis/chronic fatigue syndrome, and (PD) Parkinson's disease.
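The counting rule in panel B can be written out as a short sketch. It assumes the per-cohort effect sizes for the enriched and depleted directions are already available as two dictionaries; the function name `classify_msp`, the cohort labels, and the ES values are hypothetical. The caption gives the cutoff both as ES ≥ 0.3 and as ES > 0.3, so the use of ≥ here is an assumption.

```python
ES_THRESHOLD = 0.3  # cutoff stated in the caption; >= vs > is an assumption


def classify_msp(enriched_es, depleted_es, threshold=ES_THRESHOLD):
    """Summarize one MSP across cohorts.

    enriched_es / depleted_es: dict mapping cohort id -> effect size of the
    one-sided test in that direction (hypothetical inputs).
    Returns (total frequency, subtracted frequency, label).
    """
    n_enriched = sum(1 for es in enriched_es.values() if es >= threshold)
    n_depleted = sum(1 for es in depleted_es.values() if es >= threshold)
    total = n_enriched + n_depleted        # y-axis of panel B
    subtracted = n_enriched - n_depleted   # x-axis of panel B
    if total >= 3 and abs(subtracted) >= 2:
        label = "commonly enriched" if subtracted > 0 else "commonly depleted"
    else:
        label = "not common"
    return total, subtracted, label


# Toy example with made-up cohort labels and effect sizes.
enriched = {"CRC_1": 0.52, "CRC_2": 0.41, "LC_1": 0.35, "T2D_1": 0.10}
depleted = {"UC_1": 0.31}
result = classify_msp(enriched, depleted)
print(result)
```

The two thresholds work together: the total frequency requires that a species is flagged in several cohorts, and the absolute subtracted frequency requires that those calls mostly point in one direction.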
Figure 3.
Analysis of functional clusters of the human gut microbiome. For the functional characterization of human gut MSPs, we annotated respective genes with 19,540 features of microbial function/phenotype databases and identified 7763 functional clusters better representing the microbiome. (A) Identification of functional clusters based on co-conserved molecular and biological functions across species. Unlike the manually curated module database, we identified functional clusters based on high co-conservation across species using an unsupervised clustering method. (B) The overall scheme of identification of functional clusters and checking functional coverage (cluster size) and taxonomic coverage (number of enriched species). (C) We found that among different sources of microbial functional annotations (e.g., KEGG module and pathway), co-conservation of molecular and biological functions across different species was low (Jaccard index < 0.5). (D) Functional clusters identified by unsupervised community detection. The y-axis displays the number of genes within the functional cluster (i.e., functional coverage), and the x-axis displays the number of MSPs possessing >70% of the cluster's genes (i.e., taxonomic coverage). (E) Functional clusters projected on enriched/depleted MSPs across disease cohorts. The scatter plot displays the frequency of functional clusters significantly associated with the enriched/depleted species (hypergeometric test P-value < 0.0001) in disease cohorts. Each point represents a gene cluster; all values in the plot are integers; and jitter was added to separate overlapping points. The y-axis shows the total number of cohorts in which a functional cluster was found significantly associated with enriched/depleted species. The x-axis shows the number of cohorts in which a function was found enriched minus the number in which it was found depleted.
Point colors change from red (left) to blue (right) according to x-axis values. Commonly enriched/depleted functional gene clusters among cohorts were identified when total frequency ≥ 3 and absolute subtracted frequency ≥ 2.
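Two quantities from panels C and D lend themselves to a small worked example: the Jaccard index used to gauge co-conservation, and taxonomic coverage (the number of MSPs holding more than 70% of a cluster's genes). The sketch below is illustrative only. The function names, the MSP and gene identifiers, and the choice to compute the Jaccard index over the sets of species carrying each function are assumptions, since the caption does not spell out the exact inputs.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def taxonomic_coverage(cluster_genes, msp_gene_sets, min_fraction=0.7):
    """Count MSPs that possess more than min_fraction of the cluster's genes."""
    cluster = set(cluster_genes)
    if not cluster:
        return 0
    return sum(
        1
        for genes in msp_gene_sets.values()
        if len(cluster & set(genes)) / len(cluster) > min_fraction
    )


# Co-conservation of two functions, compared by the species that carry them.
species_with_f1 = {"msp1", "msp2", "msp3"}
species_with_f2 = {"msp2", "msp3", "msp4"}
co_conservation = jaccard(species_with_f1, species_with_f2)

# One functional cluster of four genes checked against three MSP gene sets.
cluster = ["k1", "k2", "k3", "k4"]
msp_genes = {
    "msp1": {"k1", "k2", "k3", "k4", "k9"},  # 4/4 of the cluster
    "msp2": {"k1", "k2", "k3"},              # 3/4 of the cluster
    "msp3": {"k1", "k2"},                    # 2/4 of the cluster
}
functional_coverage = len(set(cluster))
coverage = taxonomic_coverage(cluster, msp_genes)
print(co_conservation, functional_coverage, coverage)
```

In this toy case the two functions share two of four species, so the Jaccard index sits exactly at 0.5, and two of the three MSPs pass the 70% cutoff.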
Figure 4.
Random forest (RF) models trained on multiple cohorts to discriminate between disease and healthy controls. (A) Schematic of the RF classification method. (B) AUROC scores for each disease RF classification model. (C) ROC curves of inter-cohort (top) and intra-cohort (bottom) validation for an RF model that predicts CRC. (D) Box plot of directional mean absolute SHAP scores for all disease predictive models. Red and blue boxes represent species classified as depleted/enriched by the effect size calculation. (E) Clustered heatmap (dendrogram omitted) of the most important species for prediction of 16 diseases by RF classification as calculated by directional mean SHAP score (rows contain at least one species with directional mean SHAP score above 0.0125 in any of the diseases; Methods). Positive values indicate that higher relative abundance is more likely to classify the disease versus healthy samples. Negative values indicate that lower relative abundance is more likely to classify the disease versus healthy samples. The right color bar indicates mean species bias for enrichment or depletion in all diseases. Acronyms are as follows: (CRC) colorectal cancer, (NSCLC) non-small-cell lung cancer, (RCC) renal cell carcinoma, (T1D) type 1 diabetes, (T2D) type 2 diabetes, (LC) liver cirrhosis, (NAFLD) nonalcoholic fatty liver disease, (CD) Crohn's disease, (RA) rheumatoid arthritis, (SPA) ankylosing spondylitis, (ME_CFS) myalgic encephalomyelitis/chronic fatigue syndrome, (IGT) impaired glucose tolerance, and (VKH) Vogt–Koyanagi–Harada disease.
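As a reminder of what the AUROC scores in panel B measure, the sketch below computes AUROC directly from its rank interpretation: the probability that a randomly chosen disease sample receives a higher score than a randomly chosen healthy sample, with ties counted as one half. It is not the authors' evaluation code. The function name `auroc`, the labels, and the scores are hypothetical, and the scores are assumed to be predicted disease probabilities from a classifier.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (disease, healthy) pairs ranked correctly.

    labels: 1 for disease, 0 for healthy control.
    scores: predicted disease score for each sample (higher = more disease-like).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy held-out cohort: three disease samples and three healthy controls.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auroc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; library implementations use a sort-based computation that gives the same value.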
