Global compositional and functional states of the human gut microbiome in health and disease

Sunjae Lee^#^{1,2}, Theo Portlock^#³, Emmanuelle Le Chatelier^#⁴, Fernando Garcia-Guevara^#^{1,3}, Frederick Clasen¹, Florian Plaza Oñate⁴, Nicolas Pons⁴, Neelu Begum¹, Azadeh Harzandi¹, Ceri Proffitt¹, Dorines Rosario¹, Stefania Vaga¹, Junseok Park⁵, Kalle von Feilitzen³, Fredric Johansson³, Cheng Zhang³, Lindsey A Edwards^{1,6}, Vincent Lombard^{7,8}, Franck Gauthier⁴, Claire J Steves⁹, David Gomez-Cabrero^{1,10,11}, Bernard Henrissat^{12,13}, Doheon Lee⁵, Lars Engstrand¹⁴, Debbie L Shawcross⁶, Gordon Proctor¹, Mathieu Almeida⁴, Jens Nielsen^{15,16}, Adil Mardinoglu^{1,3}, David L Moyes¹, Stanislav Dusko Ehrlich^{4,17}, Mathias Uhlen¹⁸, Saeed Shoaie^{19,3}

Affiliations

¹ Centre for Host-Microbiome Interactions, Faculty of Dentistry, Oral & Craniofacial Sciences, King's College London, SE1 9RT, United Kingdom.
² School of Life Sciences, Gwangju Institute of Science and Technology (GIST), 61005, Gwangju, Republic of Korea.
³ Science for Life Laboratory, KTH-Royal Institute of Technology, Stockholm, SE-171 21, Sweden.
⁴ University Paris-Saclay, INRAE, MetaGenoPolis, 78350 Jouy-en-Josas, France.
⁵ Department of Bio and Brain Engineering, KAIST, Yuseong-gu, Daejeon 305-701, Republic of Korea.
⁶ Institute of Liver Studies, Department of Inflammation Biology, School of Immunology and Microbial Sciences, King's College London, London SE5 9NU, United Kingdom.
⁷ INRAE, USC1408 Architecture et Fonction des Macromolécules Biologiques (AFMB), Marseille 13288, France.
⁸ Architecture et Fonction des Macromolécules Biologiques (AFMB), CNRS, Aix-Marseille University, Marseille 13288, France.
⁹ Department of Twin Research & Genetic Epidemiology, King's College London, London WC2R 2LS, United Kingdom.
¹⁰ Translational Bioinformatics Unit, Navarrabiomed, Universidad Pública de Navarra (UPNA), IdiSNA, 31008 Pamplona, Spain.
¹¹ Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.
¹² Department of Biological Sciences, King Abdulaziz University, Jeddah 21589, Saudi Arabia.
¹³ Department of Biotechnology and Biomedicine, Technical University of Denmark, DK-2800 Lyngby, Denmark.
¹⁴ Centre for Translational Microbiome Research (CTMR), Department of Microbiology, Tumour and Cell Biology, Karolinska Institutet, 171 65 Stockholm, Sweden.
¹⁵ Department of Biology and Biological Engineering, Chalmers University of Technology, SE-412 96 Gothenburg, Sweden.
¹⁶ BioInnovation Institute, DK-2200 Copenhagen N, Denmark.
¹⁷ Department of Clinical and Movement Neurosciences, University College London, London NW3 2PF, United Kingdom.
¹⁸ Science for Life Laboratory, KTH-Royal Institute of Technology, Stockholm, SE-171 21, Sweden; mathias.uhlen@scilifelab.se saeed.shoaie@kcl.ac.uk.
¹⁹ Centre for Host-Microbiome Interactions, Faculty of Dentistry, Oral & Craniofacial Sciences, King's College London, SE1 9RT, United Kingdom; mathias.uhlen@scilifelab.se saeed.shoaie@kcl.ac.uk.

^# Contributed equally.

PMID: 39038849
PMCID: PMC11293553
DOI: 10.1101/gr.278637.123

Sunjae Lee et al. Genome Res. 2024 Jul 23;34(6):967-978. doi: 10.1101/gr.278637.123.

Abstract

The human gut microbiota is of increasing interest, with metagenomics a key tool for analyzing bacterial diversity and functionality in health and disease. Despite increasing efforts to expand microbial gene catalogs and an increasing number of metagenome-assembled genomes, there have been few pan-metagenomic association studies and in-depth functional analyses across different geographies and diseases. Here, we explored 6014 human gut metagenome samples across 19 countries and 23 diseases by performing compositional, functional cluster, and integrative analyses. Using interpreted machine learning classification models and statistical methods, we identified *Fusobacterium nucleatum* and *Anaerostipes hadrus* with the highest frequencies, enriched and depleted, respectively, across different disease cohorts. Distinct functional distributions were observed in the gut microbiomes of both westernized and nonwesternized populations. These compositional and functional analyses are presented in the open-access Human Gut Microbiome Atlas, allowing for the exploration of the richness, disease, and regional signatures of the gut microbiota across different cohorts.

Figures

**Figure 1.**
Characterization of the global gut microbiome in health and disease. Pan-metagenomics association studies of health and disease. Corresponding data sets were publicly shared as a resource: the Human Gut Microbiome Atlas (HGMA). (A) The geographical distribution of the data sets used in this study (the number of samples is shown in parentheses). (B) Disease data sets of shotgun metagenomics used in this study. (C) The workflow of the metagenomic species pan-genome (MSP) quantification together with functional characterization. We first constructed 1989 MSPs for the gut microbiome by MSPminer based on co-abundant gene profiles, which give clues to identify gene cluster markers likely belonging to the same species. Next, all the short reads were aligned to the IGC2 catalog and, subsequently, gene abundances were profiled, downsized, and normalized. Based on co-abundant gene markers from the given MSP, mean signals were used to estimate species abundance profiles. In total, 6014 shotgun metagenome samples were aligned against the gene catalog of the human gut microbiome and quantified at the level of MSP. (D) Heatmap showing the top 20 significantly overrepresented MSPs between western and nonwestern cohorts colored by mean species Z-score for each country against all countries. (E) Monocle ordination of the gut microbiome. Individual samples from nonwestern and western countries were colored blue and orange, respectively. (F) Difference in gene content between western and nonwestern enriched species. The gene content of those species was annotated for CAZymes, antimicrobial-resistance (AMR) genes, and virulence factors (PATRIC database) and summed across all species. The total count for each gene category was normalized and plotted as a stacked bar plot to show regional overrepresentation (Methods).
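The abundance step in panel C (mean signal of an MSP's co-abundant marker genes) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `msp_abundance`, the gene and MSP identifiers, and the choice to treat undetected genes as zero are all assumptions made for illustration, and the input is assumed to be already downsized and normalized.

```python
from statistics import mean

def msp_abundance(gene_abundance, msp_markers):
    """Estimate one abundance value per MSP as the mean signal of its marker genes.

    gene_abundance: dict mapping gene id -> normalized abundance in one sample.
    msp_markers: dict mapping MSP id -> list of marker gene ids (hypothetical input).
    """
    profile = {}
    for msp, markers in msp_markers.items():
        # Assumption: a marker gene missing from the sample contributes zero signal.
        signals = [gene_abundance.get(gene, 0.0) for gene in markers]
        profile[msp] = mean(signals) if signals else 0.0
    return profile

# Hypothetical normalized gene abundances for a single sample.
sample = {"gene_a": 4.0, "gene_b": 2.0, "gene_c": 0.0, "gene_d": 6.0}
markers = {
    "msp_0001": ["gene_a", "gene_b", "gene_c"],
    "msp_0002": ["gene_d", "gene_e"],  # gene_e is not detected in this sample
}
profile = msp_abundance(sample, markers)
```

Applying the function to every sample would give an MSP-by-sample table of the kind the caption describes.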

**Figure 2.**
Pan-metagenomics association studies (Pan-MGAS) of 43 cohorts from 23 different diseases and 14 countries (n = 2185). (A) We identified significantly enriched (*top*) and depleted (*bottom*) species of cohorts based on the effect sizes (ESs) of Wilcoxon rank-sum one-sided tests (ES ≥ 0.3; i.e., each dot represents the ES of an MSP in each disease data set); the complete list of values is provided in Supplemental Table S5. The blue dotted line indicates ES = 0.3; the red dotted line indicates ES = 0.5; and each dot in the plot represents one MSP within one disease cohort. (B) Scatter plots of the frequency of the significantly enriched/depleted cohorts of all MSPs (ES > 0.3): Each point represents an MSP; all values in the plot are integers; and jitter was added to remove overlapping points. The y-axis displays the total frequency of enriched/depleted cohorts (number of enriched cohorts + number of depleted cohorts), and the x-axis displays the subtracted frequency between enriched cohorts and depleted cohorts (number of enriched cohorts − number of depleted cohorts). Point coloring is based on the number of different diseases for which an MSP had an ES above 0.3. Commonly enriched/depleted species among cohorts were identified when total frequency ≥ 3 and absolute subtracted frequency ≥ 2. (C) Species found depleted (*Anaerostipes hadrus*) and enriched (*Fusobacterium nucleatum subsp. animalis*) in most disease cohorts. The blue dotted line indicates ES = 0.3; the red dotted line indicates ES = 0.5; and each dot in the plot represents one MSP within one disease cohort. 
Acronyms are as follows: (ACVD) acute coronary cardiovascular disease, (Ob) obesity, (CRC) colorectal cancer, (NSCLC) non-small-cell lung cancer, (RCC) renal cell carcinoma, (GDM) gestational diabetes mellitus, (T1D) type 1 diabetes, (T2D) type 2 diabetes, (LC) liver cirrhosis, (NAFLD) nonalcoholic fatty liver disease, (UC) ulcerative colitis, (CD) Crohn's disease, (BD) Behçet's disease, (RA) rheumatoid arthritis, (SPA) ankylosing spondylitis, (ME/CFS) myalgic encephalomyelitis/chronic fatigue syndrome, and (PD) Parkinson's disease.
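The rule in panel B for calling a species commonly enriched or depleted (total frequency ≥ 3 and absolute subtracted frequency ≥ 2) can be sketched in Python as follows. The function names, the input layout, and the example values are hypothetical. The sketch takes the effect sizes as given and does not reproduce how they were derived from the Wilcoxon rank-sum tests. It applies ES ≥ 0.3 as in panel A, whereas panel B states ES > 0.3.

```python
ES_THRESHOLD = 0.3  # caption threshold (panel A uses >=, panel B uses >)

def count_cohorts(cohort_results, threshold=ES_THRESHOLD):
    """Count cohorts in which one MSP passes the ES threshold, by direction.

    cohort_results: list of (direction, effect_size) tuples, one per cohort,
    with direction being "enriched" or "depleted" (hypothetical layout).
    """
    enriched = sum(1 for d, es in cohort_results if d == "enriched" and es >= threshold)
    depleted = sum(1 for d, es in cohort_results if d == "depleted" and es >= threshold)
    return enriched, depleted

def classify_msp(enriched, depleted, min_total=3, min_abs_diff=2):
    """Apply the caption's rule: total >= 3 and |enriched - depleted| >= 2."""
    total = enriched + depleted   # y-axis of panel B
    diff = enriched - depleted    # x-axis of panel B
    if total >= min_total and abs(diff) >= min_abs_diff:
        return "commonly enriched" if diff > 0 else "commonly depleted"
    return "not common"

# Hypothetical effect sizes for three MSPs across disease cohorts.
results = {
    "msp_x": [("enriched", 0.45), ("enriched", 0.31), ("enriched", 0.52), ("depleted", 0.10)],
    "msp_y": [("enriched", 0.35), ("depleted", 0.40), ("depleted", 0.33)],
    "msp_z": [("depleted", 0.60), ("depleted", 0.30), ("depleted", 0.31), ("enriched", 0.20)],
}
calls = {msp: classify_msp(*count_cohorts(rows)) for msp, rows in results.items()}
```

Here `msp_y` passes the total-frequency cutoff but not the subtracted-frequency cutoff, so it is not called common in either direction.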

**Figure 3.**
Analysis of functional clusters of the human gut microbiome. For the functional characterization of human gut MSPs, we annotated respective genes with 19,540 features of microbial function/phenotype databases and identified 7763 functional clusters better representing the microbiome. (A) Identification of functional clusters based on co-conserved molecular and biological functions across species. Unlike the manually curated module database, we identified functional clusters based on high co-conservation across species using the unsupervised clustering method. (B) The overall scheme of identification of functional clusters and checking functional coverage (cluster size) and taxonomic coverage (number of enriched species). (C) We found that among different sources of microbial functional annotations (e.g., KEGG module and pathway), co-conservation of molecular and biological functions across different species was substantially low (Jaccard index < 0.5). (D) Functional clusters identified by unsupervised community detection. The y-axis displays the number of genes within the functional cluster (i.e., functional coverage), and the x-axis displays the number of MSPs possessing >70% of the clusters’ genes (i.e., taxonomic coverage). (E) Functional clusters projected on enriched/depleted MSPs across disease cohorts. The scatter plot displays the frequency of functional clusters significantly associated with the enriched/depleted species (hypergeometric test P-value < 0.0001) in disease cohorts. Each point represents a gene cluster; all values in the plot are integers; and jitter was added to remove overlapping points. The y-axis shows the total frequency of cohorts in which a functional cluster was found significantly associated with enriched/depleted species. The x-axis shows the difference in the number of cohorts in which a function was found enriched minus the frequency it was found depleted. 
Point colors changed from red (*left*) to blue (*right*) according to x-axis values. Common enriched/depleted functional gene clusters among cohorts were identified when total frequency ≥ 3 and absolute subtracted frequency ≥ 2.
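Three quantities named in this caption can be sketched in plain Python: the Jaccard index used to assess co-conservation (panel C), taxonomic coverage as the number of MSPs carrying more than 70% of a cluster's genes (panel D), and an upper-tail hypergeometric P-value (panel E). All function names and example data are hypothetical. The way the hypergeometric test is set up here (population = all MSPs, carriers = MSPs with the cluster, drawn = enriched or depleted MSPs) is an assumption, since the caption does not specify it.

```python
from math import comb

def jaccard(a, b):
    """Jaccard index of two collections treated as sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def taxonomic_coverage(cluster_genes, msp_genes, min_fraction=0.7):
    """Number of MSPs possessing more than min_fraction of the cluster's genes."""
    cluster = set(cluster_genes)
    if not cluster:
        return 0
    return sum(
        1 for genes in msp_genes.values()
        if len(cluster & set(genes)) / len(cluster) > min_fraction
    )

def hypergeom_upper_tail(population, carriers, drawn, overlap):
    """P(X >= overlap) when `drawn` items are sampled without replacement
    from `population` items, of which `carriers` have the property."""
    total = comb(population, drawn)
    hits = sum(
        comb(carriers, i) * comb(population - carriers, drawn - i)
        for i in range(overlap, min(carriers, drawn) + 1)
    )
    return hits / total

# Hypothetical species sets carrying two functions.
species_with_x = {"msp_a", "msp_b", "msp_c"}
species_with_y = {"msp_b", "msp_c", "msp_d"}
co_conservation = jaccard(species_with_x, species_with_y)

# Hypothetical cluster of four annotations and three MSP gene sets.
cluster = ["f1", "f2", "f3", "f4"]
msps = {"msp_a": ["f1", "f2", "f3"], "msp_b": ["f1", "f2"], "msp_c": ["f1", "f2", "f3", "f4"]}
coverage = taxonomic_coverage(cluster, msps)

# 10 MSPs in total, 4 carry the cluster, 3 are enriched, 2 of those carry it.
p_value = hypergeom_upper_tail(population=10, carriers=4, drawn=3, overlap=2)
```

In this toy example `msp_a` carries 75% of the cluster and counts toward coverage, while `msp_b` carries 50% and does not.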

**Figure 4.**
Random forest (RF) models trained on multiple cohorts to discriminate between disease and healthy controls. (A) Schematic of RF classification method. (B) AUROC scores for each disease RF classification model. (C) AUROC curves of inter-cohort (*top*) and intra-cohort (*bottom*) validation for an RF model that predicts CRC. (D) Box plot of directional mean absolute SHAP scores for all disease predictive models. Red and blue boxes represent species that were depleted/enriched using effect size calculation. (E) Clustered heatmap (dendrogram omitted) of the most important species for prediction of 16 diseases by RF classification as calculated by directional mean SHAP score (rows contain at least one species with directional mean SHAP score above 0.0125 in any of the diseases; Methods). Positive values indicate that higher relative abundance is more likely to classify the disease versus healthy samples. Negative values indicate that lower relative abundance is more likely to classify the disease versus healthy samples. The *right* color bar indicates mean species bias for enrichment or depletion in all diseases. Acronyms are as follows: (CRC) colorectal cancer, (NSCLC) non-small-cell lung cancer, (RCC) renal cell carcinoma, (T1D) type 1 diabetes, (T2D) type 2 diabetes, (LC) liver cirrhosis, (NAFLD) nonalcoholic fatty liver disease, (CD) Crohn's disease, (RA) rheumatoid arthritis, (SPA) ankylosing spondylitis, (ME_CFS) myalgic encephalomyelitis/chronic fatigue syndrome, (IGT) impaired glucose tolerance, and (VKH) Vogt–Koyanagi–Harada.
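The AUROC scores reported in panels B and C summarize how well a classifier ranks disease samples above healthy controls. As a minimal sketch (not the authors' implementation, and independent of any particular RF library), the score can be computed from predicted probabilities with the pairwise-ranking definition. The function name `auroc` and the example labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen disease sample (label 1)
    scores higher than a randomly chosen control (label 0); ties count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical held-out cohort: 1 = disease, 0 = healthy control.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # predicted disease probabilities
auc = auroc(labels, scores)
```

For inter-cohort validation, the labels and scores would come from a cohort held out of training. For intra-cohort validation, they would come from held-out samples of the same cohort.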

Grants and funding

WT_/Wellcome Trust/United Kingdom
