Nature. 2025 Feb;638(8051):718-728.
doi: 10.1038/s41586-024-08485-8. Epub 2025 Jan 29.

Expanding the human gut microbiome atlas of Africa

Dylan G Maghini et al. Nature. 2025 Feb.

Abstract

Population studies provide insights into the interplay between the gut microbiome and geographical, lifestyle, genetic and environmental factors. However, low- and middle-income countries, in which approximately 84% of the world's population lives [1], are not equitably represented in large-scale gut microbiome research [2-4]. Here we present the AWI-Gen 2 Microbiome Project, a cross-sectional gut microbiome study sampling 1,801 women from Burkina Faso, Ghana, Kenya and South Africa. By engaging with communities that range from rural and horticultural to post-industrial and urban informal settlements, we capture a far greater breadth of the world's population diversity. Using shotgun metagenomic sequencing, we identify taxa with geographic and lifestyle associations, including Treponema and Cryptobacteroides species loss and Bifidobacterium species gain in urban populations. We uncover 1,005 bacterial metagenome-assembled genomes, and we identify antibiotic susceptibility as a factor that might drive Treponema succinifaciens absence in urban populations. Finally, we find an HIV infection signature defined by several taxa not previously associated with HIV, including Dysosmobacter welbionis and Enterocloster sp. This study represents the largest population-representative survey of gut metagenomes of African individuals so far, and paired with extensive clinical biomarkers and demographic data, provides extensive opportunity for microbiome-related discovery.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Microbiome composition and diversity in the AWI-Gen 2 cohort.
a, Sample number and location of each study site. Countries containing sites are dark grey. b, Principal coordinate analysis of all samples on the basis of Bray–Curtis distance on species-level prokaryotic profiles. Study site is colour-coded and the boxplots show the samples per site projected onto the first and second principal coordinates. c, Prokaryotic diversity (inverse Simpson’s index after rarefaction) per site (Kruskal–Wallis test, P < 2 × 10⁻¹⁶, n = 1,796 after quality control and removing data from male individuals). d, Heatmap showing the number of prokaryotic species with high generalized fold change between sites; sites are clustered on the basis of this number of species. e, The log10(relative abundance) of genera with the highest variance in fold change and median across sites. f, The log10 of the mean relative abundance per site is shown for all species within the genera shown in e. For Prevotella, Oribacterium, Cryptobacteroides and Treponema, all species with scientific names are highlighted; only the most abundant species with scientific names are indicated for the other genera. All panels represent data from n = 1,796 biologically independent samples. Boxplot boxes denote the interquartile range (IQR), thick black lines indicate the median, and whiskers indicate the most extreme points within 1.5-fold IQR. Supplementary Methods contain photographs and further information for each site. Source Data
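As a rough illustration of the two metrics named in Fig. 1b,c, the following sketch computes Bray–Curtis dissimilarity and the inverse Simpson index on rarefied counts in plain Python. It is not the authors' pipeline: the function names, the toy counts and the rarefaction depth are assumptions made here for illustration only.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from per-species counts."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand counts into one species label per read, then draw without replacement.
    reads = [species for species, n in enumerate(counts) for _ in range(n)]
    drawn = random.Random(seed).sample(reads, depth)
    rarefied = [0] * len(counts)
    for species in drawn:
        rarefied[species] += 1
    return rarefied


def inverse_simpson(counts):
    """Inverse Simpson index: 1 / sum(p_i^2) over species proportions p_i."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return 1.0 / sum((n / total) ** 2 for n in counts if n > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / (sum a_i + sum b_i)."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same species")
    denominator = sum(a) + sum(b)
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


# Hypothetical read counts for four species in two samples.
sample_a = [40, 30, 20, 10]
sample_b = [70, 0, 25, 5]

rarefied_a = rarefy(sample_a, depth=50)
print("inverse Simpson (rarefied):", inverse_simpson(rarefied_a))
print("Bray-Curtis:", bray_curtis(sample_a, sample_b))
```

A perfectly even community of k species gives an inverse Simpson index of k, and two samples that share no species give a Bray–Curtis dissimilarity of 1, which makes both metrics easy to sanity-check on toy data.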
Fig. 2. Prokaryotic novelty and features.
a, Phylogenetic tree of 2,584 dereplicated bacterial MAGs. Outer ring indicates study site of origin, inner ring indicates GTDB phylum and teal branches indicate Spirochaetota. b,c, Total number of previously unknown bacterial genomes by phylum (b) and new and existing bacterial genomes in the AWI-Gen assemblies (c), relative to the UHGG collection. Only representative genomes are shown. d, Archaeal phyla and species found in the AWI-Gen 2 genome collection. e,f, Self-reported antibiotic use (e) and hip circumference in centimetres (f) of n = 617 individuals from Nanoro, Burkina Faso and Navrongo, Ghana with and without T. succinifaciens present in the gut microbiome. Differences were tested with a linear model accounting for site (e) and for site and antibiotic history (f) as random effects. Boxplot boxes denote the IQR, thick black lines indicate the median and whiskers indicate the most extreme points within 1.5-fold IQR. g, Prevalence of antibiotic resistance genes in n = 244 T. succinifaciens MAGs, ordered by drug class. h, Prevalence of glycoside hydrolase genes among the six species of Spirochaetota with the largest number of MAGs in the AWI-Gen 2 genome collection. Only glycoside hydrolases present in at least 5% of the genomes are shown. Source Data
Fig. 3. Viral novelty and diversity in the AWI-Gen 2 cohort.
a, Prevalence of viral genomes found in at least 18 individuals (approximately 1% of the AWI-Gen 2 population). Prevalence is measured as the proportion of the population in a given site that yielded an assembled viral genome that shares 95% ANI with the representative viral genome. Viral genomes that are new relative to the MGV catalogue and viral genomes that fall under the Crassvirales order are highlighted. b, Total number of new and existing viral genomes relative to MGV. c, Phage richness (number of phage species clusters present in each sample) per site (Kruskal–Wallis test, P < 2 × 10⁻¹⁶, n = 1,796). Boxplot boxes denote the IQR, thick black lines indicate the median and whiskers indicate the most extreme points within 1.5-fold IQR. d, Prevalence of Crassvirales viruses and prototypical crAssphage by site, determined by read-level abundance. e, Genome maps of nine previously unknown jumbophages with genome annotations and length in kb, and count of notable genetic features in each genome. Source Data
Fig. 4. Microbial composition and diversity in PLWH.
a, Number of seronegative individuals (HIV−) and people living with HIV (PLWH) on antiretroviral treatment. b, Prokaryotic diversity (inverse Simpson’s index on rarefied counts) by site and HIV status. Points represent individual samples. Differences in alpha-diversity for each individual site were tested with two-sided analysis of variance and for all sites combined with a linear mixed effect model accounting for site as a random effect. c, Principal coordinate analysis of species-level Bray–Curtis distance. Points represent individual samples, coloured by site, and PLWH are shaded. Boxplots show samples by HIV status projected onto the principal coordinates. d, Differentially abundant species (q-value < 0.01) determined by a linear mixed effect model. Species with q-value < 1 × 10⁻⁵ are annotated. Shading indicates abundance fold change between seronegative individuals and PLWH. Black bars indicate previously unknown taxa from this study. e, Receiver operating characteristic (ROC) curves for machine learning models trained to distinguish HIV status on samples from each site or for all samples. Shading indicates the 95% confidence interval and numbers indicate the area under the ROC curve (AU-ROC). f, AU-ROC for machine learning model evaluation. Models trained on each site were applied to the other sites and external predictions were evaluated by means of AU-ROC. In leave-one-site-out (LOSO) validation, models were trained on two sites and validated on the left-out site. g, Fraction of samples from other sites predicted to be positive calibrated at a 5% false positive rate (indicated by dashed black line). For DIMAMO, HIV status is known and false positive rate and true positive rate can be evaluated. Serostatus is unknown for individuals in Nanoro and Navrongo, but HIV prevalence is expected to be below 2%. Boxplot boxes denote the IQR, thick black lines indicate the median and whiskers indicate the most extreme points within 1.5-fold IQR. Source Data
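The calibration described in Fig. 4g can be pictured with a short sketch: pick a score cutoff from samples with known negative status so that about 5% of them are called positive, then apply that cutoff to samples from another site. The caption does not say how the authors implemented this step, so the function names, the cutoff rule and the scores below are all hypothetical.

```python
import math


def threshold_at_fpr(negative_scores, fpr=0.05):
    """Return a cutoff such that at most `fpr` of known negatives score above it."""
    if not negative_scores:
        raise ValueError("need at least one negative score")
    if not 0 <= fpr < 1:
        raise ValueError("fpr must be in [0, 1)")
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives we tolerate being called positive.
    allowed = math.floor(fpr * len(ranked))
    # Scores strictly above this value are called positive.
    return ranked[allowed]


def fraction_predicted_positive(scores, threshold):
    """Share of samples whose model score exceeds the calibrated cutoff."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(score > threshold for score in scores) / len(scores)


# Hypothetical model scores for 100 seronegative training samples.
internal_negatives = [i / 100 for i in range(100)]
cutoff = threshold_at_fpr(internal_negatives, fpr=0.05)

# Hypothetical scores for samples from a different site.
external_scores = [0.20, 0.96, 0.99, 0.50]
print("cutoff:", cutoff)
print("external fraction positive:", fraction_predicted_positive(external_scores, cutoff))
```

If the external fraction is close to the internal 5% rate at a site where few positives are expected, the model is behaving as its calibration predicts; a much higher fraction would point to site-specific shifts in the score distribution.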
Extended Data Fig. 1. Overview of the AWI-Gen 2 Microbiome study.
a) Organizational chart of the AWI-Gen 2 project. The partnership, funded by the National Institutes of Health under the umbrella of the Human Heredity and Health in Africa consortium (H3Africa), includes five Health and Demographic Surveillance Sites (HDSSs) and the Soweto MRC/Wits Developmental Pathways for Health Research Unit (DPHRU). The HDSSs and DPHRU are managed by the Clinical Research Unit of Nanoro Institut de Recherche en Sciences de la Santé (CRUN/IRSS), Navrongo Health Research Centre (NHRC), University of Limpopo Population Health Research Centre (UoL–PHRC), University of the Witwatersrand and the South African Medical Research Council (Wits/MRC), and African Population Health and Research Center (APHRC). Researchers from Stanford University and the University of the Witwatersrand led the microbiome analysis. b) Timeline of the AWI-Gen 2 microbiome study research activities, including study administration, sample collection, and community engagement. During both AWI-Gen phases, researchers led microbiome and bioinformatic workshops for local researchers. Community engagement preceded sample collection at all sites, and participants with concerning health-related results were referred to their local healthcare facilities in accordance with site-specific protocols. Community engagement in Nairobi continued intermittently throughout sample collection to accommodate roadblocks during the COVID-19 pandemic. Post-study engagement was conducted at all sites, and microbiome-specific return of results is complete at three study sites.
Extended Data Fig. 2. Microbiome composition of male and female participants in Navrongo, Ghana.
a) Prokaryotic richness (number of prokaryotic species present at ≥ 1 × 10⁻⁴% relative abundance after rarefaction, see Methods) in n = 16 males and n = 218 females in Navrongo, Ghana (Wilcoxon test, P = 0.027). Points indicate individual samples. (In total, 19 samples from male participants were sequenced.) b) Generalized fold change between male and female participants for all species with a prevalence higher than 5% in Navrongo is plotted against the negative log10-transformed q-value (Benjamini-Hochberg corrected p-value). Positive values correspond to higher relative abundance in males, whereas negative fold change values indicate higher relative abundance in female participants. No species meet the threshold of significance after correction for multiple testing. For all boxplots, boxes denote the interquartile range (IQR) with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR. Source Data
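For readers unfamiliar with the q-values in panel b, the sketch below shows the standard Benjamini-Hochberg step-up adjustment in plain Python. The p-values are made up for illustration and the function name is an assumption; the authors' own code is not shown in this record.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues


# Hypothetical per-species p-values.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
print(example_q)
```

Each q-value is the smallest false discovery rate at which that species would be called significant, which is why a species can have a small raw p-value yet fail the corrected threshold when many species are tested.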
Extended Data Fig. 3. Sequencing depth and database effects on taxonomic classification.
a) Reads per sample throughout quality control, including original read count, and reads remaining after deduplication, end trimming, and removal of host reads. P-values indicate Kruskal-Wallis tests with Benjamini-Hochberg multiple testing correction. b) Spearman correlation coefficient (Spearman’s ρ) and R² for a linear model between phage richness (number of assembled phages) or prokaryotic richness (number of prokaryotic species present at ≥ 1 × 10⁻⁴% relative abundance after rarefaction) and total read count and total assembly length (length of the total assembly in base pairs). Points represent individual samples. Blue line indicates a linear association model with 95% confidence intervals shown as shaded areas. c) Count and relative abundance of unassigned reads per sample, as estimated by the mOTUs profiler using the original database (v3.0.3) or extended database. d) Spearman’s ρ between the original and extended database for each sample, separated by study site. Prokaryotic species with abundance of zero in both the original and extended database were removed on a per-sample basis. e) The cumulative abundance of the genomes added to the database for profiling is shown for each sample, separated by study site. Figures represent data from n = 1,796 samples. For all boxplots, boxes denote the interquartile range (IQR) with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR. Source Data
Extended Data Fig. 4. Phylum-level differences between AWI-Gen sites and in external datasets.
a) Spearman correlation coefficient (Spearman’s ρ) between principal coordinate values and the relative abundance of selected prokaryotic phyla. Phyla with an absolute correlation coefficient higher than 0.5 for either of the first two principal coordinates are shown (see Fig. 1 in the main text). Points represent individual samples and are coloured by site. b) Principal coordinate analysis of all AWI-Gen 2 samples based on Bray-Curtis distance on species-level prokaryotic profiles together with other large datasets, color-coded by study. Franzosa et al. and Schirmer et al. are datasets collected in the USA and the Netherlands, focusing on patients with inflammatory bowel disease and healthy controls, respectively. Yachida et al. is a dataset from Japan for the study of colorectal cancer. c) Relative abundance of the most abundant phyla across the different datasets. Phyla are ordered by mean abundance across all included samples. Figures represent data from n = 1,796 AWI-Gen, n = 220 Franzosa et al., n = 471 Schirmer et al., and n = 645 Yachida et al. samples. For all boxplots, boxes denote the interquartile range (IQR) with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR. Source Data
Extended Data Fig. 5. Metadata correlation and distance-based redundancy analysis.
a) Pearson correlation coefficient (Pearson’s r) between available participant covariates, calculated on all participants included in the site comparison (n = 1,796). Non-numerical covariates were transformed into numerical values based on ordered factor levels (see Supplementary Methods). Asterisks indicate highly correlated covariates (Pearson’s r ≥ 0.8). In those cases, the covariate that explained the higher amount of variance in the prokaryotic composition (see panel b) was selected (redundant variables are indicated by grey labels). b) The amount of variance in the prokaryotic composition that is explained by covariates in distance-based redundancy analysis. Blue bars indicate single-covariate models (each covariate associated with prokaryotic composition individually), whereas orange bars show the amount of variance explained in the iterative model in which the variable explaining the most additional variation is added iteratively to a multi-covariate model (see Methods). Covariates below the dashed line were removed before the iterative modelling since they were highly correlated with other covariates. BMI: body mass index, MVPA: moderate to vigorous physical activity, LDL: low-density lipoproteins, HDL: high-density lipoproteins, VAT: visceral adipose tissue, SCAT: subcutaneous adipose tissue, cIMT: carotid intima-media thickness. Source Data
Extended Data Fig. 6. Site-level prevalence and differential abundance of microbial taxa.
a) The prevalence per site is shown for all prokaryotic species with prevalence higher than 5% in at least 2 sites (n = 1,071 species), clustered using the Ward algorithm as implemented in the R stats v4.2.2 package. Spearman correlation between sites is shown on the right. Population prevalence is calculated for each study site, where prevalence of zero indicates that the species is absent in all individuals in a site, and prevalence of one indicates that the species is present in all individuals in a site. b) The mean log10-transformed abundance of the same prokaryotic species as in a). Species that belong to the genera with the highest variance in fold change across all sites are highlighted by colours. c) The log10-transformed relative abundance of the genus Prevotella plotted against the relative abundance for the genera Bacteroides and Phocaeicola. Points represent n = 1,796 individual samples, coloured by site. d) The fraction of samples in which both Prevotella and either Bacteroides or Phocaeicola are present (relative abundance ≥ 1 × 10⁻⁴) is shown across sites, indicating that these genera co-exist in most samples. Source Data
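The per-site prevalence defined in panel a (0 when a species is absent from every individual at a site, 1 when it is present in all of them) can be sketched as follows. The presence cutoff of 1 × 10⁻⁴ relative abundance is borrowed from panel d as an assumption, and the function name and abundance values are hypothetical.

```python
from collections import defaultdict


def prevalence_by_site(abundances, sites, threshold=1e-4):
    """Fraction of samples per site in which a species reaches `threshold`."""
    if len(abundances) != len(sites):
        raise ValueError("need one site label per sample")
    present = defaultdict(int)
    total = defaultdict(int)
    for abundance, site in zip(abundances, sites):
        total[site] += 1
        if abundance >= threshold:
            present[site] += 1
    return {site: present[site] / total[site] for site in total}


# Hypothetical relative abundances of one species in six samples.
abundances = [0.02, 0.0, 0.001, 0.0, 0.0, 0.00005]
sites = ["Nanoro", "Nanoro", "Nanoro", "Soweto", "Soweto", "Soweto"]
print(prevalence_by_site(abundances, sites))
```

Running the function once per species yields the site-by-species matrix that a heatmap like panel a would display.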
Extended Data Fig. 7. Prokaryotic novelty in the AWI-Gen 2 cohort.
a) Total number of novel and existing prokaryotic proteins in the AWI-Gen assemblies, relative to UHGP. Only representative proteins after feature clustering are represented. Number of novel b) prokaryotic genomes relative to the UHGG and c) prokaryotic proteins relative to the UHGP95 present in each sample. Points indicate the number of genome or protein clusters present per sample (n = 1,820 total samples) that are not found in respective feature databases. For all boxplots, boxes denote the interquartile range (IQR) with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR. d) Comparison of number of representative genomes contributed by several metagenomic gut microbiome studies, including the UHGG (global), Carter et al. (Tanzania), Yachida et al. (Japan), Franzosa et al. (USA, Netherlands), Schirmer et al. (western Europe), and Lochlainn et al. (United Kingdom). The UpSet plot shows the number of genomes that are shared between or unique to each study. Note that Carter et al. performed ultra-deep metagenomic sequencing, leading to a high number of MAGs generated per individual sample. Rarefaction curves of the number of e) prokaryotic genomes and f) prokaryotic proteins detected as a function of the number of individuals sampled, by study site or from the full AWI-Gen sample set (grey). Each random subset was repeated a hundred times, and lines represent the mean feature count and standard deviation. Source Data
Extended Data Fig. 8. Features of Treponema succinifaciens metagenome-assembled genomes (MAGs).
a) Number of high-quality T. succinifaciens metagenome-assembled genomes by study site. b) Distribution of the length, in megabase pairs (Mbp), of each T. succinifaciens MAG. MAGs from Soweto are not pictured, as Soweto samples only contained two MAGs. c) Number of genes in each MAG that were classified as core (≥ 80% prevalence), shell (25 ≤ prevalence < 80%), or cloud genes (< 25% prevalence) in the complete MAG set. d) Midpoint-rooted phylogenetic tree of T. succinifaciens MAGs from this study (noted in pink inner ring) and public data sets (n = 430 total genomes). Middle ring indicates the country of origin, and outer ring indicates the continent of origin. White line and asterisk indicate the T. succinifaciens DSM 2489 type strain reference genome. PERMANOVA test indicates significant difference in phylogenetic distance by country of origin (P = 0.001). Source Data
Extended Data Fig. 9. Additional characterization of viral novelty and diversity in the AWI-Gen 2 cohort.
Number of novel viral genomes relative to the MGV (a) and the Zolfo et al. viral catalogue (b) present in each sample (n = 1,820 total samples). Points indicate the number of genome clusters present per sample that are not found in respective feature databases. c) Rarefaction curves of the number of viral genomes detected as a function of the number of individuals sampled, by study site or from the full AWI-Gen sample set (grey). Each random subset was repeated a hundred times, and lines represent the mean feature count. d) Spearman correlation coefficient (Spearman’s ρ) between prokaryotic richness and viral richness. Points represent individual samples. e) Viral richness per sample, based on Phanta profiles (number of phage species clusters present at ≥ 10⁻⁵% relative abundance). f) Prevalence of jumbophages across sites, where prevalence indicates the percent of individuals at each site with 0.1× coverage of the indicated jumbophage genome, as measured by CoverM (see Methods). All colours indicate site, using the colour code in panel a. For all boxplots, boxes denote the interquartile range (IQR) with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR. Source Data
Extended Data Fig. 10. Phage, prokaryotic, and phenotypic differences in ART+ and ART− PLWH.
a) Prokaryotic diversity (inverse Simpson’s index after rarefaction) and phage richness (species present at ≥ 10⁻⁵% abundance) by HIV and antiretroviral therapy status. Points represent individual samples. Differences in diversity by site were tested with ANOVA and across sites with a linear mixed effect model accounting for site as a random effect. b) Generalized fold change (gFC) for all species in HIV+ ART+ relative to HIV− individuals and for HIV+ ART− relative to HIV− individuals. Species are coloured by q-value in the HIV− vs HIV+ ART+ comparison. Species with an absolute gFC ≥ 0.3 in the HIV− vs HIV+ ART− comparison (that do not exhibit a gFC ≥ 0.3 in the HIV− vs HIV+ ART+ comparison) are annotated. c) Prediction from a machine learning model trained on prokaryotic data from HIV− and HIV+ ART+ participants and applied to HIV+ ART− participants. Sample fraction predicted to be positive at a 5% internal false positive rate (dashed line) is listed below. d) HIV-associated effect size for prokaryotic and phage species. Species are coloured by q-value. e) Receiver-operating characteristic (ROC) for models trained to distinguish HIV status using phage composition. Shading indicates 95% confidence intervals and numbers show area under the ROC curve (AU-ROC). f) AU-ROC for models trained on participants from each site (panel e) and applied to other sites. Models were trained on two sites and validated on the left-out site for leave-one-site-out (LOSO) validation. g) Statistics for age, waist-to-hip ratio, cholesterol, and glucose for individuals who are HIV seronegative and seropositive on ART. All p-values result from Wilcoxon rank-sum tests. For all panels, n = 129 HIV+ ART+, n = 28 HIV+ ART−, n = 719 HIV−. For all boxplots, boxes denote the interquartile range (IQR) with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR. Source Data

References

    1. World Bank Open Data. Population, total – Low & middle income, High income. https://data.worldbank.org/indicator/SP.POP.TOTL?locations=XO-XD (2023).
    2. Brewster, R. et al. Surveying gut microbiome research in Africans: toward improved diversity and representation. Trends Microbiol. 27, 824–835 (2019).
    3. Allali, I. et al. Human microbiota research in Africa: a systematic review reveals gaps and priorities for future research. Microbiome 10, 10 (2022).
    4. Abdill, R. J., Adamowicz, E. M. & Blekhman, R. Public human microbiome data are dominated by highly developed countries. PLoS Biol. 20, e3001536 (2022).
    5. Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).
