A realistic benchmark for differential abundance testing and confounder adjustment in human microbiome studies

doi:10.1186/s13059-024-03390-9

Genome Biol. 2024 Sep 25;25(1):247.

Jakob Wirbel^#¹, Morgan Essex^#^{2,3,4}, Sofia Kirke Forslund^{5,6,7,8,9}, Georg Zeller^{10,11,12}

Affiliations

¹ Structural and Computational Biology Unit (SCB), European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.
² Experimental and Clinical Research Center (ECRC), a cooperation of the Max-Delbrück Center and Charité-Universitätsmedizin, Berlin, Germany.
³ Max-Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany.
⁴ Charité-Universitätsmedizin Berlin (a corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin), Berlin, Germany.
⁵ Structural and Computational Biology Unit (SCB), European Molecular Biology Laboratory (EMBL), Heidelberg, Germany. sofia.forslund@mdc-berlin.de.
⁶ Experimental and Clinical Research Center (ECRC), a cooperation of the Max-Delbrück Center and Charité-Universitätsmedizin, Berlin, Germany. sofia.forslund@mdc-berlin.de.
⁷ Max-Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany. sofia.forslund@mdc-berlin.de.
⁸ Charité-Universitätsmedizin Berlin (a corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin), Berlin, Germany. sofia.forslund@mdc-berlin.de.
⁹ German Center for Cardiovascular Research (DZHK), Partner Site Berlin, Berlin, Germany. sofia.forslund@mdc-berlin.de.
¹⁰ Structural and Computational Biology Unit (SCB), European Molecular Biology Laboratory (EMBL), Heidelberg, Germany. georg.zeller@gmail.com.
¹¹ Center for Infectious Diseases (LUCID), Leiden University, Leiden University Medical Center (LUMC), Leiden, Netherlands. georg.zeller@gmail.com.
¹² Center for Microbiome Analyses and Therapeutics (CMAT), Leiden University Medical Center, Leiden, Netherlands. georg.zeller@gmail.com.

^# Contributed equally.

PMID: 39322959
PMCID: PMC11423519
DOI: 10.1186/s13059-024-03390-9

Jakob Wirbel et al. Genome Biol. 2024.

Abstract

Background: In microbiome disease association studies, it is a fundamental task to test which microbes differ in their abundance between groups. Yet, consensus on suitable or optimal statistical methods for differential abundance testing is lacking, and it remains unexplored how these methods cope with confounding. Previous differential abundance benchmarks relying on simulated datasets did not quantitatively evaluate the similarity of their simulations to real data, which undermines their recommendations.

Results: Our simulation framework implants calibrated signals into real taxonomic profiles, including signals mimicking confounders. Using several whole metagenome and 16S rRNA gene amplicon datasets, we validate that our simulated data resembles real data from disease association studies much more closely than the simulated data of previous benchmarks. With extensively parametrized simulations, we benchmark the performance of nineteen differential abundance methods and further evaluate the best ones on confounded simulations. Only classic statistical methods (linear models, the Wilcoxon test, t-test), limma, and fastANCOM properly control false discoveries at relatively high sensitivity. When additionally considering confounders, these issues are exacerbated, but we find that adjusted differential abundance testing can effectively mitigate them. In a large cardiometabolic disease dataset, we showcase that failure to account for covariates such as medication causes spurious associations in real-world applications.

Conclusions: Tight error control is critical for microbiome association studies. The unsatisfactory performance of many differential abundance methods and the persistent danger of unchecked confounding suggest these contribute to a lack of reproducibility among such studies. We have open-sourced our simulation and benchmarking software to foster a much-needed consolidation of statistical methodology for microbiome research.

Keywords: Benchmark; Confounding; Differential abundance; Metagenomics; Microbiome.

Conflict of interest statement

None of the authors declare any competing interests.

Figures

**Fig. 1**
Signal implantation, but not parametric simulations, can reproduce key characteristics of metagenomic data and realistic disease effects. a Principal coordinate projections on log-Euclidean distances for real samples (from Zeevi et al. [33], which served as a baseline data set) and representative samples of data simulated in a case–control setting (groups 1 and 2) using different simulation frameworks or signal implantation. For each method, the results from a single repetition and a fixed effect size are shown (abundance scaling factor of 2 with an additional prevalence shift of 0.2 in our simulations, see the “Methods” section and Additional File 1: Fig. S4 for the complete parameter space). b Distributions of log-transformed feature variances shown for the real and simulated data from a. c The area under the receiver operating characteristic curve (AUROC) values from machine learning models (see the “Methods” section) to distinguish between real and simulated samples are shown across all simulated data sets in cyan. As complementary information, the log-transformed F values from PERMANOVA tests are shown in brown. Sparsity (fraction of taxa with zero abundance in a sample) is shown below in magenta. Boxes denote the interquartile range (IQR) across all values with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR. d The absolute generalized fold change [6] and the absolute difference in prevalence across groups are shown for all features in colorectal cancer (CRC) and Crohn’s disease (CD). As a comparison, the same values are displayed for two data sets simulated using signal implantation (abundance scaling factor of 2, prevalence shift of 0.2), with implantations either into any feature or only low-abundance features (see the “Methods” section). Well-described disease-associated features are highlighted (F: *Faecalibacterium*, R: *Ruminococcus*), and selected bacterial taxa and simulated features are shown as percentile plots in e
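
To make the two effect-size parameters named in this caption concrete, the following is a minimal sketch of how an abundance scaling factor and a prevalence shift could be implanted into a single taxon of a real profile. It is not the authors' implementation: the function name `implant_signal`, the swap-based prevalence mechanism, and the toy abundance values are all assumptions chosen for illustration.

```python
import random

def implant_signal(abundances, labels, scale=2.0, prev_shift=0.2, seed=0):
    """Implant a hypothetical effect into one taxon of a real profile.

    abundances: per-sample abundances of a single taxon
    labels:     per-sample group labels (1 = control, 2 = case)
    """
    rng = random.Random(seed)
    values = list(abundances)
    cases = [i for i, g in enumerate(labels) if g == 2]
    controls = [i for i, g in enumerate(labels) if g == 1]

    # Prevalence shift (assumed mechanism): swap non-zero control values
    # with zero case values, so the taxon is detected in more cases.
    donors = [i for i in controls if values[i] > 0]
    receivers = [i for i in cases if values[i] == 0]
    n_swap = min(round(prev_shift * len(cases)), len(donors), len(receivers))
    for d, r in zip(rng.sample(donors, n_swap), rng.sample(receivers, n_swap)):
        values[d], values[r] = values[r], values[d]

    # Abundance scaling: multiply all case abundances by the scaling factor.
    for i in cases:
        values[i] *= scale
    return values

def prevalence(values, labels, group):
    """Fraction of samples in `group` with non-zero abundance."""
    members = [v for v, g in zip(values, labels) if g == group]
    return sum(v > 0 for v in members) / len(members)

# Toy data (made up): 10 controls followed by 10 cases.
labels = [1] * 10 + [2] * 10
taxon = [0.0, 0.0, 0.01, 0.02, 0.0, 0.03, 0.01, 0.0, 0.02, 0.01,
         0.0, 0.01, 0.0, 0.02, 0.0, 0.0, 0.01, 0.0, 0.03, 0.0]
implanted = implant_signal(taxon, labels)
print(prevalence(taxon, labels, 2), prevalence(implanted, labels, 2))  # 0.4 0.6
```

Because the values are swapped rather than generated, the implanted feature keeps the empirical abundance distribution of the real taxon, which is the property the caption contrasts with parametric simulation.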

**Fig. 2**
Performance evaluation of differential abundance testing methods and simulation strategies. For a signal implantation simulation with a single, moderate effect size combination (abundance scaling factor of 2, prevalence shift of 0.2, all features eligible for implantation), a the mean observed FDR and b the mean observed recall (calculated after Benjamini-Hochberg (BH) correction of raw P values) are shown for all included DA test methods across different sample sizes (see the “Methods” section). Additionally, mean AUROC values for differentiating between implanted and background features (calculated from raw P values) are shown in c. The nominally expected value of a 5% FDR is indicated by a dotted black line in panel a. Since ANCOM does not return P values (see the “Methods” section), observed FDR and recall were based on the recommended cutoffs (without adjustment) and therefore highlighted by dashed lines. Marginal annotations of method ranks correspond to the ranking based on AUROC values, with methods without sufficient FDR control ranked last (see panel d). d The mean AUROC values across all effect sizes and repetitions for the sample sizes 50, 100, and 200 (shaded area in a–c) are depicted in the heatmap for the different simulation strategies and baseline datasets, including non-gut human-associated microbiomes. The AUROC values of methods that exceeded a mean observed FDR of 10% in more than 10% of test settings (combination of effect and sample sizes) are shown in gray, whereas methods with sufficient FDR control are colored in shades of green. Methods with sufficient FDR control are ordered by their AUROC values on the Zeevi WGS gut dataset. For some simulations, the mean AUROC values were below 0.50 (indicated by < 0.5) or did not produce results in the allotted time (48 h for each combination of effect size, repetition, and sample size variation; indicated by stars)
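
The observed FDR and recall described in this caption can be sketched as follows, assuming a vector of raw P values and a ground-truth indicator of which features were implanted. The function names, the `<=` comparison against the 5% threshold, and the toy P values are assumptions for illustration and are not taken from the authors' benchmarking software.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def fdr_and_recall(p_values, is_implanted, alpha=0.05):
    """Observed FDR and recall after BH correction at level alpha."""
    adjusted = benjamini_hochberg(p_values)
    called = [q <= alpha for q in adjusted]
    tp = sum(c and t for c, t in zip(called, is_implanted))
    fp = sum(c and not t for c, t in zip(called, is_implanted))
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / sum(is_implanted) if any(is_implanted) else 0.0
    return fdr, recall

# Toy example (made up): three implanted and three background features.
raw_p = [0.001, 0.004, 0.03, 0.2, 0.5, 0.012]
truth = [True, True, True, False, False, False]
print(fdr_and_recall(raw_p, truth))  # (0.25, 1.0)
```

In this toy case four features pass the threshold, one of which is a background feature, so the observed FDR of 25% exceeds the nominal 5% even though every implanted feature is recovered.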

**Fig. 3**
Loss of precision and recall under confounding can be alleviated by statistical adjustment. a Using a single dataset, DA features were independently implanted into a small proportion of taxa for both a main group label (as described above) and for an independent binary (confounder) label, imitating, e.g., disease and medication status labels, respectively. Subsets for DA testing were generated using a parameterized resampling technique such that the degree of association (measured by ϕ) between these two variables could be modified (i.e., deliberately biased). b Generalized fold change (gFC) calculated for the label is contrasted to the gFC calculated for differences between confounder values across all bacterial taxa (abundance scaling factor of 2, prevalence shift of 0.2, all features eligible for implantation, a single representative repeat shown). Bars at the right visualize the confounder strength by showing the proportion of confounder-positive samples in each group (with ϕ = 0 serving as unconfounded control). Main implanted features are highlighted in green and features implanted for the confounder label are in blue. c Mean observed FDR, observed recall (both calculated after BH-correction), and AUROC (on raw P values) for sample size 200 and the same effect sizes as shown in a) were computed for tested DA methods, using unadjusted and confounder-adjusted test configurations. Error bars indicate standard deviation around the mean for all repeats. d Simulated (log₁₀ relative) abundances plotted by main and confounder labels (see Fig. 1 for definition of abundance quantiles), with both unadjusted and confounder-adjusted significance shown at the top, colored as in c. e *Escherichia* abundance appears naively associated with type 2 diabetes, yet is driven by metformin intake in a subset of diabetics (reproduced from Forslund et al.⁸)
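
The degree of association ϕ between the main label and the confounder label mentioned in this caption is the phi coefficient of their 2×2 contingency table. A minimal sketch follows, with hypothetical label vectors standing in for, e.g., disease and medication status.

```python
import math

def phi_coefficient(a, b):
    """Phi coefficient between two binary (0/1) label vectors."""
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Return 0 when a margin is empty and phi is undefined.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Hypothetical labels: the confounder is enriched among cases (biased subset)
# versus evenly split across groups (unconfounded control).
disease = [0, 0, 0, 0, 1, 1, 1, 1]
drug_biased = [0, 0, 0, 1, 0, 1, 1, 1]
drug_balanced = [0, 0, 1, 1, 0, 0, 1, 1]
print(phi_coefficient(disease, drug_biased),
      phi_coefficient(disease, drug_balanced))  # 0.5 0.0
```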

**Fig. 4**
Linear models are capable of disentangling drug- and disease-associated microbial features. a Regression coefficients from a subset of disease-drug combinations comparing naive linear models to adjusted mixed-effect models for all bacterial taxa. Adjusted models included a second term (either drug intake or disease status for the x- and y-axes, respectively) as a random effect, which diminished the strong linear dependence between naive model coefficients (shown). When the significance of each term was compared between the naive and adjusted models (see the “Methods” section) drug-specific or confounded effects were revealed in some features. b Exemplary subset of features displaying either the largest number of significant disease associations across different drug-adjusted models or the largest reductions in disease coefficient significance upon adjustment (i.e., most confounded). c Comparison of feature classifications (see the “Methods” section) from the metformin- and PPI-adjusted disease association models across all bacterial taxa. Integrating information across models restricts disease associations to a more robust subset and reveals drug-confounded associations. Adjusted T2D regression coefficients are shown in light gray or light brown bars behind species names (indicating enrichment in T2D or control group, respectively)
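
The contrast between naive and adjusted models can be illustrated with a deliberately simplified sketch. Note the assumption: the caption describes mixed-effect models with the second term as a random effect, whereas the code below uses ordinary least squares with the second term as a fixed-effect covariate, on noise-free toy data in which the feature depends only on drug intake. The helper `ols` and all values are hypothetical.

```python
def ols(columns, y):
    """Least-squares coefficients for y ~ 1 + columns, via normal equations."""
    n = len(y)
    X = [[1.0] + [float(col[i]) for col in columns] for i in range(n)]
    k = len(X[0])
    # Build the augmented system (X'X | X'y).
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                A[r] = [vr - A[r][c] * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Toy data (made up): drug intake is more common among cases, and the
# log abundance is shifted by the drug only, not by the disease.
disease = [0, 0, 0, 0, 1, 1, 1, 1]
drug = [0, 0, 0, 1, 0, 1, 1, 1]
log_abundance = [-3.0 + 1.0 * d for d in drug]

naive = ols([disease], log_abundance)           # y ~ disease
adjusted = ols([disease, drug], log_abundance)  # y ~ disease + drug
print("naive disease coefficient:   ", round(naive[1], 6))
print("adjusted disease coefficient:", round(adjusted[1], 6))
print("adjusted drug coefficient:   ", round(adjusted[2], 6))
```

Under these assumptions the naive model attributes a disease coefficient of about 0.5 to a feature that is driven entirely by the drug, while the adjusted model assigns the effect to the drug term and leaves the disease coefficient near zero.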

Cited by

Pelto J, Auranen K, Kujala JV, Lahti L. Elementary methods provide more replicable results in microbial differential abundance analysis. Brief Bioinform. 2025 Mar 4;26(2):bbaf130. doi: 10.1093/bib/bbaf130. PMID: 40135504.
Kohnert E, Kreutz C. Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data. F1000Res. 2025 Jan 2;13:1180. doi: 10.12688/f1000research.155230.2. eCollection 2024. PMID: 39866725.
Sankaran K, Kodikara S, Li JJ, Cao KL. Semisynthetic simulation for microbiome data analysis. Brief Bioinform. 2024 Nov 22;26(1):bbaf051. doi: 10.1093/bib/bbaf051. PMID: 39927858. Review.
Abdi R, Datta S, Zawar A, Kafle P. Evaluation of extended-spectrum β-lactamase producing bacteria in feces of shelter dogs as a biomarker for altered gut microbial taxa and functional profiles. Front Microbiol. 2025 Mar 24;16:1556442. doi: 10.3389/fmicb.2025.1556442. eCollection 2025. PMID: 40196031.
Leech SM, Barrett HL, Dorey ES, Mullins T, Laurie J, Nitert MD. Consensus approach to differential abundance analysis detects few differences in the oral microbiome of pregnant women due to pre-existing type 2 diabetes mellitus. Microb Genom. 2025 Apr;11(4):001385. doi: 10.1099/mgen.0.001385. PMID: 40232948.

References

1. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486:207–14.
2. Voigt AY, et al. Temporal and technical variability of human gut metagenomes. Genome Biol. 2015;16:73.
3. Gevers D, et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe. 2014;15:382–92.
4. Franzosa EA, et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat Microbiol. 2019;4:293–305.
5. Thomas AM, et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat Med. 2019;25:667–78.
