PLoS One. 2025 May 19;20(5):e0321452.
doi: 10.1371/journal.pone.0321452. eCollection 2025.

Benchmarking Differential Abundance Tests for 16S microbiome sequencing data using simulated data based on experimental templates

Eva Kohnert et al. PLoS One. 2025.

Abstract

Differential abundance (DA) analysis of metagenomic microbiome data is essential for understanding microbial community dynamics across various environments and hosts. Identifying microorganisms that differ significantly in abundance between conditions (e.g., health vs. disease) is crucial for insights into environmental adaptations, disease development, and host health. However, the statistical interpretation of microbiome data is challenged by its inherent sparsity and compositional nature, necessitating tailored DA methods. This benchmarking study aims to simulate synthetic 16S microbiome data using metaSPARSim (Patuzzi I, Baruzzo G, Losasso C, Ricci A, Di Camillo B. MetaSPARSim: a 16S rRNA gene sequencing count data simulator. BMC Bioinformatics. 2019;20:416. https://doi.org/10.1186/s12859-019-2882-6 PMID: 31757204), MIDASim (He M, Zhao N, Satten GA. MIDASim: a fast and simple simulator for realistic microbiome data. Available from: https://doi.org/10.1101/2023.03.23.533996), and sparseDOSSA2 (Ma S, Ren B, Mallick H, Moon YS, Schwager E, Maharjan S, et al. A statistical model for describing and simulating microbial community profiles. PLOS Comput Biol. 2021;17(9):e1008913. https://doi.org/10.1371/journal.pcbi.1008913 PMID: 34516542), leveraging 38 real-world experimental templates (S3 Table) previously utilized in a benchmark study comparing DA tools. These datasets, drawn from diverse environments such as human gut, soil, and marine habitats, serve as the foundation for our simulation efforts. We employ the same 14 DA tests that were previously used with the same experimental data in benchmark studies, alongside 8 DA tests that were developed subsequently. Initially, we will generate synthetic data closely mirroring the experimental datasets, incorporating a known truth to cover a broad range of real-world data characteristics. This approach allows us to assess the ability of DA methods to recover known true differential abundances.
We will further simulate datasets by altering sparsity, effect size, and sample size, thus creating a comprehensive collection for applying the 22 DA tests. The outcomes, focusing on sensitivities and specificities, will provide insights into the performance of DA tests and their dependencies on sparsity, effect size, and sample size. Additionally, we will calculate data characteristics (S1 and S2 Tables) for each simulated dataset and use multiple regression to identify informative data characteristics influencing test performance. Our prior study, in which we used simulated data without incorporating a known truth, demonstrated the feasibility of using synthetic data to validate experimental findings. The current study aims to enhance our understanding by systematically evaluating the impact of known truth incorporation on DA test performance, thereby providing further information for the selection and application of DA methods in microbiome research.
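As a minimal sketch of how sensitivity and specificity can be computed once a known truth is available, the following Python snippet compares per-taxon ground-truth labels with the calls of a DA test. The function name, the boolean encoding, and the toy labels are illustrative assumptions and are not taken from the study.

```python
def sensitivity_specificity(truth, called):
    """Return (sensitivity, specificity) for per-taxon boolean labels.

    truth[i]  -- True if taxon i was simulated as differentially abundant
    called[i] -- True if the DA test reported taxon i as significant
    Both encodings are assumptions made for this illustration.
    """
    if len(truth) != len(called):
        raise ValueError("truth and called must have the same length")
    tp = sum(t and c for t, c in zip(truth, called))
    fn = sum(t and not c for t, c in zip(truth, called))
    tn = sum(not t and not c for t, c in zip(truth, called))
    fp = sum(not t and c for t, c in zip(truth, called))
    # Guard against datasets without any positives or negatives.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy example (hypothetical labels): 3 truly DA taxa, 5 unchanged taxa.
truth = [True, True, True, False, False, False, False, False]
called = [True, True, False, True, False, False, False, False]
sens, spec = sensitivity_specificity(truth, called)
```

In this toy example two of the three truly differential taxa are recovered and one of the five unchanged taxa is falsely called, so the sensitivity is 2/3 and the specificity is 4/5.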

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Preliminary results assessing the similarity of simulated data and corresponding templates. A. Overall similarity of simulated data and templates for metaSPARSim. PCA plot of 46 scaled data characteristics for 38 templates and 10 corresponding simulations. Templates are plotted as squares and simulations as dots in the same colour. B. Accuracy of four representative single data characteristics. Overall magnitudes of visible bias and heterogeneities are highlighted by blue arrows. The left sections in all panels show the natural variability of a specific data characteristic among the templates; here, the log2-ratios of the data characteristic from one template to all others are summarized as a boxplot. The middle sections display the precision of the data characteristic in the simulations compared to the corresponding template. The right sections show log2-ratios of the data characteristic between all simulations belonging to the same template.
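The log2-ratio comparisons described in this caption can be sketched as follows. This is only an illustration under assumptions: the helper names and the example values (a library-size-like characteristic) are hypothetical and do not come from the study.

```python
import math
from itertools import combinations


def log2_ratios_to_reference(reference, values):
    """log2(value / reference) for each value, e.g. simulations vs. template."""
    return [math.log2(v / reference) for v in values]


def pairwise_log2_ratios(values):
    """log2 ratios between all pairs of values, e.g. simulations of one template."""
    return [math.log2(b / a) for a, b in combinations(values, 2)]


# Hypothetical values of one data characteristic.
template_value = 20000.0
simulated_values = [10000.0, 20000.0, 40000.0]

# Bias of the simulations relative to their template (middle section).
sim_vs_template = log2_ratios_to_reference(template_value, simulated_values)

# Spread among simulations of the same template (right section).
within_template = pairwise_log2_ratios(simulated_values)
```

A log2-ratio of 0 means the simulated characteristic matches the template exactly, while values of +1 or -1 correspond to a two-fold over- or underestimation.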
Fig 2. Overview of the data generating mechanism including the dataset selection process. Flowchart summarizing the data generating and selection mechanism throughout the entire workflow of the study.
Fig 3. Illustration of detection of outlier data sets after simulation. Each dot represents the number of non-equivalent data characteristics for a data template. If this number is an outlier in the boxplot, the synthetic data from this template will be removed from the analysis. A. If sparseDOSSA2 produced such an outcome, the synthetic dataset for the template MALL would be removed. B. If metaSPARSim produced such a boxplot, two data templates (Ji_WTP_DS and t1d_alkanani) would be removed from the analysis based on the outlier criterion.
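A boxplot-based removal rule of this kind can be sketched as follows. The caption does not state the exact outlier definition, so the snippet assumes the common Tukey convention (a value above Q3 + 1.5 × IQR counts as an outlier) and flags only unusually high counts; the template names and counts are hypothetical.

```python
import statistics


def high_outlier_templates(counts, whisker=1.5):
    """Return template names whose count lies above Q3 + whisker * IQR.

    counts  -- dict mapping template name to its number of non-equivalent
               data characteristics
    whisker -- 1.5 is the usual boxplot convention (an assumption here)
    """
    q1, _median, q3 = statistics.quantiles(
        counts.values(), n=4, method="inclusive"
    )
    upper_fence = q3 + whisker * (q3 - q1)
    return [name for name, value in counts.items() if value > upper_fence]


# Hypothetical counts of non-equivalent data characteristics per template.
non_equivalent = {"T1": 2, "T2": 3, "T3": 3, "T4": 4, "T5": 5, "T6": 20}
to_remove = high_outlier_templates(non_equivalent)
```

With these hypothetical counts the quartiles are 3.0 and 4.75, the upper fence is 7.375, and only template T6 would be excluded. Different quartile conventions can shift the fence slightly, so the flagged set may depend on the boxplot implementation used.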
Fig 4. Overview of the complete analysis workflow including the data simulation process. The figure provides an overview of the analyses conducted within our study that are described in the following sections.

References

    1. Nearing JT, Douglas GM, Hayes MG, MacDonald J, Desai DK, Allward N, et al. Microbiome differential abundance methods produce different results across 38 datasets. Nat Commun. 2022;13(1):342. doi: 10.1038/s41467-022-28034-z
    2. Kohnert E, Kreutz C. Computational study protocol: leveraging synthetic data to validate a benchmark study for Differential Abundance Tests for 16S microbiome sequencing data. F1000Research. 2025 Jan 2;13:1180.
    3. Patuzzi I, Baruzzo G, Losasso C, Ricci A, Di Camillo B. MetaSPARSim: a 16S rRNA gene sequencing count data simulator. BMC Bioinformatics. 2019;20:416. doi: 10.1186/s12859-019-2882-6
    4. Ma S, Ren B, Mallick H, Moon YS, Schwager E, Maharjan S, et al. A statistical model for describing and simulating microbial community profiles. PLOS Comput Biol. 2021;17(9):e1008913. doi: 10.1371/journal.pcbi.1008913
    5. He M, Zhao N, Satten GA. MIDASim: a fast and simple simulator for realistic microbiome data. doi: 10.1101/2023.03.23.533996