Genome Biol. 2020 Aug 3;21(1):191.
doi: 10.1186/s13059-020-02104-1.

Assessment of statistical methods from single cell, bulk RNA-seq, and metagenomics applied to microbiome data

Matteo Calgaro et al. Genome Biol.

Abstract

Background: The correct identification of differentially abundant microbial taxa between experimental conditions is a methodological and computational challenge. Recent work has produced methods to deal with the high sparsity and compositionality characteristic of microbiome data, but independent benchmarks comparing these to alternatives developed for RNA-seq data analysis are lacking.

Results: We compare methods developed for single-cell and bulk RNA-seq, and specifically for microbiome data, in terms of suitability of distributional assumptions, ability to control false discoveries, concordance, power, and correct identification of differentially abundant genera. We benchmark these methods using 100 manually curated datasets from 16S and whole metagenome shotgun sequencing.

Conclusions: The multivariate and compositional methods developed specifically for microbiome analysis did not outperform univariate methods developed for differential expression analysis of RNA-seq data. We recommend a careful exploratory data analysis prior to application of any inferential model and we present a framework to help scientists make an informed choice of analysis methods in a dataset-specific manner.

Keywords: Benchmark; Differential abundance; Metagenomics; Microbiome; Single-cell.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Starting from 41 projects collected in 2 manually curated data repositories (the HMP16SData and curatedMetagenomicData Bioconductor packages), 18 16S and 82 WMS datasets were downloaded. Biological samples belonged to several body sites (e.g., oral cavity), body subsites (e.g., tongue dorsum), and conditions (e.g., healthy vs. disease). Feature-per-sample count tables were used to evaluate several objectives: goodness of fit (GOF) for 5 parametric distributions, and type I error control, concordance, and power for 18 differential abundance detection methods. Methods developed for metagenomics, bulk RNA-seq, or single-cell RNA-seq were ranked using empirical evaluations of these objectives.
Fig. 2
a Mean-difference (MD) plot and root mean squared errors (RMSE) for HMP 16S Stool samples. b MD plot and RMSE for HMP WMS Stool samples. c Average rank heatmap for MD performances in HMP 16S datasets, HMP WMS datasets and all other WMS datasets. The value inside each tile refers to the average RMSE value on which ranks are computed. d Zero probability difference (ZPD; see the “Methods” section) plot and RMSE for HMP 16S Stool samples. e ZPD plot and RMSE for HMP WMS Stool samples. f Average rank heatmap for ZPD performances in HMP 16S datasets, HMP WMS datasets, and all other WMS datasets. The value inside each tile refers to the average RMSE value on which ranks are computed
Fig. 3
a Quantile-quantile plot from 0 to 1 and 0 to 0.1 zoom for DA methods in 41 16S HMP stool samples. Average curves for mock comparisons are reported. b Kolmogorov-Smirnov statistic boxplots for DA methods in 41 16S HMP stool samples. c Quantile-quantile plot from 0 to 1 and 0 to 0.1 zoom for DA methods in 41 WMS HMP stool samples. Average curves for mock comparisons are reported. d Kolmogorov-Smirnov statistic boxplots for DA methods in 41 WMS HMP stool samples. e Boxplots of the proportion of raw p values lower than the commonly used thresholds for the nominal α (0.01, 0.05, and 0.1) for 41 16S stool samples. f Boxplots of the proportion of raw p values lower than the commonly used thresholds for the nominal α (0.01, 0.05, and 0.1) for 41 WMS stool samples
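The two summaries in panels b, d, e, and f can be illustrated with a minimal pure-Python sketch: the Kolmogorov-Smirnov distance between a set of raw p values and the Uniform(0, 1) distribution expected under the null, and the proportion of p values below a nominal α. This is not the authors' code; the function names and the example p values are hypothetical.

```python
from typing import Sequence


def ks_statistic_uniform(p_values: Sequence[float]) -> float:
    """KS distance between the empirical CDF of p_values and Uniform(0, 1)."""
    ps = sorted(p_values)
    n = len(ps)
    if n == 0:
        raise ValueError("at least one p value is required")
    d = 0.0
    for i, p in enumerate(ps, start=1):
        # The empirical CDF jumps from (i - 1) / n to i / n at p;
        # the uniform CDF at p is p itself.
        d = max(d, i / n - p, p - (i - 1) / n)
    return d


def proportion_below(p_values: Sequence[float], alpha: float) -> float:
    """Fraction of raw p values strictly below the nominal alpha."""
    if len(p_values) == 0:
        raise ValueError("at least one p value is required")
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical p values from one mock comparison (no true differences).
mock_p_values = [0.1, 0.3, 0.5, 0.7, 0.9]
ks = ks_statistic_uniform(mock_p_values)
observed_rates = {a: proportion_below(mock_p_values, a) for a in (0.01, 0.05, 0.1)}
```

In a mock comparison, a well-calibrated method should give a small KS distance and observed proportions close to each nominal α.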
Fig. 4
a Between-method concordance (BMC) and within-method concordance (WMC) (main diagonal) averaged values from rank 1 to 100 for DA methods evaluated in replicated 16S Tongue Dorsum vs. Stool comparisons. b BMC and WMC (main diagonal) averaged values from rank 1 to 100 for DA methods evaluated in replicated WMS Tongue Dorsum vs. Stool comparisons
Fig. 5
a Boxplot of WMC on high diversity 16S datasets: Tongue Dorsum vs. Stool. Due to the high sparsity and low sample size of the dataset, the Concordance At the Top (CAT) at rank 100 was not computable for corncob methods: it was possible to estimate the model only for a few features. b Boxplot of WMC on high diversity WMS datasets: Tongue Dorsum vs. Stool. c Boxplot of WMC on mid diversity 16S datasets: Buccal Mucosa vs. Attached Keratinized Gingiva. d Boxplot of WMC on mid diversity WMS datasets: Schizophrenic vs. Healthy Control saliva samples. e Boxplot of WMC on low diversity 16S datasets: Supragingival vs. Subgingival plaque. f Boxplot of WMC on low diversity WMS datasets: Colon Rectal Cancer patient vs. Healthy Control stool samples
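The following sketch illustrates a Concordance At the Top curve and its average over ranks. It assumes the usual definition, in which the concordance at rank i is the fraction of features shared by the top i entries of two ranked lists; the paper's exact summary may differ. The function names and taxa labels are hypothetical. It also shows why CAT at a given rank cannot be computed when a method returns fewer ranked features than that rank.

```python
from typing import List, Sequence


def cat_curve(ranked_a: Sequence[str], ranked_b: Sequence[str], max_rank: int) -> List[float]:
    """Concordance at ranks 1..max_rank between two ranked feature lists."""
    if max_rank < 1:
        raise ValueError("max_rank must be at least 1")
    if max_rank > min(len(ranked_a), len(ranked_b)):
        # Mirrors the situation in the caption: too few ranked features.
        raise ValueError("both lists must contain at least max_rank features")
    top_a, top_b = set(), set()
    curve = []
    for i in range(1, max_rank + 1):
        top_a.add(ranked_a[i - 1])
        top_b.add(ranked_b[i - 1])
        # Shared features among the top i of each list, divided by i.
        curve.append(len(top_a & top_b) / i)
    return curve


def mean_concordance(ranked_a: Sequence[str], ranked_b: Sequence[str], max_rank: int) -> float:
    """Average of the CAT curve from rank 1 to max_rank."""
    curve = cat_curve(ranked_a, ranked_b, max_rank)
    return sum(curve) / len(curve)


# Hypothetical rankings of the same method on two random subsets (WMC),
# or of two different methods on the same subset (BMC).
subset_1 = ["taxon_A", "taxon_B", "taxon_C", "taxon_D"]
subset_2 = ["taxon_B", "taxon_A", "taxon_D", "taxon_C"]
concordance = mean_concordance(subset_1, subset_2, max_rank=4)
```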
Fig. 6
38 vs. 38 Supragingival vs. Subgingival Plaque 16S samples. a Barplot of the enrichment tests performed on the DA taxa found by each method using an adjusted p value of 0.1 as threshold for significance (top 10% ranked taxa for songbird). Each bar represents the number of findings, UP or DOWN in Supragingival Plaque compared to Subgingival Plaque, regarding aerobic, anaerobic, and facultative anaerobic taxa metabolism. A Fisher exact test was performed to establish the enrichment significance, represented with significance codes. b Difference between putative true positives (TP) and putative false positives (FP) (y-axis) for several significance thresholds (x-axis). Each threshold represents the top percent ranked taxa, using the ordered raw p value lists as reference (loading values for mixMC and differentials for songbird). c Aerobic metabolism taxa mutually found by 3 or more methods from the subset of the representative methods. d Anaerobic metabolism taxa mutually found by 8 or more methods from the subset of the representative methods
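As a hedged illustration of the enrichment test in panel a, the sketch below computes a one-sided Fisher exact p value from a 2×2 table using the hypergeometric tail. The choice of a one-sided alternative, the table layout, and the counts are assumptions made for this example; the caption does not specify them.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p value for enrichment in the table [[a, b], [c, d]].

    Assumed layout (hypothetical):
        a = DA taxa with the metabolism of interest
        b = DA taxa without it
        c = non-DA taxa with it
        d = non-DA taxa without it
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    row1 = a + b           # number of DA taxa
    col1 = a + c           # number of taxa with the metabolism of interest
    total = a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(col1, x) * comb(total - col1, row1 - x)
        for x in range(a, upper + 1)
    )
    return tail / comb(total, row1)


# Hypothetical counts: 3 of 4 DA taxa are aerobic, 1 of 4 non-DA taxa is aerobic.
p_value = fisher_exact_greater(3, 1, 1, 3)
```

For these hypothetical counts the tail is (16 + 1) / 70, so the enrichment would not be significant at the usual thresholds.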
Fig. 7
Overall method ranking based on 5 evaluation criteria. Average normalized ranks range from 0 to 1, lower values correspond to better performances. The type I error columns are based on the analysis of the 1000 mock comparisons from HMP 16S and WMS Stool datasets; the concordance analysis column is based on the average WMC values across the 100 random subset comparisons for each of the 6 datasets used. The power enrichment analysis and computational time columns are based on the Supragingival vs. Subgingival Plaque 16S dataset evaluations. Each method’s ordering is computed using the first 4 columns. Since the type I error analysis was not available for songbird and mixMC, these methods were not included in the final ranking
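One way to obtain average normalized ranks in the 0 to 1 range is sketched below. The normalization (rank − 1) / (n − 1), the averaging of tied ranks, and the method names and scores are assumptions for illustration only; the caption does not state the exact formula used.

```python
from typing import Dict, List, Tuple


def normalized_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank methods by score (lower is better) and rescale ranks to [0, 1]."""
    methods = sorted(scores, key=scores.get)
    n = len(methods)
    ranks: Dict[str, float] = {}
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied scores; tied methods share the average rank.
        while j + 1 < n and scores[methods[j + 1]] == scores[methods[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[methods[k]] = average_rank
        i = j + 1
    if n == 1:
        return {methods[0]: 0.0}
    return {m: (r - 1) / (n - 1) for m, r in ranks.items()}


def overall_ranking(criteria: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Average the normalized ranks across criteria; best (lowest) first."""
    per_criterion = [normalized_ranks(c) for c in criteria]
    methods = list(criteria[0])
    means = {m: sum(r[m] for r in per_criterion) / len(per_criterion) for m in methods}
    return sorted(means.items(), key=lambda item: item[1])


# Hypothetical scores for two criteria (lower is better in both).
type_one_error = {"method_a": 0.1, "method_b": 0.2, "method_c": 0.3}
concordance_loss = {"method_a": 5.0, "method_b": 1.0, "method_c": 9.0}
ranking = overall_ranking([type_one_error, concordance_loss])
```

A method with no score for one criterion cannot be ranked this way, which is consistent with songbird and mixMC being left out of the final ranking.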
