Genome Biol. 2021 Jun 28;22(1):192.
doi: 10.1186/s13059-021-02400-4.

mbImpute: an accurate and robust imputation method for microbiome data


Ruochen Jiang et al. Genome Biol. 2021.

Abstract

A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability of scientific discoveries. To address this issue, we propose the first imputation method for microbiome data, mbImpute, to identify and recover likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. We demonstrate that mbImpute improves the power of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer, and that mbImpute preserves non-zero distributions of taxa abundances.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
An illustration of mbImpute. After mbImpute identifies likely non-biological zeros, it imputes them (e.g., the abundance of taxon 2 in sample 2) by jointly borrowing information from similar samples, similar taxa, and sample covariates if available (details in Methods)
Fig. 2
mbImpute outperforms state-of-the-art imputation methods designed for non-microbiome data and enhances the identification of differentially abundant (DA) taxa. a Mean squared error (MSE) and b mean Pearson correlation of taxon abundances between the complete data and the zero-inflated data (“No imputation,” the baseline) or the imputed data by each imputation method (mbImpute, softImpute, scImpute, SAVER, MAGIC, and ALRA) in Simulations 1 and 2 (see Additional file 1). c, d For each taxon, the mean and standard deviation (SD) of its abundances are calculated for the complete data, the zero-inflated data, and the imputed data by each imputation method in Simulation 1; c shows the distributions of the taxon mean/SD and the Wasserstein distance between every distribution and the complete distribution; d shows the taxa in two coordinates, mean vs. SD, and the average Euclidean distance between the taxa in every (zero-inflated or imputed) dataset and the complete data in these two coordinates. e Accuracy (precision, recall, and F1 scores) of five DA methods (Wilcoxon rank-sum test, ANCOM, metagenomeSeq, DESeq2-phyloseq, and Omnibus test) with the FDR threshold 0.05 on raw data (light color) and imputed data by mbImpute (dark color) in the 16S data simulation
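The two accuracy measures in panels a and b can be illustrated with a minimal sketch. It assumes that the MSE is averaged over all entries of a samples-by-taxa matrix and that the Pearson correlation is computed per taxon across samples and then averaged; the caption does not spell out these details. The function names and the toy matrices are hypothetical and are not taken from the paper or from the mbImpute package.

```python
from math import sqrt

def mse(complete, other):
    """Mean squared error over all entries of two samples-by-taxa matrices."""
    diffs = [(c - o) ** 2
             for row_c, row_o in zip(complete, other)
             for c, o in zip(row_c, row_o)]
    return sum(diffs) / len(diffs)

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mean_taxon_pearson(complete, other):
    """Average over taxa (columns) of the per-taxon Pearson correlation."""
    cols_c = list(zip(*complete))
    cols_o = list(zip(*other))
    return sum(pearson(c, o) for c, o in zip(cols_c, cols_o)) / len(cols_c)

# Toy data (hypothetical): 4 samples (rows) by 2 taxa (columns).
complete = [[1, 2], [2, 4], [3, 6], [4, 8]]
# Two entries replaced by zeros to mimic non-biological zeros.
zero_inflated = [[1, 2], [0, 4], [3, 0], [4, 8]]

baseline_mse = mse(complete, zero_inflated)
baseline_cor = mean_taxon_pearson(complete, zero_inflated)
print(baseline_mse, baseline_cor)
```

An imputed matrix would be scored the same way against `complete`; a lower MSE and a higher mean correlation than the zero-inflated baseline indicate better recovery.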
Fig. 3
mbImpute empowers DESeq2-phyloseq in identifying DA taxa. a The barplots show classification accuracy, measured by 5-fold cross-validated precision-recall area under the curve (PR-AUC), by the random forest algorithm for predicting samples’ disease conditions in two T2D datasets [18, 19] and four CRC datasets [–17]. The features are the DA taxa detected by DESeq2-phyloseq (light color) or mbImpute-empowered DESeq2-phyloseq (dark color; labeled as mbImpute + DESeq2-phyloseq). b The histograms show the distributions of three taxa in control and T2D samples in [18] before and after mbImpute is applied. The three taxa, Ruminococcus sp_5_1_39BFAA, Ruminococcus callidus, and Ruminococcus albus, are identified as enriched in T2D samples only after imputation. c The histograms show the distributions of three taxa in control and CRC samples in [17] before and after mbImpute is applied. The three taxa, Ruminococcus gnavus, Lachnospiraceae bacterium_2_1_58FAA, and Granulicatella adiacens, are identified as enriched in CRC samples only after imputation. In b and c, adjusted p values calculated by DESeq2-phyloseq are listed
Fig. 4
mbImpute preserves distributional characteristics of taxa’s non-zero abundances. a Top: two scatter plots show the relationship between the abundances of Dorea formicigenerans and Ruminococcus torques in Qin et al.’s control samples [19], with or without using mbImpute as a preceding step. The left plot shows two standard major axis (SMA) regression lines and two corresponding Pearson correlations based on the raw data (black: based on all the samples; blue: based on only the samples where both taxa have non-zero abundances). The right plot shows the SMA regression line (blue) and the Pearson correlation using all the samples in the imputed data. Bottom: two scatter plots for the same two taxa in Qin et al.’s T2D samples [19], with lines and legends defined the same as in the top panel. b Four scatter plots show the SMA regression lines and correlations between Eubacterium siraeum and Ruminococcus obeum in Karlsson et al.’s control and T2D samples [18], with lines and legends defined the same as in a. c Each bar shows the Pearson correlation between taxon-taxon correlations in raw data (light gray) or imputed data (dark gray) using all samples and taxon-taxon correlations in raw data using non-zero samples only. The two correlations are calculated for two T2D datasets and four CRC datasets using diseased samples, control samples, and whole data
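The comparison in panels a and b, between an SMA fit on all samples and one on only the samples where both taxa are non-zero, can be sketched as follows. The sketch uses the standard SMA definition, in which the slope is sign(r) times SD(y)/SD(x) and the line passes through the two means. The function names and the toy abundances are hypothetical and do not come from the paper's datasets.

```python
from math import sqrt, copysign

def sma_fit(x, y):
    """Return (slope, intercept, pearson_r) of a standard major axis fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / sqrt(sxx * syy)
    # SMA slope: ratio of standard deviations, carrying the sign of r.
    slope = copysign(sqrt(syy / sxx), r)
    intercept = my - slope * mx
    return slope, intercept, r

def nonzero_pairs(x, y):
    """Keep only the samples where both taxa have non-zero abundance."""
    kept = [(a, b) for a, b in zip(x, y) if a != 0 and b != 0]
    xs, ys = zip(*kept)
    return list(xs), list(ys)

# Toy abundances (hypothetical) of two taxa across six samples.
taxon_a = [1, 2, 3, 4, 0, 5]
taxon_b = [2, 4, 6, 8, 7, 0]

fit_all = sma_fit(taxon_a, taxon_b)
fit_nonzero = sma_fit(*nonzero_pairs(taxon_a, taxon_b))
print(fit_all, fit_nonzero)
```

In this toy case the two zeros flip the sign of the all-sample correlation, while the non-zero samples lie on a single line, which is the kind of distortion the left-hand plots contrast.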
Fig. 5
mbImpute improves the similarity of taxon-taxon correlations between 16S and WGS data of microbiomes in healthy human stool samples. Four Pearson correlation matrices are calculated based on a common set of genus-level taxa’s abundances in 16S and WGS data, with or without using mbImpute as a preceding step. Before imputation, the Pearson correlation between the two correlation matrices is 0.59, and this correlation increases to 0.64 after imputation. For illustration purposes, each heatmap shows square roots of Pearson correlations, with the bottom 40% of values truncated to 0. The magenta, green, and purple squares highlight three taxon groups, each of which contains strongly correlated taxa and is consistent between the 16S and WGS data after imputation
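A single similarity value between two taxon-taxon correlation matrices, such as the 0.59 and 0.64 reported here, could be computed as in the sketch below. It assumes that only the off-diagonal entries of the upper triangle are compared, which the caption does not state. The function names and the toy abundance matrices are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(data):
    """Taxon-taxon Pearson correlations of a samples-by-taxa matrix."""
    cols = list(zip(*data))
    return [[pearson(a, b) for b in cols] for a in cols]

def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries, row by row (assumed convention)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def matrix_similarity(data_a, data_b):
    """Pearson correlation between two taxon-taxon correlation matrices."""
    return pearson(upper_triangle(correlation_matrix(data_a)),
                   upper_triangle(correlation_matrix(data_b)))

# Toy abundances (hypothetical): 5 samples by 3 shared genus-level taxa.
data_16s = [[1, 2, 5], [2, 4, 3], [3, 6, 4], [4, 8, 1], [5, 9, 2]]
data_wgs = [[2, 1, 6], [3, 3, 4], [5, 5, 5], [6, 8, 2], [8, 9, 1]]

similarity = matrix_similarity(data_16s, data_wgs)
print(similarity)
```

Running the same function on the raw and on the imputed versions of both datasets gives the before and after values that the caption compares.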

References

    1. Katherine RA. An introduction to microbiome analysis for human biology applications. Am J Hum Biol. 2017;29(1):e22931. doi: 10.1002/ajhb.22931.
    2. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444(7122):1027. doi: 10.1038/nature05414.
    3. Samuel BS, Gordon JI. A humanized gnotobiotic mouse model of host–archaeal–bacterial mutualism. Proc Natl Acad Sci. 2006;103(26):10011–6. doi: 10.1073/pnas.0602187103.
    4. Stokholm J, Blaser MJ, Thorsen J, Rasmussen MA, Waage J, Vinding RK, Schoos A-MM, Kunøe A, Fink NR, Chawes BL, et al. Maturation of the gut microbiome and risk of asthma in childhood. Nat Commun. 2018;9(1):1–10. doi: 10.1038/s41467-017-02088-w.
    5. Pragman AA, Kim HB, Reilly CS, Wendt C, Isaacson RE. The lung microbiome in moderate and severe chronic obstructive pulmonary disease. PloS ONE. 2012;7(10):e47305. doi: 10.1371/journal.pone.0047305.
