Brief Bioinform. 2025 Mar 4;26(2):bbaf141.

doi: 10.1093/bib/bbaf141.

Evaluation of imputation and imputation-free strategies for differential abundance analysis in metaproteomics data

Xinyi Mou¹, Haoyu Du¹, Guanghua Qiao¹, Jing Li¹

Affiliation

¹ Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China.

PMID: 40254829
PMCID: PMC12009712
DOI: 10.1093/bib/bbaf141

Xinyi Mou et al. Brief Bioinform. 2025.

Abstract

For metaproteomics data derived from the collective protein composition of dynamic multi-organism systems, the proportion of missing values and the dimensionality of the data exceed those observed in single-organism experiments. Consequently, evaluations of differential analysis strategies in other mass spectrometry (MS) data (such as proteomics and metabolomics) may not be directly applicable to metaproteomics data. In this study, we systematically evaluated five imputation methods [sample minimum, quantile regression, k-nearest neighbors (KNN), Bayesian principal component analysis (bPCA), and random forest (RF)] and six imputation-free methods (moderated t-test, two-part t-test, two-part Wilcoxon test, semiparametric differential abundance analysis, differential abundance analysis with Bayes shrinkage estimation of variance method, and Mixture) for differential analysis in simulated metaproteomic datasets based on both data-dependent acquisition (DDA) MS experiments and emerging data-independent acquisition (DIA) experiments. The simulation datasets comprised 588 scenarios that varied the sample size, the fold change between case and control, and the ratio of missing values, both at random and not at random. Compared to imputation-free methods, KNN, bPCA, and RF imputation performed poorly in datasets with a high missingness ratio and large sample size, resulting in a high false-positive risk. We made empirical recommendations based on the balance between sensitivity and control of false positives. The moderated t-test was optimal in scenarios with a large sample size and a low missingness ratio. The two-part Wilcoxon test was recommended in scenarios with a small sample size and a low missingness ratio, or a large sample size and a high missingness ratio. The comprehensive evaluations in our study can provide guidance for differential abundance analysis in metaproteomics.
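The abstract does not spell out how the two-part Wilcoxon test is computed. As a rough illustration only, the sketch below follows the general two-part idea from the statistical literature: one part tests whether the proportion of observed (non-missing) values differs between groups, the other applies a Wilcoxon rank-sum test to the observed values, and the two squared z-statistics are summed and compared with a chi-square distribution. The function name `two_part_wilcoxon`, the use of NaN to mark missing values, and the pooled-proportion z-test are assumptions made for this example and may differ from the implementation evaluated in the study.

```python
import numpy as np
from scipy import stats


def two_part_wilcoxon(x, y):
    """Two-part test for one protein measured in two groups.

    x, y: 1-D arrays of abundances; NaN marks a missing value (assumption).
    Returns (statistic, p_value), or (nan, nan) if neither part is defined.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = x.size, y.size
    obs_x = x[~np.isnan(x)]
    obs_y = y[~np.isnan(y)]
    m1, m2 = obs_x.size, obs_y.size

    stat, df = 0.0, 0

    # Part 1: z-test on the proportion of observed values per group.
    # Skipped when every value is observed or every value is missing.
    p_pool = (m1 + m2) / (n1 + n2)
    if 0.0 < p_pool < 1.0:
        se = np.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n1 + 1.0 / n2))
        z_b = (m1 / n1 - m2 / n2) / se
        stat += z_b ** 2
        df += 1

    # Part 2: Wilcoxon rank-sum z-statistic on the observed values only.
    if m1 > 0 and m2 > 0:
        z_w = stats.ranksums(obs_x, obs_y).statistic
        stat += z_w ** 2
        df += 1

    if df == 0:
        return np.nan, np.nan
    # Sum of squared z-statistics is referred to a chi-square with df degrees of freedom.
    return stat, stats.chi2.sf(stat, df)
```

Calling `two_part_wilcoxon(case_values, control_values)` per protein yields one p-value per protein, which would still need multiple-testing correction. When one part is undefined (for example, no missing values at all), the statistic falls back to the remaining part with one degree of freedom.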

Keywords: differential abundance analysis; imputation missing mechanism; metaproteomics; missing value; two-part statistics.

Figures

**Figure 1**
Process for dataset simulation and statistical methods to be assessed. Two simulations (Simulations 1 and 2) were conducted. The parameters used in each simulation (sample size, fold change, missingness ratio, and MNAR ratio) are displayed in the corresponding column. The simulation process encompasses a non-missing value step (to generate complete data) and a missing-value step (to introduce varying ratios of MNAR and MCAR). Six imputation-free methods and five imputation methods were evaluated in the study.

**Figure 2**
pAUROC of statistical tests for imputation and imputation-free methods across different MNAR ratios. (A) pAUROC in the simulation scenario for the DDA_HMiss dataset (sample size = 100, fold change = 2, missingness ratio = 0.75). (B, C) pAUROC for the simulation scenario of DDA_LMiss and DIA_LMiss datasets. Within each panel, boxplots are categorized into three subcolumns based on the types of statistical methods. For imputation methods, “nonparametric” denotes methods coupled with the Wilcoxon test, and “parametric” denotes methods coupled with the t-test. The Wilcoxon test and t-test applied to data with missing values eliminated are labeled simply Wilcox and T.

**Figure 3**
FPR of statistical tests for imputation and imputation-free methods across different MNAR ratios. (A) FPR in the simulation scenario for the DDA_HMiss dataset (sample size = 100, fold change = 2, missingness ratio = 0.75). (B, C) FPR for the simulation scenario of DDA_LMiss and DIA_LMiss datasets.

**Figure 4**
The bias of effect size compared to complete data in each round of simulation in DDA_HMiss. The differences between the effect size of missing data (labeled as “missing” on the x-axis of each panel) or imputed data (labeled with the imputation method names) and that of the complete data were quantified and visualized. In each panel, scenarios with different MNAR ratios of 0.2, 0.4, 0.6, and 0.8 are displayed. Differentially abundant proteins (labeled as “positive”) and proteins with no differential abundance (labeled as “negative”) were plotted separately.

**Figure 5**
Performance of statistical methods in broad scenarios. (A, B) Statistical methods with the highest average pAUROC and AUPRC across scenarios with varying sample sizes and missingness ratios. The pAUROC and AUPRC values from 100 rounds of simulations across different MNAR ratios and fold changes were averaged. The color in each grid represents the method with the highest pAUROC and AUPRC in the given scenario, with the corresponding value labeled. (C, D) Comparisons of twoWilcox with the methods yielding the highest average pAUROC and AUPRC, and lowest FPR, in scenarios with sample sizes of 40 and 100. The boxplots represent the overall distribution of metrics across different MNAR ratios and fold changes, while the average metric is shown as a red diamond mark. Different methods are represented by different colors. See Supplementary Fig. S5 for comparisons of other sample sizes.

**Figure 6**
Execution time of statistical methods. The boxplots represent the execution time for each method across 100 simulation rounds with a sample size of 200, a missingness ratio of 0.7, and MNAR ratios of 0.2, 0.4, 0.6, and 0.8.
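The Figure 1 caption describes a missing-value step that introduces MNAR (missing not at random) and MCAR (missing completely at random) values at a chosen overall missingness ratio. The exact mechanism is not given in this record, so the following is only a minimal sketch under stated assumptions: the MNAR share is drawn with a probability that decreases linearly with the abundance rank (low values are more likely to go missing), and the MCAR share is drawn uniformly from the remaining entries. The function name `add_missing` and its parameters are hypothetical.

```python
import numpy as np


def add_missing(data, miss_ratio, mnar_ratio, rng=None):
    """Return a copy of `data` with NaN inserted (illustrative sketch).

    miss_ratio: fraction of all entries to set missing.
    mnar_ratio: fraction of the missing entries that are MNAR;
                the remainder are MCAR.
    """
    rng = np.random.default_rng(rng)
    out = np.array(data, dtype=float, order="C")  # work on a copy
    n_total = out.size
    n_miss = int(round(miss_ratio * n_total))
    n_mnar = int(round(mnar_ratio * n_miss))
    n_mcar = n_miss - n_mnar

    # MNAR (assumed mechanism): the lowest value gets the largest weight.
    values = out.ravel()
    rank = values.argsort().argsort()          # 0 = lowest abundance
    weight = (n_total - rank).astype(float)
    mnar_idx = rng.choice(n_total, size=n_mnar, replace=False,
                          p=weight / weight.sum())

    # MCAR: uniform draw from the entries not already removed.
    remaining = np.setdiff1d(np.arange(n_total), mnar_idx)
    mcar_idx = rng.choice(remaining, size=n_mcar, replace=False)

    out.flat[np.concatenate([mnar_idx, mcar_idx])] = np.nan
    return out
```

For example, `add_missing(complete, miss_ratio=0.75, mnar_ratio=0.4, rng=0)` would mask 75% of the entries, 40% of them through the abundance-dependent draw. Other MNAR mechanisms (such as a hard or probabilistic detection threshold) are equally plausible and could be substituted for the rank-based weights.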
