Brief Bioinform. 2025 Mar 4;26(2):bbaf141.
doi: 10.1093/bib/bbaf141.

Evaluation of imputation and imputation-free strategies for differential abundance analysis in metaproteomics data

Xinyi Mou et al. Brief Bioinform.

Abstract

In metaproteomics data, which capture the collective protein composition of dynamic multi-organism systems, both the proportion of missing values and the dimensionality of the data exceed those observed in single-organism experiments. Consequently, evaluations of differential analysis strategies in other mass spectrometry (MS) data (such as proteomics and metabolomics) may not be directly applicable to metaproteomics data. In this study, we systematically evaluated five imputation methods [sample minimum, quantile regression, k-nearest neighbors (KNN), Bayesian principal component analysis (bPCA), random forest (RF)] and six imputation-free methods (moderated t-test, two-part t-test, two-part Wilcoxon test, semiparametric differential abundance analysis, differential abundance analysis with Bayes shrinkage estimation of variance method, and Mixture) for differential analysis in simulated metaproteomic datasets based on both data-dependent acquisition (DDA) MS experiments and emerging data-independent acquisition (DIA) experiments. The simulation datasets comprised 588 scenarios that varied sample size, fold change between case and control, missingness ratio, and the relative proportions of random and nonrandom missing values. Compared to imputation-free methods, KNN, bPCA, and RF imputation performed poorly in datasets with a high missingness ratio and large sample size and resulted in a high false-positive risk. We made empirical recommendations based on the balance between sensitivity and control of false positives. The moderated t-test was optimal in scenarios of large sample size with a low missingness ratio. The two-part Wilcoxon test was recommended in scenarios of small sample size with a low missingness ratio or large sample size with a high missingness ratio. The comprehensive evaluations in our study can provide guidance for differential abundance analysis in metaproteomics.
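To make the idea of a two-part test concrete, the sketch below shows one common formulation, which is an assumption here and not confirmed as the implementation evaluated in the study. It combines a two-proportion z-test on the fraction of observed values with a Wilcoxon rank-sum z-statistic on the observed values, and refers the sum of squares to a chi-square distribution. The function names, the use of `None` for missing values, and the omission of tie and continuity corrections are illustrative choices.

```python
import math


def _midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def two_part_wilcoxon(group1, group2):
    """Two-part test for one protein; None marks a missing value.

    Returns (statistic, degrees_of_freedom, p_value).
    """
    obs1 = [v for v in group1 if v is not None]
    obs2 = [v for v in group2 if v is not None]
    n1, n2 = len(group1), len(group2)
    m1, m2 = len(obs1), len(obs2)
    stat, df = 0.0, 0

    # Part 1: do the groups differ in the proportion of observed values?
    pooled = (m1 + m2) / (n1 + n2)
    if 0 < pooled < 1:
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        stat += ((m1 / n1 - m2 / n2) / se) ** 2
        df += 1

    # Part 2: rank-sum test on observed values (normal approximation,
    # no tie or continuity correction in this simplified sketch).
    if m1 > 0 and m2 > 0:
        ranks = _midranks(obs1 + obs2)
        w = sum(ranks[:m1])
        mean = m1 * (m1 + m2 + 1) / 2
        var = m1 * m2 * (m1 + m2 + 1) / 12
        stat += (w - mean) ** 2 / var
        df += 1

    # Chi-square survival function in closed form for 1 or 2 degrees of freedom.
    if df == 0:
        p = 1.0
    elif df == 1:
        p = math.erfc(math.sqrt(stat / 2))
    else:
        p = math.exp(-stat / 2)
    return stat, df, p
```

For example, `two_part_wilcoxon([1, 2, 3, None, None], [4, 5, 6, 7, 8])` uses both parts (2 degrees of freedom), whereas a protein with no missing values in either group falls back to the rank-sum part alone.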

Keywords: differential abundance analysis; imputation; missing mechanism; metaproteomics; missing value; two-part statistics.

Figures

Figure 1
Process for dataset simulation and the statistical methods assessed. Two simulations (Simulations 1 and 2) were conducted. The parameters used in each simulation (sample size, fold change, missingness ratio, and MNAR ratio) are displayed in the corresponding column. The simulation process comprises a non-missing-value step (to generate complete data) and a missing-value step (to introduce varying ratios of values missing not at random, MNAR, and missing completely at random, MCAR). Six imputation-free methods and five imputation methods were evaluated in the study.
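The missing-value step can be illustrated with a minimal sketch. It assumes a simplified mechanism that may differ from the one used in the study: MNAR values are produced by censoring the lowest intensities, and MCAR values are drawn uniformly from the remaining cells. The function name `introduce_missing` and its parameters are hypothetical.

```python
import random


def introduce_missing(data, missing_ratio, mnar_ratio, seed=0):
    """Return a copy of `data` (a list of rows) with None marking missing cells.

    missing_ratio: fraction of all cells to remove.
    mnar_ratio: fraction of the removed cells that are MNAR (the rest are MCAR).
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    n_missing = round(len(cells) * missing_ratio)
    n_mnar = round(n_missing * mnar_ratio)

    # MNAR (assumed here to be left-censoring): drop the lowest intensities.
    by_value = sorted(cells, key=lambda c: data[c[0]][c[1]])
    mnar_cells = by_value[:n_mnar]

    # MCAR: drop cells uniformly at random from those still observed.
    mcar_cells = rng.sample(by_value[n_mnar:], n_missing - n_mnar)

    out = [list(row) for row in data]
    for i, j in mnar_cells + mcar_cells:
        out[i][j] = None
    return out
```

With a missingness ratio of 0.75 and an MNAR ratio of 0.2, for instance, 15% of all cells would be removed from the low-intensity end and 60% at random.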
Figure 2
pAUROC of statistical tests for imputation and imputation-free methods across different MNAR ratios. (A) pAUROC in the simulation scenario for the DDA_HMiss dataset (sample size = 100, fold change = 2, missingness ratio = 0.75). (B, C) pAUROC for the simulation scenario of DDA_LMiss and DIA_LMiss datasets. Within each panel, boxplots are grouped into three subcolumns by type of statistical method. For imputation methods, “nonparametric” denotes methods coupled with the Wilcoxon test, and “parametric” denotes methods coupled with the t-test. The Wilcoxon test and t-test applied to data with missing values removed are labeled simply as Wilcox and T.
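As a rough illustration of the metric, the sketch below computes a partial area under the ROC curve restricted to a low false-positive-rate range. The cutoff `max_fpr=0.1`, the function name, and the lack of normalization are assumptions, since the caption does not state how pAUROC was defined in the study.

```python
def partial_auroc(scores, labels, max_fpr=0.1):
    """Area under the ROC curve for FPR in [0, max_fpr].

    scores: higher means more likely differentially abundant.
    labels: 1 for truly differential proteins, 0 otherwise.
    The result is not normalized, so a perfect ranking gives max_fpr.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])

    area = 0.0
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    i = 0
    while i < len(pairs):
        # Process all items sharing one score together so ties form one ROC point.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        if fpr >= max_fpr:
            # Interpolate the curve linearly at the cutoff and stop.
            frac = (max_fpr - prev_fpr) / (fpr - prev_fpr)
            tpr_cut = prev_tpr + (tpr - prev_tpr) * frac
            return area + (max_fpr - prev_fpr) * (prev_tpr + tpr_cut) / 2
        area += (fpr - prev_fpr) * (prev_tpr + tpr) / 2  # trapezoid rule
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return area
```

In practice the score for each protein could be, for example, one minus its p-value from a given test, so that methods are compared on how well they rank true positives ahead of the first false positives.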
Figure 3
FPR of statistical tests for imputation and imputation-free methods across different MNAR ratios. (A) FPR in the simulation scenario for the DDA_HMiss dataset (sample size = 100, fold change = 2, missingness ratio = 0.75). (B, C) FPR for the simulation scenario of DDA_LMiss and DIA_LMiss datasets.
Figure 4
Bias of effect size relative to complete data in each round of simulation in DDA_HMiss. The difference between the effect size from missing data (labeled “missing” on the x-axis of each panel) or imputed data (labeled with the imputation method names) and that from the complete data was quantified and visualized. Each panel shows scenarios with MNAR ratios of 0.2, 0.4, 0.6, and 0.8. Differentially abundant proteins (labeled “positive”) and proteins with no difference in abundance (labeled “negative”) are plotted separately.
Figure 5
Performance of statistical methods in broad scenarios. (A, B) Statistical methods with the highest average pAUROC and AUPRC across scenarios with varying sample sizes and missingness ratios. The pAUROC and AUPRC values from 100 rounds of simulations across different MNAR ratios and fold changes were averaged. The color of each grid cell represents the method with the highest pAUROC and AUPRC in the given scenario, with the corresponding value labeled. (C, D) Comparisons of twoWilcox with the methods yielding the highest average pAUROC and AUPRC, and the lowest FPR, in scenarios with sample sizes of 40 and 100. The boxplots represent the overall distribution of metrics across different MNAR ratios and fold changes, and the average metric is shown as a red diamond. Different methods are represented by different colors. See Supplementary Fig. S5 for comparisons of other sample sizes.
Figure 6
Execution time of statistical methods. The boxplots represent the execution time for each method across 100 simulation rounds with a sample size of 200, a missingness ratio of 0.7, and MNAR ratios of 0.2, 0.4, 0.6, and 0.8.
