[Preprint]. 2025 May 30:2025.03.24.644996.
doi: 10.1101/2025.03.24.644996.

omicsGMF: a multi-tool for dimensionality reduction, batch correction and imputation applied to bulk- and single cell proteomics data

Alexandre Segers et al. bioRxiv.

Abstract

The unprecedented speed and sensitivity of mass spectrometry (MS) have unlocked large-scale applications of proteomics and have even enabled proteome profiling of single cells. However, this fast-evolving field is hindered by a lack of scalable dimensionality reduction tools that can compensate for substantial batch effects and missingness across MS runs. Therefore, we present omicsGMF, a fast, scalable, and interpretable matrix factorization method tailored for bulk and single-cell proteomics data. Unlike current workflows that sequentially apply imputation, batch correction, and principal component analysis, omicsGMF integrates these steps into a unified framework, dramatically enhancing data processing and dimensionality reduction. Additionally, omicsGMF provides robust imputation of missing values, outperforming bespoke state-of-the-art imputation tools. We further demonstrate how this integrated approach increases statistical power to detect differentially abundant proteins in the downstream data analysis. Hence, omicsGMF is a highly scalable approach to dimensionality reduction in proteomics that dramatically improves many important steps in proteomics data analysis.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Issues with conventional multistep workflows for dimensionality reduction upon imputation.
Panel A shows data from the label-free, single-cell Petrosius study [4], where the clusters of mouse embryonic stem cells treated with and without inhibitor largely overlap, although the treatment is expected to change the proteome considerably. Panel B shows a PCA plot of the labeled single-cell Leduc dataset [3], highlighting that batch effects are the main source of variability. They overwhelm the variability associated with the melanoma B subpopulation and render the first dimensions uninformative for clustering cell types. Panel C shows data from the label-free, bulk CPTAC spike-in study [16], in which 48 human UPS proteins were spiked in at five different concentrations in a yeast background. The experimental conditions with the lowest spike-in concentrations (Conditions A and B) cannot be separated by the conventional multistep workflow. All low-dimensional visualizations were obtained with state-of-the-art CF imputation [13] followed by PCA.
Fig. 2. Schematic overview of the omicsGMF model for the Gaussian Model family.
Y is modeled as a function of known sample-level covariates X, feature-level covariates Z, latent factors U and their loadings V. omicsGMF iteratively estimates the parameters B, Γ, U and V. omicsGMF addresses missing values by re-imputing them in each iteration with their current mean μt. After correcting for known covariates, the latent factors have a similar interpretation to principal components, and thus allow for dimensionality reduction and visualization. omicsGMF can also provide the imputed values upon convergence, which are useful for downstream applications. Furthermore, omicsGMF supports model selection, which can guide the user in choosing the number of latent factors and the known covariates to include in the model. More details can be found in the Methods section.
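
The model described in this caption can be written compactly as follows. This is only a sketch: the additive form of the mean, the matrix dimensions, and the iteration superscript $(t)$ are assumptions made for illustration and are not stated on this page. Only the symbols Y, X, Z, B, Γ, U, V and μ come from the caption.

```latex
% Assumed shapes (illustrative, not from the source):
%   Y: n x m (samples x features),  X: n x p,  B: m x p,
%   Z: m x q,  \Gamma: n x q,  U: n x d,  V: m x d.
\begin{align}
  % Gaussian family with identity link: the mean is the linear predictor.
  \mathbb{E}[Y] = \mu &= X B^\top + \Gamma Z^\top + U V^\top \\
  % Re-imputation at iteration t: keep observed entries, fill missing
  % entries with their current estimated mean.
  \tilde{y}_{ij}^{(t)} &=
  \begin{cases}
    y_{ij}           & \text{if } y_{ij} \text{ is observed,} \\
    \mu_{ij}^{(t)}   & \text{if } y_{ij} \text{ is missing.}
  \end{cases}
\end{align}
```

Under this reading, the term $U V^\top$ plays the role of the principal component scores and loadings, while $X B^\top$ and $\Gamma Z^\top$ absorb the known sample-level and feature-level effects.
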
Fig. 3. Low-dimensional visualization of proteomics data.
omicsGMF estimates latent factors that have a similar interpretation to principal components from regular PCA. These can be used for a low-dimensional visualization of proteomics data and are compared to PCA plots after CF and KNN imputation of missing data. Panel A shows the results for the Petrosius [4] dataset, colored by inhibitor treatment. Panel B shows different cell types from the Leduc [3] dataset. Here, omicsGMF directly accounts for known batch effects, resulting in a better representation of the biological signal compared to PCA after CF and KNN imputation. Panels C and D show CPTAC data [16] from all labs and upon exclusion of Lab 1, respectively (Lab 1 was known to suffer from ionization issues). Samples are colored by the spike-in concentration of human proteins, with A the lowest and E the highest spike-in concentration. Distinct marker shapes indicate the different labs.
Fig. 4. Cross-validation with omicsGMF allows for comprehensive selection of the number of latent factors for dimensionality reduction.
Each panel shows the mean of the out-of-sample deviances over three cross-validation folds as a function of the number of latent factors included in the model. In each fold, 30% of the values are masked for out-of-sample prediction. Panel A shows the cross-validation results for the Petrosius dataset [4] with and without accounting for the treatment effect (one dummy variable). Panel B shows the cross-validation results for the Leduc dataset [3], with and without correcting for the known batch effect associated with multiplexing cells in the same run (142 dummy variables). Panel C shows the cross-validation results for the CPTAC data [16], considering all three labs. Results are shown for omicsGMF without known covariates, accounting for the lab effects (two dummy variables), and accounting for both the lab (two dummy variables) and spike-in concentration effects (four dummy variables).
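
The masking scheme in this caption can be illustrated with a minimal Python sketch. It is not the omicsGMF implementation: the function names are hypothetical, the predictor is a simple column-mean stand-in rather than a matrix factorization, the Gaussian deviance is taken as the mean squared error on the held-out entries, and each fold draws an independent random mask (the page does not say whether masks are disjoint across folds).

```python
import random


def mask_entries(Y, frac, rng):
    """Hide a fraction of the observed entries (None marks a missing value).

    Returns the training copy and the list of held-out (row, col) positions.
    """
    observed = [(i, j) for i, row in enumerate(Y)
                for j, v in enumerate(row) if v is not None]
    n_mask = max(1, int(round(frac * len(observed))))
    held_out = rng.sample(observed, n_mask)
    train = [row[:] for row in Y]
    for i, j in held_out:
        train[i][j] = None
    return train, held_out


def column_mean_predict(train):
    """Stand-in predictor (NOT omicsGMF): each entry is predicted by the
    mean of the observed values in its column (0.0 if none are observed)."""
    n_cols = len(train[0])
    means = []
    for j in range(n_cols):
        col = [row[j] for row in train if row[j] is not None]
        means.append(sum(col) / len(col) if col else 0.0)
    return [[means[j] for j in range(n_cols)] for _ in train]


def cv_deviance(Y, fit_predict, n_folds=3, frac=0.3, seed=0):
    """Mean out-of-sample deviance (here: mean squared error) over folds."""
    rng = random.Random(seed)
    deviances = []
    for _ in range(n_folds):
        train, held_out = mask_entries(Y, frac, rng)
        pred = fit_predict(train)
        sq_err = sum((Y[i][j] - pred[i][j]) ** 2 for i, j in held_out)
        deviances.append(sq_err / len(held_out))
    return sum(deviances) / n_folds


# Toy intensity matrix (samples x features) with one genuinely missing value.
Y_toy = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, None]]
score = cv_deviance(Y_toy, column_mean_predict)
```

To choose the number of latent factors in this spirit, one would call `cv_deviance` once per candidate model, passing a `fit_predict` that fits that model, and compare the resulting scores.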
Fig. 5. omicsGMF imputation accounts for missingness due to low abundance.
Distributions of peptide intensities from human spike-in proteins are shown as a function of the spike-in condition for the CPTAC data, excluding Lab 1, which suffered from ionization issues. The first panel shows the distribution of observed values, and the other panels show the distributions of intensities imputed by omicsGMF, DAE, VAE, CF and KNN imputation, respectively. The grey dashed line represents the median of the observed values, and the black dashed line the median of the imputed values.
Fig. 6. omicsGMF imputation leads to better downstream differential abundance analysis.
Performance evaluation of differential abundance analyses using msqrob2 [25, 26] on the CPTAC dataset [16]. Results are shown for the comparisons between the lowest spike-in concentrations: B versus A, C versus A and C versus B. Data from Lab 1 are excluded due to ionization issues. Human UPS proteins are differentially spiked between the conditions, with yeast background proteins serving as true negative controls. Panel A shows the true positive rate (TPR) as a function of the false discovery proportion (FDP). The dots on each curve represent the working points when the FDR is controlled at the nominal 5% level. Panel B shows the log2 fold changes (FC) estimated by msqrob2 for both the human spike-in proteins and the reference yeast proteins. The grey line indicates the known log2 FC.
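
A TPR versus FDP curve of the kind shown in Panel A can be computed from a ranked list of proteins once the ground truth is known. The sketch below is illustrative only: the function name and the toy p-values are hypothetical and are not msqrob2 output. Human spike-in proteins are treated as true positives and yeast proteins as true negatives, as in the caption.

```python
def tpr_fdp_curve(p_values, is_true_positive):
    """Return [(fdp, tpr), ...] after each additional protein is called
    significant, walking from the smallest to the largest p-value.

    Ties in p-values are broken by input order.
    """
    n_pos = sum(is_true_positive)
    if n_pos == 0:
        raise ValueError("at least one true positive is required")
    order = sorted(range(len(p_values)), key=lambda k: p_values[k])
    curve, tp, fp = [], 0, 0
    for k in order:
        if is_true_positive[k]:
            tp += 1  # a human spike-in protein was called
        else:
            fp += 1  # a yeast background protein was called
        curve.append((fp / (tp + fp), tp / n_pos))
    return curve


# Hypothetical example: two human (True) and two yeast (False) proteins.
p_toy = [0.001, 0.02, 0.03, 0.5]
truth_toy = [True, False, True, False]
curve = tpr_fdp_curve(p_toy, truth_toy)
```

Each point gives the fraction of false calls among all calls so far (FDP) and the fraction of spiked proteins recovered so far (TPR). A better method reaches a high TPR while the FDP stays low.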

References

    1. Bader J.M., Geyer P.E., Müller J.B., Strauss M.T., Koch M., Leypoldt F., Koertvelyessy P., Bittner D., Schipke C.G., Incesoy E.I., Peters O., Deigendesch N., Simons M., Jensen M.K., Zetterberg H., Mann M.: Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer’s disease. Molecular Systems Biology 16(6), 9356 (2020). doi: 10.15252/msb.20199356
    2. Niu L., Thiele M., Geyer P.E., Rasmussen D.N., Webel H.E., Santos A., Gupta R., Meier F., Strauss M., Kjaergaard M., Lindvig K., Jacobsen S., Rasmussen S., Hansen T., Krag A., Mann M.: Noninvasive proteomic biomarkers for alcohol-related liver disease. Nature Medicine 28(6), 1277–1287 (2022). doi: 10.1038/s41591-022-01850-y
    3. Leduc A., Huffman R.G., Cantlon J., Khan S., Slavov N.: Exploring functional protein covariation across single cells using nPOP. Genome Biology 23(1), 261 (2022). doi: 10.1186/s13059-022-02817-5
    4. Petrosius V., Aragon-Fernandez P., Üresin N., Kovacs G., Phlairaharn T., Furtwängler B., Op De Beeck J., Skovbakke S.L., Goletz S., Thomsen S.F., Keller U.a.d., Natarajan K.N., Porse B.T., Schoof E.M.: Exploration of cell state heterogeneity using single-cell proteomics through sensitivity-tailored data-independent acquisition. Nature Communications 14(1), 5910 (2023). doi: 10.1038/s41467-023-41602-1
    5. Vanderaa C., Gatto L.: Replication of single-cell proteomics data reveals important computational challenges. Expert Review of Proteomics 18(10), 835–843 (2021). doi: 10.1080/14789450.2021.1988571
