Sci Rep. 2025 Jan 25;15(1):3189.

doi: 10.1038/s41598-025-87256-5.

Characterizing the omics landscape based on 10,000+ datasets

Eva Brombacher ¹ ² ³ ⁴, Oliver Schilling ⁵ ⁶ ⁷, Clemens Kreutz ⁸ ⁹

Affiliations

¹ Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg, Germany.
² Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany.
³ Spemann Graduate School of Biology and Medicine (SGBM), University of Freiburg, Freiburg, Germany.
⁴ Faculty of Biology, University of Freiburg, Freiburg, Germany.
⁵ Institute for Surgical Pathology, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany.
⁶ German Cancer Consortium (DKTK) and German Cancer Research Center (DKFZ), Heidelberg, Germany.
⁷ BIOSS Centre for Biological Signaling Studies, University of Freiburg, Freiburg, Germany.
⁸ Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg, Germany. clemens.kreutz@uniklinik-freiburg.de.
⁹ Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany. clemens.kreutz@uniklinik-freiburg.de.

PMID: 39863642
PMCID: PMC11762699
DOI: 10.1038/s41598-025-87256-5

Eva Brombacher et al. Sci Rep. 2025.

Abstract

The characteristics of data produced by omics technologies are pivotal because they critically influence the feasibility and effectiveness of computational methods applied in downstream analyses, such as data harmonization and differential abundance analysis. Variability in these data characteristics across datasets also plays a crucial role: it leads to diverging outcomes in benchmarking studies, which are essential for guiding the selection of appropriate analysis methods in all omics fields. In addition, downstream analysis tools are often developed and applied within specific omics communities because of presumed differences in the data characteristics of each omics technology. In this study, we investigate more than ten thousand datasets to understand how proteomics, metabolomics, lipidomics, transcriptomics, and microbiome data vary in specific data characteristics. We identify patterns of data characteristics specific to the investigated omics types and provide a tool that enables researchers to assess how representative a given omics dataset is of its respective discipline. Moreover, using normalization in the presence of sample-dependent proportions of missing values as an example, we illustrate how data characteristics can affect analyses. Given the variability of omics data characteristics, we encourage the systematic inspection of these characteristics in benchmark studies and downstream analyses to prevent suboptimal method selection and unintended bias.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

**Fig. 1**
The data types cluster according to the investigated data characteristics. (a) Clustering results obtained using Uniform Manifold Approximation and Projection (UMAP), where each data point represents a dataset. Distinct clusters are formed for scRNA-seq, bulk RNA-seq, microarray, microbiome, metabolomics/lipidomics, and proteomics datasets. (b) Objective of this study and outlook.

**Fig. 2**
Visualization of how data characteristics contribute to the clustering of different data types in a Uniform Manifold Approximation and Projection (UMAP). The subplots are color-coded according to the values of the investigated data characteristics. Grey points represent datasets for which no information on the respective data characteristic is available.

**Fig. 3**
For most investigated data characteristics, substantial differences exist between the data types. The vertical red lines represent the median of the medians for each data type. For better comparison, the rank is displayed for ‘Kurtosis’, ‘Skewness’, ‘Lin. coef. of Poly2(Means vs. Vars)(Analytes)’, and ‘Quadr. coef. of Poly2(Means vs. Vars)(Analytes)’.
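The last two characteristics named in Fig. 3 describe the mean-variance relationship of the analytes. As a minimal sketch, assuming they are the linear and quadratic coefficients of a second-degree polynomial fitted to per-analyte variances against per-analyte means (the caption gives only the labels, not the definitions), such characteristics could be computed as follows. The function name `data_characteristics`, the dictionary keys, and the simulated Poisson counts are hypothetical and serve only as an illustration; this is not the authors' tool.

```python
import numpy as np


def data_characteristics(data):
    """Compute a few summary characteristics of an analytes x samples matrix.

    Missing values are expected as NaN. The definitions used here are
    assumptions made for illustration, not the definitions from the paper.
    """
    values = data[~np.isnan(data)]
    centered = values - values.mean()
    sd = values.std()

    # Per-analyte (row-wise) means and variances, ignoring missing values.
    analyte_means = np.nanmean(data, axis=1)
    analyte_vars = np.nanvar(data, axis=1, ddof=1)

    # np.polyfit returns coefficients from the highest degree downwards.
    quad, lin, _intercept = np.polyfit(analyte_means, analyte_vars, deg=2)

    return {
        "skewness": np.mean(centered**3) / sd**3,
        "kurtosis": np.mean(centered**4) / sd**4 - 3.0,  # excess kurtosis
        "poly2_lin_coef": lin,
        "poly2_quad_coef": quad,
    }


# Hypothetical count data: 2000 analytes, 30 samples, Poisson noise.
rng = np.random.default_rng(1)
rates = rng.uniform(5.0, 100.0, size=(2000, 1))
counts = rng.poisson(rates, size=(2000, 30)).astype(float)

characteristics = data_characteristics(counts)
for name, value in characteristics.items():
    print(f"{name}: {value:.3f}")
```

For Poisson counts the variance equals the mean, so the fitted linear coefficient should be close to 1 and the quadratic coefficient close to 0. Data with a different noise structure would be expected to give different coefficients.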

**Fig. 4**
Relationship between sample- and batch-dependent detection limits, sample mean-missing value (NA) correlations, and the appropriateness of a normalization step. (a) A dataset with sample-dependent detection limits shows a positive correlation. This positive correlation indicates that normalization is inappropriate – demonstrated by quantile normalization applied to a simulated dataset with sample-dependent detection limits – as it increases sample bias. The red line, representing the 90th percentile, is used as a reference, as it should remain relatively unaffected by bias caused by small intensity values becoming missing values. (b) A dataset with batch-dependent detection limits shows a negative correlation. This negative correlation indicates that normalization is beneficial – demonstrated by quantile normalization applied to a simulated dataset with batch-dependent detection limits – as it decreases sample bias.
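The argument in panel (a) can be sketched with a small simulation. This is an illustration under stated assumptions, not the authors' simulation: the abundance distribution, the detection limits, and the helper names `mean_na_correlation` and `quantile_normalize` are hypothetical, and quantile normalization is implemented in a simple form that maps the observed values of each sample onto the average quantile function and leaves missing values in place.

```python
import numpy as np

rng = np.random.default_rng(0)
n_analytes, n_samples = 2000, 12

# Ground truth: every sample measures the same analyte abundances plus noise,
# so the simulated data contain no true sample bias.
analyte_level = rng.normal(20.0, 2.0, size=(n_analytes, 1))
truth = analyte_level + rng.normal(0.0, 0.3, size=(n_analytes, n_samples))

# Hypothetical sample-dependent detection limits: values below the limit of
# their sample become missing (NaN).
detection_limit = np.linspace(17.0, 21.0, n_samples)
data = np.where(truth < detection_limit, np.nan, truth)


def mean_na_correlation(x):
    """Pearson correlation between sample means and sample NA fractions."""
    sample_means = np.nanmean(x, axis=0)
    na_fraction = np.isnan(x).mean(axis=0)
    return np.corrcoef(sample_means, na_fraction)[0, 1]


def quantile_normalize(x, n_grid=1000):
    """Map observed values of each sample onto the mean quantile function."""
    probs = (np.arange(n_grid) + 0.5) / n_grid
    # Reference distribution: average of the per-sample quantiles.
    target = np.nanquantile(x, probs, axis=0).mean(axis=1)
    out = np.full(x.shape, np.nan)
    for j in range(x.shape[1]):
        observed = ~np.isnan(x[:, j])
        ranks = np.argsort(np.argsort(x[observed, j]))
        p = (ranks + 0.5) / observed.sum()
        out[observed, j] = np.interp(p, probs, target)
    return out


corr = mean_na_correlation(data)
normalized = quantile_normalize(data)

# Per-sample bias: mean deviation of the observed values from the ground truth.
bias_before = np.nanmean(data - truth, axis=0)
bias_after = np.nanmean(normalized - truth, axis=0)

print(f"sample mean vs. NA fraction correlation: {corr:.2f}")
print("bias before normalization:", np.round(bias_before, 2))
print("bias after normalization: ", np.round(bias_after, 2))
```

Because the simulated samples carry no true bias, any per-sample shift after normalization is introduced by the normalization itself. In this sketch, samples with higher detection limits are pulled downwards and samples with lower detection limits are pushed upwards.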

Cited by

Evaluation of normalization strategies for mass spectrometry-based multi-omics datasets.
Tseng CY, Salguero JA, Breidenbach JD, Solomon E, Sanders CK, Harvey T, Thornhill MG, Palmisano SJ, Sasiene ZJ, Blackwell BR, McBride EM, Luchini KA, LeBrun ES, Alvarez M, Mach PM, Rivera ES, Glaros TG. Metabolomics. 2025 Jul 1;21(4):98. doi: 10.1007/s11306-025-02297-1. PMID: 40593232.
