Sci Rep. 2025 Jan 25;15(1):3189.
doi: 10.1038/s41598-025-87256-5.

Characterizing the omics landscape based on 10,000+ datasets

Eva Brombacher et al. Sci Rep. 2025.

Abstract

The characteristics of data produced by omics technologies are pivotal, as they critically influence the feasibility and effectiveness of computational methods applied in downstream analyses, such as data harmonization and differential abundance analyses. Furthermore, variability in these data characteristics across datasets plays a crucial role, leading to diverging outcomes in benchmarking studies, which are essential for guiding the selection of appropriate analysis methods in all omics fields. Additionally, downstream analysis tools are often developed and applied within specific omics communities due to the presumed differences in data characteristics attributed to each omics technology. In this study, we investigate over ten thousand datasets to understand how proteomics, metabolomics, lipidomics, transcriptomics, and microbiome data vary in specific data characteristics. We show patterns of data characteristics specific to the investigated omics types and provide a tool that enables researchers to assess how representative a given omics dataset is of its respective discipline. Moreover, we illustrate how data characteristics can impact analyses, using the example of normalization in the presence of sample-dependent proportions of missing values. Given the variability of omics data characteristics, we encourage the systematic inspection of these characteristics in benchmark studies and for downstream analyses to prevent suboptimal method selection and unintended bias.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
The data types cluster according to the investigated data characteristics. (a) Clustering results obtained using Uniform Manifold Approximation and Projection (UMAP), where each data point represents a dataset. Distinct clusters are formed for scRNA-seq, bulk RNA-seq, microarray, microbiome, metabolomics/lipidomics, and proteomics datasets. (b) Objective of this study and outlook.
Fig. 2
Visualization of how data characteristics contribute to the clustering of different data types in a Uniform Manifold Approximation and Projection (UMAP). The subplots are color-coded according to the values of the investigated data characteristics. Grey points represent datasets for which no information on the respective data characteristic is available.
Fig. 3
For most investigated data characteristics, substantial differences exist between the data types. The vertical red lines represent the median of the medians for each data type. For better comparison, the rank is displayed for ‘Kurtosis’, ‘Skewness’, ‘Lin. coef. of Poly2(Means vs. Vars)(Analytes)’, and ‘Quadr. coef. of Poly2(Means vs. Vars)(Analytes)’.
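
Some of the data characteristics named in this caption can be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' tool: the function name `data_characteristics` is hypothetical, rows are assumed to be analytes and columns samples, and "Poly2(Means vs. Vars)(Analytes)" is read here as a second-degree polynomial fit of per-analyte variances against per-analyte means, which is an interpretation of the label rather than a confirmed definition.

```python
import numpy as np


def data_characteristics(data):
    """Compute a few summary characteristics of an omics matrix.

    data: 2D float array, rows = analytes, columns = samples, NaN = missing.
    All definitions below are illustrative assumptions.
    """
    observed = data[~np.isnan(data)]
    centered = observed - observed.mean()
    sd = observed.std()

    # Moment-based skewness and excess kurtosis over all observed values
    skewness = np.mean(centered**3) / sd**3
    kurtosis = np.mean(centered**4) / sd**4 - 3.0

    # Per-analyte mean and variance, ignoring missing values
    analyte_means = np.nanmean(data, axis=1)
    analyte_vars = np.nanvar(data, axis=1)

    # Degree-2 polynomial fit: vars ~ quad * mean^2 + lin * mean + const
    quad, lin, _ = np.polyfit(analyte_means, analyte_vars, deg=2)

    return {
        "na_fraction": float(np.isnan(data).mean()),
        "skewness": float(skewness),
        "kurtosis": float(kurtosis),
        "poly2_lin_coef": float(lin),
        "poly2_quad_coef": float(quad),
    }


# Toy input: normally distributed values with about 10% missing entries
rng = np.random.default_rng(0)
toy = rng.normal(loc=20.0, scale=2.0, size=(500, 10))
toy[rng.random(toy.shape) < 0.1] = np.nan
chars = data_characteristics(toy)
print(chars)
```

Comparing such values for a new dataset against the distribution per data type is the kind of check the caption's medians would support.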
Fig. 4
Relationship between sample- and batch-dependent detection limits, sample mean-missing value (NA) correlations, and the appropriateness of a normalization step. (a) A dataset with sample-dependent detection limits shows a positive correlation. This positive correlation indicates that normalization is inappropriate – demonstrated by quantile normalization applied to a simulated dataset with sample-dependent detection limits – as it increases sample bias. The red line, representing the 90th percentile, is used as a reference, as it should remain relatively unaffected by bias caused by small intensity values becoming missing values. (b) A dataset with batch-dependent detection limits shows a negative correlation. This negative correlation indicates that normalization is beneficial – demonstrated by quantile normalization applied to a simulated dataset with batch-dependent detection limits – as it decreases sample bias.
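
The positive correlation described in panel (a) can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: the function names, the log-intensity distribution, and the range of detection limits are hypothetical choices made for illustration. Every sample is drawn from the same distribution, and only the detection limit differs between samples, so any difference in the observed sample means is caused by censoring alone.

```python
import numpy as np


def mean_na_correlation(data):
    """Pearson correlation between per-sample mean and per-sample NA fraction.

    data: 2D float array, rows = analytes, columns = samples, NaN = missing.
    """
    sample_means = np.nanmean(data, axis=0)
    na_fraction = np.isnan(data).mean(axis=0)
    return float(np.corrcoef(sample_means, na_fraction)[0, 1])


def simulate_sample_dependent_limits(n_analytes=2000, n_samples=20, seed=0):
    """Simulate identical samples censored at sample-specific detection limits."""
    rng = np.random.default_rng(seed)
    # Same underlying log-intensity distribution for every sample (assumed)
    data = rng.normal(loc=20.0, scale=2.0, size=(n_analytes, n_samples))
    # Hypothetical detection limits, one per sample (column)
    limits = np.linspace(17.0, 21.0, n_samples)
    # Values below a sample's limit become missing
    data[data < limits] = np.nan
    return data


data = simulate_sample_dependent_limits()
r = mean_na_correlation(data)
print(f"sample mean vs. NA fraction correlation: {r:.2f}")
```

In this setup, samples with a higher detection limit lose more of their low values, so both their missing-value fraction and their observed mean increase, which produces the positive correlation. The batch-dependent case in panel (b) is not simulated here.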
