Bioinformatics. 2023 Feb 3;39(2):btad021.
doi: 10.1093/bioinformatics/btad021.

Dealing with dimensionality: the application of machine learning to multi-omics data

Dylan Feldner-Busztin et al. Bioinformatics.

Abstract

Motivation: Machine learning (ML) methods are motivated by the need to automate information extraction from large datasets in order to support human users in data-driven tasks. This is an attractive approach for integrative joint analysis of the vast amounts of omics data produced in next-generation sequencing and other -omics assays. A systematic assessment of the current literature can help to identify key trends and potential gaps in methodology and applications. We surveyed the literature on ML-based multi-omics data integration and quantitatively explored the goals, techniques and data involved in this field. We were particularly interested in examining how researchers use ML to deal with the volume and complexity of these datasets.

Results: Our main finding is that the methods used are those that address the challenges of datasets with few samples and many features. Dimensionality reduction methods are used to reduce the feature count alongside models that can also appropriately handle relatively few samples. Popular techniques include autoencoders, random forests and support vector machines. We also found that the field is heavily influenced by the use of The Cancer Genome Atlas dataset, which is accessible and contains many diverse experiments.

Availability and implementation: All data and processing scripts are available at this GitLab repository: https://gitlab.com/polavieja_lab/ml_multi-omics_review/ or in Zenodo: https://doi.org/10.5281/zenodo.7361807.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1. Overview of paper selection process
Fig. 2. (a) Number of uses of each -omics category in the reviewed papers. (b) Number of -omics used per paper in the reviewed papers. (c) Number of appearances of -omics pairs across papers
Fig. 3. ‘Shape’ of multi-omics datasets. Number of samples (x-axis), number of features (y-axis)
Fig. 4. (a) Number of ML techniques used more than once in the reviewed papers. The publications on ‘machine learning AND multi-omics AND integration’ are plotted in green, while the publications on ‘machine learning AND integration’ are plotted in purple. Significant differences were observed for autoencoders and Cox proportional hazards (Cox PH) (Cox, 1972): Fisher’s exact test gave P-values of <0.0001 in both cases, satisfying the Bonferroni correction for this number of tests. (b) Number of ML goals and (c) labels used for classification in the reviewed multi-omics ML papers
Fig. 5. Number of citations per year of papers published in different years
Fig. 6. Shapes in datasets. (a) A dataset where n ≫ P, the ideal ‘shape’ for many ML techniques. (b) In multi-omics analyses, researchers face very wide datasets, where n ≪ P. (c) Feature selection and extraction are often used to reduce the number of features. In feature selection, a subset of the original features is kept. In feature extraction, features are merged and transformed into a smaller number of new ones
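
The distinction between feature selection and feature extraction drawn in Fig. 6(c) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and is not taken from the paper: the toy matrix `X`, the function names `select_features` and `extract_features`, the variance criterion for selection, and the block-averaging used for extraction (a deliberately simple stand-in for methods such as PCA or autoencoders).

```python
from statistics import pvariance


def select_features(X, k):
    """Feature selection: keep the k ORIGINAL columns with the highest variance."""
    n_features = len(X[0])
    variances = [pvariance([row[j] for row in X]) for j in range(n_features)]
    # Rank columns by variance, take the top k, then restore original column order.
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])
    return [[row[j] for j in keep] for row in X], keep


def extract_features(X, k):
    """Feature extraction: merge columns into k NEW features.

    Toy method (assumption): average contiguous blocks of columns.
    Requires 1 <= k <= number of features.
    """
    n_features = len(X[0])
    # Boundaries that split the column indices into k nearly equal blocks.
    bounds = [i * n_features // k for i in range(k + 1)]
    return [
        [sum(row[bounds[i]:bounds[i + 1]]) / (bounds[i + 1] - bounds[i])
         for i in range(k)]
        for row in X
    ]


# Hypothetical wide dataset: n = 3 samples, P = 6 features (n < P).
X = [
    [1.0, 5.0, 0.0, 2.0, 9.0, 4.0],
    [1.0, 7.0, 0.0, 2.0, 3.0, 4.0],
    [1.0, 6.0, 0.0, 8.0, 6.0, 4.0],
]

selected, kept = select_features(X, 2)   # two of the original columns survive
extracted = extract_features(X, 2)       # two new columns, each a blend of three
```

In this sketch, `selected` contains values that appear unchanged in `X`, so each retained feature keeps its original meaning, whereas each column of `extracted` mixes several original features and no longer corresponds to a single measured variable.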

References

    1. Athreya A. et al. (2018) Augmentation of physician assessments with multi-omics enhances predictability of drug response: a case study of major depressive disorder. IEEE Comput. Intell. Mag., 13, 20–31.
    2. Avsec Ž. et al. (2021) Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods, 18, 1196–1203.
    3. Bahdanau D. et al. (2014) Neural machine translation by jointly learning to align and translate. arXiv, arXiv:1409.0473, preprint: not peer reviewed.
    4. Baker R. et al. (2022) Mechanistic models versus machine learning, a fight worth fighting for the biological community? R. Soc. Biol. Lett., 14(5), 20170660.
    5. Barsi S. et al. (2021) Modeling in systems biology: causal understanding before prediction? Patterns, 2, 100280.
