Bioinformatics. 2023 Feb 3;39(2):btad021.

doi: 10.1093/bioinformatics/btad021.

Dealing with dimensionality: the application of machine learning to multi-omics data

Dylan Feldner-Busztin¹, Panos Firbas Nisantzis¹, Shelley Jane Edmunds², Gergely Boza³, Fernando Racimo⁴, Shyam Gopalakrishnan², Morten Tønsberg Limborg², Leo Lahti⁵, Gonzalo G de Polavieja¹

Affiliations

¹ Champalimaud Centre for the Unknown, Champalimaud Foundation, 1400-038 Lisbon, Portugal.
² Center for Evolutionary Hologenomics, GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, 1353 Copenhagen, Denmark.
³ Centre for Ecological Research, 1113 Budapest, Hungary.
⁴ Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark.
⁵ Department of Computing, University of Turku, 20014 Turku, Finland.

PMID: 36637211
PMCID: PMC9907220
DOI: 10.1093/bioinformatics/btad021

Dylan Feldner-Busztin et al. Bioinformatics. 2023.

Abstract

Motivation: Machine learning (ML) methods are motivated by the need to automate information extraction from large datasets in order to support human users in data-driven tasks. This is an attractive approach for integrative joint analysis of vast amounts of omics data produced in next-generation sequencing and other -omics assays. A systematic assessment of the current literature can help to identify key trends and potential gaps in methodology and applications. We surveyed the literature on ML multi-omic data integration and quantitatively explored the goals, techniques and data involved in this field. We were particularly interested in examining how researchers use ML to deal with the volume and complexity of these datasets.

Results: Our main finding is that the methods used are those that address the challenges of datasets with few samples and many features. Dimensionality reduction methods are used to reduce the feature count alongside models that can also appropriately handle relatively few samples. Popular techniques include autoencoders, random forests and support vector machines. We also found that the field is heavily influenced by the use of The Cancer Genome Atlas dataset, which is accessible and contains many diverse experiments.

Availability and implementation: All data and processing scripts are available at this GitLab repository: https://gitlab.com/polavieja_lab/ml_multi-omics_review/ or in Zenodo: https://doi.org/10.5281/zenodo.7361807.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Overview of paper selection process

**Fig. 2.**
(a) Number of uses of each -omics category in the reviewed papers. (b) Number of -omics categories used per paper in the reviewed papers. (c) Number of appearances of each -omics pair across papers

**Fig. 3.**
‘Shape’ of multi-omics datasets. Number of samples (x-axis), number of features (y-axis)

**Fig. 4.**
(a) Number of ML techniques used more than once in the reviewed papers. Publications matching ‘machine learning AND multi-omics AND integration’ are plotted in green, and publications matching ‘machine learning AND integration’ are plotted in purple. Significant differences were observed for autoencoders and Cox proportional hazards (Cox PH) (Cox, 1972): Fisher’s exact test gave P-values of <0.0001 in both cases, which remain significant after Bonferroni correction for this number of tests. Number of (b) ML goals and (c) labels used for classification in the reviewed multi-omics ML papers
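The comparison described in the Fig. 4 caption (Fisher's exact test per technique, followed by a Bonferroni correction) can be sketched roughly as below. This is a minimal illustration and not the authors' script: the function names `fisher_exact_two_sided` and `bonferroni_significant` are hypothetical, and the counts in the demo are invented purely to exercise the code, not taken from the paper.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties are not lost to rounding error.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag P-values that stay below alpha after dividing it by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Invented counts (assumption, NOT from the paper): papers that do / do not
# use a given technique in the two literature searches.
p_technique = fisher_exact_two_sided(30, 70, 5, 195)
print(p_technique, bonferroni_significant([p_technique, 0.04, 0.2]))
```

With one test per technique, the Bonferroni threshold shrinks as more techniques are compared, which is why the caption notes that the reported P-values still pass it.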

**Fig. 5.**
Number of citations per year for papers published in different years

**Fig. 6.**
Shapes in datasets. (a) A dataset where n ≫ P, the ideal ‘shape’ for many ML techniques. (b) In multi-omics analyses, researchers face very wide datasets, where n ≪ P. (c) Feature selection and extraction are often used to reduce the number of features. In feature selection, a subset of the original features is kept. In feature extraction, features are merged and transformed into a smaller number of new ones
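The difference between feature selection and feature extraction described in the Fig. 6 caption can be illustrated with a small sketch. It is only one possible instance of each idea, under stated assumptions: selection is done here by keeping the highest-variance columns, and extraction by projecting onto the first principal axis found with power iteration. The function names and the toy data are hypothetical and do not come from the paper.

```python
import random
import statistics


def select_top_variance(X, k):
    """Feature selection: keep the k original columns with the highest variance."""
    n_features = len(X[0])
    variances = [statistics.pvariance([row[j] for row in X])
                 for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])
    return [[row[j] for j in keep] for row in X], keep


def extract_first_component(X, n_iter=200, seed=0):
    """Feature extraction: merge all columns into one new feature by
    projecting each centered sample onto the first principal axis."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(n_iter):
        # Power iteration on Xc^T Xc without building the P x P matrix,
        # which matters when P is much larger than n.
        scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
        w = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break  # constant data: no direction of variation to find
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(p)) for r in Xc]


# Toy 'wide' dataset (n = 6 samples, P = 40 features), randomly generated.
rng = random.Random(0)
X = [[rng.gauss(0, 1 + (j % 5)) for j in range(40)] for _ in range(6)]

selected, kept_columns = select_top_variance(X, k=3)
component = extract_first_component(X)
print(len(selected[0]), kept_columns, len(component))
```

Selection returns a subset of the original columns, so the kept features remain directly interpretable, whereas extraction returns a new feature that mixes all of the original ones.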
