Building Flexible, Scalable, and Machine Learning-Ready Multimodal Oncology Datasets
- PMID: 38475170
- PMCID: PMC10933897
- DOI: 10.3390/s24051634
Abstract
Advances in data acquisition, storage, and processing have driven rapid growth in heterogeneous medical data. Integrating radiological scans, histopathology images, and molecular information with clinical data is essential for developing a holistic understanding of disease and optimizing treatment. The need to integrate data from multiple sources is even more pronounced in complex diseases such as cancer, where it enables precision medicine and personalized treatment. This work proposes the Multimodal Integration of Oncology Data System (MINDS), a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources such as the Cancer Research Data Commons (CRDC) into an interconnected, patient-centric framework. MINDS consolidates over 41,000 cases from across repositories while achieving a high compression ratio relative to the 3.78 PB source data size. It offers sub-5-second query response times for interactive exploration, along with an interface for exploring relationships across data types and building cohorts for developing large-scale multimodal machine learning models. By harmonizing multimodal data, MINDS aims to give researchers greater analytical ability to uncover diagnostic and prognostic insights and to enable evidence-based personalized care. MINDS tracks granular end-to-end data provenance, ensuring reproducibility and transparency. Its cloud-native architecture can handle exponential data growth in a secure, cost-optimized manner while providing substantial storage optimization, replication avoidance, and dynamic access capabilities. Auto-scaling, access controls, and other mechanisms guarantee the scalability and security of the pipelines. MINDS overcomes the limitations of existing biomedical data silos through an interoperable, metadata-driven approach that represents a pivotal step toward the future of oncology data integration.
Keywords: cancer; cloud computing; data lake; data warehouse; deep learning; embeddings analysis; machine learning; multimodal; oncology.
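To make the patient-centric metadata idea in the abstract concrete, the following is a minimal sketch of cohort building over a metadata index. It is not the MINDS implementation: the table layout, column names, case identifiers, URIs, and the `build_index` and `build_cohort` helpers are all hypothetical, and an in-memory SQLite database stands in for whatever warehouse the system actually uses. The point it illustrates is that only pointers to the source files are stored, keyed by case, so a cohort can be selected without copying the underlying data.

```python
import sqlite3

# Hypothetical clinical metadata: (case_id, primary_site, age_at_diagnosis).
CLINICAL = [
    ("case-001", "lung", 64),
    ("case-002", "lung", 58),
    ("case-003", "breast", 47),
]

# Hypothetical file metadata: (case_id, modality, source, uri).
# Only pointers are kept; the images themselves stay in the source repository.
FILES = [
    ("case-001", "radiology", "repo-A", "s3://example-bucket/ct/001.dcm"),
    ("case-001", "histopathology", "repo-B", "s3://example-bucket/wsi/001.svs"),
    ("case-002", "radiology", "repo-A", "s3://example-bucket/ct/002.dcm"),
    ("case-003", "radiology", "repo-A", "s3://example-bucket/ct/003.dcm"),
    ("case-003", "histopathology", "repo-B", "s3://example-bucket/wsi/003.svs"),
]


def build_index() -> sqlite3.Connection:
    """Load the illustrative metadata into an in-memory, case-keyed index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clinical (case_id TEXT PRIMARY KEY, primary_site TEXT, age INTEGER)"
    )
    conn.execute(
        "CREATE TABLE files (case_id TEXT, modality TEXT, source TEXT, uri TEXT)"
    )
    conn.executemany("INSERT INTO clinical VALUES (?, ?, ?)", CLINICAL)
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", FILES)
    conn.commit()
    return conn


def build_cohort(conn: sqlite3.Connection, primary_site: str, modalities: list[str]) -> list[str]:
    """Return case IDs at the given site that have every requested modality."""
    wanted = sorted(set(modalities))  # de-duplicate so the HAVING count is exact
    placeholders = ",".join("?" for _ in wanted)
    query = f"""
        SELECT c.case_id
        FROM clinical AS c
        JOIN files AS f ON f.case_id = c.case_id
        WHERE c.primary_site = ? AND f.modality IN ({placeholders})
        GROUP BY c.case_id
        HAVING COUNT(DISTINCT f.modality) = ?
        ORDER BY c.case_id
    """
    rows = conn.execute(query, [primary_site, *wanted, len(wanted)])
    return [row[0] for row in rows]


if __name__ == "__main__":
    index = build_index()
    # Lung cases that have both imaging modalities available.
    print(build_cohort(index, "lung", ["radiology", "histopathology"]))
```

Grouping by case and counting distinct modalities is one simple way to require that every modality be present for a patient; a production system would likely add provenance columns and access controls, which this sketch omits.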
Conflict of interest statement
The authors declare no conflict of interest.