2024 Mar 2;24(5):1634.
doi: 10.3390/s24051634.

Building Flexible, Scalable, and Machine Learning-Ready Multimodal Oncology Datasets


Aakash Tripathi et al. Sensors (Basel).

Abstract

The advancements in data acquisition, storage, and processing techniques have resulted in the rapid growth of heterogeneous medical data. Integrating radiological scans, histopathology images, and molecular information with clinical data is essential for developing a holistic understanding of the disease and optimizing treatment. The need to integrate data from multiple sources is further pronounced in complex diseases such as cancer, where it enables precision medicine and personalized treatments. This work proposes the Multimodal Integration of Oncology Data System (MINDS), a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources such as the Cancer Research Data Commons (CRDC) into an interconnected, patient-centric framework. MINDS consolidates over 41,000 cases across repositories while achieving a high compression ratio relative to the 3.78 PB source data size, and it offers sub-5-second query response times for interactive exploration. MINDS provides an interface for exploring relationships across data types and building cohorts for developing large-scale multimodal machine learning models. By harmonizing multimodal data, MINDS aims to empower researchers with greater analytical ability to uncover diagnostic and prognostic insights and enable evidence-based personalized care. MINDS tracks granular end-to-end data provenance, ensuring reproducibility and transparency. The cloud-native architecture of MINDS can handle exponential data growth in a secure, cost-optimized manner while ensuring substantial storage optimization, replication avoidance, and dynamic access capabilities. Auto-scaling, access controls, and other mechanisms guarantee the scalability and security of its pipelines. MINDS overcomes the limitations of existing biomedical data silos via an interoperable, metadata-driven approach that represents a pivotal step toward the future of oncology data integration.

Keywords: cancer; cloud computing; data lake; data warehouse; deep learning; embeddings analysis; machine learning; multimodal; oncology.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
In the first stage, data are obtained from a primary source that may consist of the patients who comprise the bulk of the study. Access to additional data sources may still be needed to augment the original dataset; such data may be restricted, or only a minimal amount aligning with the original cohort may be available, limiting the ability to develop comprehensive models. Once collected, both sets undergo their respective data wrangling, which can be time-consuming and resource-intensive when new data sources are introduced due to differing data cataloging and formatting pipelines. Next, the model development phase consists of feature engineering and model training. Once developed, however, the model's effectiveness is limited by the scope of the original dataset. Validating and expanding existing models with new data sources requires extensive data wrangling and matching the data pipelines of the original researchers, leading to longer validation times and decreased productivity. Consequently, the broader applicability of these models remains largely unexplored, limiting the potential for advances in understanding complex diseases and developing effective treatments. In contrast, MINDS integrates multimodal data into a unified platform, where researchers can employ early-stage data fusion techniques that harness the rich potential of correlated multimodal data to improve inference and decision-making. For instance, in medical applications such as cancer research, integrating MR, X-ray, and ultrasound imaging data with modalities not acquired from radiological scans, such as histopathology slides, can yield more accurate and comprehensive insights into patient conditions than relying on any one modality alone.
Figure 2
The Genome Characterization Pipeline is illustrated as an example of data characterization. Data source sites collect tumor tissue and normal tissue samples from participating patients. The Biospecimen Core Resource (BCR) collects and processes the tissue samples and collects, harmonizes, and curates clinical data. Genome Characterization Centers (GCCs) generate data such as whole-genome sequencing, total RNA and microRNA sequencing, methylation arrays, and single-cell sequencing from the tissue samples received from the BCR. At the Genomic Data Analysis stage, the raw data from the previous stage are transformed into meaningful biological information. Data generated by the pipeline are made available via the GDC for use by researchers worldwide. The Center for Cancer Genomics (CCG) Genome Characterization Pipeline was originally published by the National Cancer Institute [21].
Figure 3
The 7 Vs of Big Data: Volume relates to data size; with more data, models can learn more and perform better. Variety refers to the data types involved; each type presents unique challenges and opportunities. Velocity considers the speed at which data accumulate; learning models need to remain current and adaptable. Veracity concerns the quality and integrity of the data, which must be credible and high-quality. Value focuses on the utility and benefits of the data. Variability pertains to data volatility, with changes in both temporal and spatial domains. Visualization depicts insights through visual representations and illustrations [40].
Figure 4
The MINDS architecture implements a three-stage pipeline designed to optimize data aggregation, data preparation, and data serving of multimodal datasets. Stage 1, data acquisition, involves acquiring structured and semi-structured data from sources such as GDC, including clinical records and biospecimen metadata. These are gathered, normalized, and securely stored in cloud object storage. Stage 2 is data processing: extract, transform, load (ETL) tools catalog the raw data into data lakes, transform it into structured relational formats, and load it into optimized data warehouses, generating analysis-ready clinical data. Stage 3 is data serving: the clinical data is served directly to researchers for preliminary exploration and visualization. Researchers can also build patient cohorts by querying on selection criteria, and MINDS will pull the corresponding unstructured data, such as images, from connected repositories, e.g., IDC.
Figure 5
Overview of the MINDS architecture implemented on AWS. (A) Data from multiple oncology sources are acquired. The pipeline for structured data is currently configured with GDC, with the ability to integrate other platforms, such as the University of California Santa Cruz Xena and cBioPortal. (B) The structured data from the source are ingested into an AWS data lake, where components such as an S3 bucket, Glue, and Lambda catalog and process the data. (C) Next, the data warehouse uses RDS and Redshift for structured data warehousing in the form of relational schemas. The cataloged data are available to Athena and QuickSight for analytics and visualization. (D) Users can directly query the structured data for visualization. All unstructured data download pipelines using the Data Commons APIs from the Cancer Research Data Commons (CRDC) are also shown. Using SQL queries, users can request all modalities of data associated with the cohort. As a result, all the data from PDC, GDC, and IDC are pulled together, harmonized, formatted, and presented to the user, ready for machine learning pre-processing.
Figure 6
QuickSight analytics and visualizations generated using clinical data from MINDS, filtered based on the condition mentioned in each sub-figure. These views showcase the data mining and hypothesis generation capacity gained by querying MINDS' consolidated case data and deriving tangible trends. The presented visualizations offer glimpses into the extensive cohort analytics and visualization capacities through which MINDS aims to accelerate discoveries by surfacing multidimensional correlations. (a) Count of Records by the Year of Death. (b) Count of Records by the Cause of Death. (c) Count of Records by Gender. (d) Count of Records by Ethnicity. (e) Count of Records by Classification of Tumor. (f) Count of Records by Progression or Recurrence.
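Count-of-records views like those in Figure 6 reduce to simple aggregations over the consolidated clinical table. The following is a minimal sketch of that idea with Pandas; the column names and records are illustrative placeholders, not the actual MINDS schema.

```python
import pandas as pd

# Hypothetical clinical records, mimicking fields one might pull from MINDS
# (column names and values here are illustrative only).
records = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "ethnicity": ["hispanic or latino", "not hispanic or latino",
                  "not hispanic or latino", "not reported",
                  "not hispanic or latino"],
    "cause_of_death": ["cancer related", None, "cancer related",
                       "not cancer related", None],
})

# Count-of-records summaries analogous to Figure 6 panels (c) and (b).
by_gender = records.groupby("gender").size()
by_cause = records["cause_of_death"].value_counts(dropna=True)

print(by_gender)  # female: 3, male: 2
print(by_cause)   # cancer related: 2, not cancer related: 1
```

In the deployed system, the equivalent aggregation would run as a SQL query against the warehouse and be rendered by QuickSight rather than computed client-side.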
Figure 7
The AWS Glue crawler automates ETL in MINDS through a 5-step workflow. (1) Establish secure database connections. (2) Apply custom classifiers to catalog raw data. (3) Transform data using built-in classifiers. (4) Merge classifier outputs into unified databases. (5) Upload the final catalog to processed data stores. The proposed workflow extracts, standardizes, and structures heterogeneous multimodal data from diverse sources to enable advanced analytics applications.
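The crawl-and-catalog workflow above can be pictured in pure Python. This is a toy sketch under stated assumptions, not the Glue implementation: the classifier logic, table name, and records are hypothetical, and steps 1 and 5 are stand-ins for the real source connection and catalog upload.

```python
# Toy sketch of the 5-step crawl-and-catalog workflow in plain Python.

def classify(record: dict) -> dict:
    """Steps 2-3: infer a column -> type schema from one raw record."""
    return {col: type(val).__name__ for col, val in record.items()}

def merge_schemas(schemas: list) -> dict:
    """Step 4: merge per-record schemas into one unified table schema."""
    merged = {}
    for schema in schemas:
        merged.update(schema)
    return merged

# Step 1: "connect" to a raw source (here, in-memory records).
raw_records = [
    {"case_id": "C-001", "age": 63, "diagnosis": "glioma"},
    {"case_id": "C-002", "age": 57, "vital_status": "alive"},
]

# Steps 2-4: classify each record and merge into a unified schema.
catalog = {"clinical": merge_schemas([classify(r) for r in raw_records])}

# Step 5: "upload" the final catalog (here, just inspect it).
print(catalog)
# {'clinical': {'case_id': 'str', 'age': 'int', 'diagnosis': 'str', 'vital_status': 'str'}}
```

The point of the merge step is visible in the output: records with different fields still yield one unified schema, which is what makes heterogeneous sources queryable downstream.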
Figure 8
Demonstrating the feasibility of deploying MINDS across cloud platforms, this diagram shows the mapping of key AWS services leveraged in the current implementation to their corresponding managed offerings on Google Cloud Platform (GCP). By abstracting underlying infrastructure into modular cloud services with standardized programmatic interfaces, MINDS aims for platform agnosticism without vendor lock-in. While technical considerations around service limits, access controls, and performance tuning differ across providers, the high-level architecture and methodology remain consistent. Through this interoperability, MINDS can ingest, process, analyze, and serve integrated multimodal datasets spanning storage systems, data pipelines, warehouses, and analytics products from multiple public cloud platforms.
Figure 9
Overview of the workflow in MINDS, starting from user query generation through returning the cohort data, structured and unstructured. The system starts with a user submitting an analytical query specifying cohort criteria. If the user requests structured data, the query is sent to a function that executes it against the consolidated EHR and clinical databases, returning a Pandas data frame containing matching patient records. Alternatively, if the user requests unstructured data for the cohort, the query is sent to another function that extracts a list of unique case IDs for patients meeting the criteria. This case list is then used to retrieve all associated unstructured data objects like medical images, genomic sequences, and pathology slides for those patients from connected repositories, including GDC, PDC, and IDC. The cohort-specific unstructured data extract is returned to the user for further analysis.
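The two paths in Figure 9 can be sketched as follows. This is a minimal, self-contained emulation: an in-memory SQLite table stands in for the consolidated clinical warehouse, and the table and column names are illustrative, not the actual MINDS schema.

```python
import sqlite3
import pandas as pd

# Stand-in for the consolidated clinical warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clinical (case_id TEXT, primary_site TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO clinical VALUES (?, ?, ?)",
    [("C-001", "Brain", "female"),
     ("C-002", "Lung", "male"),
     ("C-003", "Brain", "male")],
)

# Structured path: execute the cohort query and return matches as a DataFrame.
cohort = pd.read_sql("SELECT * FROM clinical WHERE primary_site = 'Brain'", conn)

# Unstructured path: extract the unique case IDs for the cohort; in MINDS these
# would drive downloads from GDC, PDC, and IDC via the CRDC Data Commons APIs.
case_ids = cohort["case_id"].unique().tolist()
print(case_ids)  # ['C-001', 'C-003']
```

Keeping the case ID as the join key between the structured query result and the unstructured repositories is what lets one SQL predicate select every modality for a cohort.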


References

    1. Boehm K., Khosravi P., Vanguri R., Gao J., Shah S. Harnessing multimodal data integration to advance precision oncology. Nat. Rev. Cancer. 2021;22:114–126. doi: 10.1038/s41568-021-00408-3.
    2. Waqas A., Dera D., Rasool G., Bouaynaya N.C., Fathallah-Shaykh H.M. Brain Tumor Segmentation and Surveillance with Deep Artificial Neural Networks. In: Deep Learning for Biomedical Data Analysis. Springer; Cham, Switzerland: 2021. pp. 311–350.
    3. Ektefaie Y., Dasoulas G., Noori A., Farhat M., Zitnik M. Multimodal learning with graphs. Nat. Mach. Intell. 2023;5:340–350. doi: 10.1038/s42256-023-00624-6.
    4. Lipkova J., Chen R.J., Chen B., Lu M.Y., Barbieri M., Shao D., Vaidya A.J., Chen C., Zhuang L., Williamson D.F., et al. Artificial intelligence for multimodal data integration in oncology. Cancer Cell. 2022;40:1095–1110. doi: 10.1016/j.ccell.2022.09.012.
    5. Waqas A., Tripathi A., Ramachandran R.P., Stewart P., Rasool G. Multimodal Data Integration for Oncology in the Era of Deep Neural Networks: A Review. arXiv. 2023. Available online: https://arxiv.org/abs/2303.06471 (accessed on 30 January 2024).