DREAMER: a computational framework to evaluate readiness of datasets for machine learning
- PMID: 38831432
- PMCID: PMC11149315
- DOI: 10.1186/s12911-024-02544-w
Abstract
Background: Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality is a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community.
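The abstract does not specify which readiness checks DREAMER performs. As a rough illustration only, the sketch below computes a few common tabular data-quality indicators (completeness, duplicate rows, constant features, class balance) with pandas; the function name, column names, and file path are illustrative assumptions, not part of DREAMER's actual interface.

```python
# Illustrative sketch only: basic tabular data-quality metrics of the kind a
# readiness assessment might report. Names and thresholds are assumptions.
import pandas as pd

def readiness_report(df: pd.DataFrame, target: str) -> dict:
    """Return a few basic quality indicators for a tabular dataset."""
    n_rows, n_cols = df.shape
    report = {
        "completeness": 1.0 - df.isna().mean().mean(),   # fraction of non-missing cells
        "duplicate_row_rate": df.duplicated().mean(),     # fraction of exact duplicate rows
        "constant_features": [c for c in df.columns if df[c].nunique(dropna=True) <= 1],
        "rows": n_rows,
        "features": n_cols - 1,
    }
    if target in df.columns:
        # Class balance: ratio of least to most frequent label (1.0 = perfectly balanced).
        counts = df[target].value_counts()
        report["class_balance"] = counts.min() / counts.max()
    return report

# Example usage with a hypothetical CSV file:
# df = pd.read_csv("clinical_data.csv")
# print(readiness_report(df, target="outcome"))
```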
Results: The proposed model in this study was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies.
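The abstract does not detail how extraneous features and rows are identified. The minimal stand-in below drops high-missingness features and incomplete rows, then compares a classifier's cross-validated accuracy before and after pruning; the threshold, the random-forest choice, and the example file and column names are assumptions, not DREAMER's algorithm.

```python
# Minimal pruning sketch (not DREAMER's actual method): drop features with too many
# missing values, drop remaining incomplete rows, and compare model accuracy.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def prune(df: pd.DataFrame, target: str, max_feature_missing: float = 0.3) -> pd.DataFrame:
    """Remove features missing more than max_feature_missing of values, then incomplete rows."""
    keep = [c for c in df.columns
            if c == target or df[c].isna().mean() <= max_feature_missing]
    return df[keep].dropna()

def cv_accuracy(df: pd.DataFrame, target: str) -> float:
    """Five-fold cross-validated accuracy of a random forest on the given dataset."""
    X = pd.get_dummies(df.drop(columns=[target]))  # simple one-hot encoding for categoricals
    y = df[target]
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# Example usage with a hypothetical dataset:
# raw = pd.read_csv("clinical_data.csv")
# pruned = prune(raw, target="outcome")
# print(cv_accuracy(raw.dropna(), "outcome"), cv_accuracy(pruned, "outcome"))
```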
Conclusion: Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through our proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.
Keywords: Data quality measure; Data readiness; Feature engineering; Machine learning.
© 2024. The Author(s).
Conflict of interest statement
The authors declare no competing interests.
Similar articles
- A comparative study of supervised and unsupervised machine learning algorithms applied to human microbiome. Clin Ter. 2024 May-Jun;175(3):98-116. doi: 10.7417/CT.2024.5051. PMID: 38767067
- Ensemble machine learning model trained on a new synthesized dataset generalizes well for stress prediction using wearable devices. J Biomed Inform. 2023 Dec;148:104556. doi: 10.1016/j.jbi.2023.104556. Epub 2023 Dec 2. PMID: 38048895
- ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data. Gigascience. 2020 Apr 1;9(4):giaa026. doi: 10.1093/gigascience/giaa026. PMID: 32249316. Free PMC article.
- The Utility of Unsupervised Machine Learning in Anatomic Pathology. Am J Clin Pathol. 2022 Jan 6;157(1):5-14. doi: 10.1093/ajcp/aqab085. PMID: 34302331. Review.
- A Comprehensive Review on Machine Learning in Healthcare Industry: Classification, Restrictions, Opportunities and Challenges. Sensors (Basel). 2023 Apr 22;23(9):4178. doi: 10.3390/s23094178. PMID: 37177382. Free PMC article. Review.
Cited by
- The cognitive impacts of large language model interactions on problem solving and decision making using EEG analysis. Front Comput Neurosci. 2025 Jul 16;19:1556483. doi: 10.3389/fncom.2025.1556483. eCollection 2025. PMID: 40741073. Free PMC article.
- The Venus score for the assessment of the quality and trustworthiness of biomedical datasets. BioData Min. 2025 Jan 9;18(1):1. doi: 10.1186/s13040-024-00412-x. PMID: 39780220. Free PMC article.