BMC Med Inform Decis Mak. 2024 Jun 4;24(1):152.
doi: 10.1186/s12911-024-02544-w.

DREAMER: a computational framework to evaluate readiness of datasets for machine learning

Meysam Ahangaran et al. BMC Med Inform Decis Mak.

Abstract

Background: Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality stands as a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community.

Results: The proposed model in this study was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies.

Conclusion: Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through our proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.

Keywords: Data quality measure; Data readiness; Feature engineering; Machine learning.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
DREAMER framework.
a The DREAMER architecture workflow delineates the process for evaluating the readiness of a tabular dataset for machine learning. Input to DREAMER comprises the tabular dataset under scrutiny, which undergoes a sequence of automated procedures, culminating in the generation of a structured tabular dataset conducive to machine learning analysis.
b The transformation of the data space D into the data readiness space D′ involves constructing a new dataset from the master dataset. The master dataset has dimensions N×M, while the data readiness dataset has dimensions d×k, where d is the number of random sub-tables and k is the number of data quality measures.
c The weights of the data quality measures are learned from dataset D′ using regression, with the average accuracy of clustering and classification serving as the regression target. After weight learning, the weighted total quality of each sub-table is computed to identify the sub-table with the highest data quality.
d The search space of DREAMER scales with the size of the master dataset (in both rows and columns). DREAMER is run R times, the best sub-table of each run is identified as a local maximum, and the sub-table with the highest data quality across runs is selected as a potential global maximum.
Algorithm 1. DREAMER v1.0 (Dataset D)
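The sampling-and-regression workflow described in the Fig. 1 caption can be sketched roughly as follows. Everything in this sketch is illustrative: the two quality measures, the proxy regression target, and the parameter values are assumptions, not DREAMER's actual metrics (the paper uses established data quality measures and real clustering/classification accuracy as the regression target).

```python
import numpy as np

rng = np.random.default_rng(0)

def quality_measures(sub):
    """Two illustrative quality measures for a numeric sub-table
    (stand-ins for DREAMER's actual metrics): completeness and
    1 - mean absolute pairwise feature correlation."""
    completeness = 1.0 - np.isnan(sub).mean()
    filled = np.nan_to_num(sub, nan=np.nanmean(sub))
    corr = np.corrcoef(filled, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    redundancy = np.abs(off_diag).mean() if off_diag.size else 0.0
    return np.array([completeness, 1.0 - redundancy])

def dreamer_sketch(D, d=50, rows=40, cols=4):
    """Sample d random sub-tables from the N x M master dataset, build
    the d x k readiness matrix D', learn measure weights by least-squares
    regression against a proxy target, and return the sub-table with the
    highest weighted total quality."""
    subs, Dprime = [], []
    for _ in range(d):
        r = rng.choice(D.shape[0], rows, replace=False)
        c = rng.choice(D.shape[1], cols, replace=False)
        sub = D[np.ix_(r, c)]
        subs.append(sub)
        Dprime.append(quality_measures(sub))
    Dprime = np.array(Dprime)                 # the d x k space D'
    # Proxy target for illustration only; the paper regresses against the
    # average clustering/classification accuracy of each sub-table.
    target = Dprime[:, 0]
    w, *_ = np.linalg.lstsq(Dprime, target, rcond=None)
    scores = Dprime @ w                       # weighted total quality
    best = int(np.argmax(scores))
    return subs[best], w, scores

N, M = 200, 10
D = rng.normal(size=(N, M))
D[rng.random(D.shape) < 0.05] = np.nan        # inject missing values
best_sub, weights, scores = dreamer_sketch(D)
print(best_sub.shape, weights.shape, scores.shape)
```

Running this once corresponds to a single local maximum in the figure's panel d; repeating it R times and keeping the overall best sub-table would approximate the multi-run search the caption describes.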
Fig. 2
Architecture of the DREAMER web framework.
a DREAMER comprises three primary components: the front-end, the API connection, and the back-end. In the front-end interface, users register and upload a raw CSV dataset file to the website. The API connection stage generates a JSON configuration file for the uploaded dataset, encompassing the DREAMER parameters; this JSON file, along with the master dataset, is then transmitted to the server. On the back-end, the principal DREAMER process operates on the master dataset, generating a cleansed CSV file accompanied by various reports and statistical analyses. Upon completion of the DREAMER process, users receive an email notification and can access the cleansed dataset and reports in the profile section of the website.
b DREAMER enhances the quality of raw datasets by raising data quality scores and improving the accuracy of classification and clustering algorithms. It selectively removes correlated features and rows from the original dataset to increase the overall quality score of the cleansed dataset.
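The upload flow in panel a can be illustrated with a hypothetical configuration payload. All field names below are invented for illustration; the caption does not specify the actual JSON schema used by the DREAMER web framework.

```python
import json

# Hypothetical parameter names; the real DREAMER JSON schema is not
# given in the figure caption.
config = {
    "dataset_file": "master_dataset.csv",
    "num_sub_tables": 1000,   # d: random sub-tables sampled per run
    "num_runs": 10,           # R: restarts, each yielding a local maximum
    "quality_measures": ["completeness", "redundancy", "outlier_rate"],
    "notify_email": "user@example.com",
}

payload = json.dumps(config, indent=2)
print(payload)
# In the described architecture, the front-end would send this JSON
# together with the uploaded CSV to the back-end, which runs the main
# DREAMER process and emails the user when the cleansed dataset and
# reports are ready.
```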
Fig. 3
Convergence analysis of the DREAMER framework across multiple datasets.
a Clustering and classification accuracy in the FHS dataset as a function of the number of random sub-tables.
b Data quality scores versus the number of random sub-tables in the FHS dataset.
c Data quality weights versus the number of random sub-tables in the FHS dataset.
d Clustering and classification accuracy in the ADNI dataset as a function of the number of random sub-tables.
e Data quality scores versus the number of random sub-tables in the ADNI dataset.
f Data quality weights versus the number of random sub-tables in the ADNI dataset.
g Clustering and classification accuracy in the WDBC dataset as a function of the number of random sub-tables.
h Data quality scores versus the number of random sub-tables in the WDBC dataset.
i Data quality weights versus the number of random sub-tables in the WDBC dataset.
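The kind of convergence these panels track (a quality score that plateaus as the number of random sub-tables grows) can be mimicked with a toy running maximum. This assumes each sub-table yields one scalar quality score; the beta-distributed scores and the tolerance/window stopping rule are invented for illustration and are not DREAMER's actual convergence criterion.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scalar quality score per random sub-table; the running
# maximum over the first d sub-tables is what a score-vs-d convergence
# curve plots.
scores = rng.beta(2, 5, size=2000)
running_best = np.maximum.accumulate(scores)

# Toy stopping rule: declare the curve stable once the running best has
# not improved by more than a small tolerance for a window of additional
# sub-tables.
tol, window = 1e-3, 500
improved = np.diff(running_best) > tol
last_improvement = int(np.max(np.nonzero(improved))) + 1 if improved.any() else 0
stable_after = last_improvement + window
print(f"best score {running_best[-1]:.3f}, stable after ~{stable_after} sub-tables")
```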
Fig. 4
DREAMER framework evaluation across multiple datasets.
a Raw versus cleansed data quality scores for the FHS dataset, illustrating the impact of DREAMER's data cleansing.
b Classification and clustering accuracies for raw versus cleansed FHS data.
c Raw versus cleansed data quality scores for the ADNI dataset.
d Classification and clustering accuracies for raw versus cleansed ADNI data.
e Raw versus cleansed data quality scores for the WDBC dataset.
f Classification and clustering accuracies for raw versus cleansed WDBC data.
