BMC Med Inform Decis Mak. 2024 Jun 4;24(1):152.
doi: 10.1186/s12911-024-02544-w.

DREAMER: a computational framework to evaluate readiness of datasets for machine learning

Meysam Ahangaran et al. BMC Med Inform Decis Mak.

Abstract

Background: Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality stands as a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community.

Results: The proposed model in this study was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies.

Conclusion: Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through our proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.

Keywords: Data quality measure; Data readiness; Feature engineering; Machine learning.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
DREAMER framework.
a The DREAMER architecture workflow delineates the process for evaluating the readiness of a tabular dataset for machine learning. Input to DREAMER comprises the tabular dataset under scrutiny, which undergoes a sequence of automated procedures, culminating in the generation of a structured tabular dataset conducive to machine learning analysis.
b The transformation of the data space D into the data readiness space D′ involves constructing a new dataset from the master dataset. The master dataset has dimensions N×M, while the data readiness dataset has dimensions d×k, where d is the number of random sub-tables and k is the number of data quality measures.
c The weights of the data quality measures are learned from dataset D′ using regression, with the average accuracy of clustering and classification serving as the regression target. After weight learning, the weighted total quality of each sub-table is computed to identify the sub-table with the highest data quality.
d The search space of DREAMER scales with the size of the master dataset (in both rows and columns). DREAMER is run R times, the best sub-table of each run is identified as a local maximum, and the sub-table with the highest data quality across runs is selected as a potential global maximum.
Algorithm 1. DREAMER v1.0 (Dataset D)
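The sampling-and-regression workflow described in the Fig. 1 caption can be sketched roughly as follows. Everything in this sketch is illustrative: the two quality measures, the proxy regression target, and the parameter values are assumptions, not DREAMER's actual metrics (the paper uses established data quality measures and real clustering/classification accuracy as the regression target).

```python
import numpy as np

rng = np.random.default_rng(0)

def quality_measures(sub):
    """Two illustrative quality measures for a numeric sub-table
    (stand-ins for DREAMER's actual metrics): completeness and
    1 - mean absolute pairwise feature correlation."""
    completeness = 1.0 - np.isnan(sub).mean()
    filled = np.nan_to_num(sub, nan=np.nanmean(sub))
    corr = np.corrcoef(filled, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    redundancy = np.abs(off_diag).mean() if off_diag.size else 0.0
    return np.array([completeness, 1.0 - redundancy])

def dreamer_sketch(D, d=50, rows=40, cols=4):
    """Sample d random sub-tables from the N x M master dataset, build
    the d x k readiness matrix D', learn measure weights by least-squares
    regression against a proxy target, and return the sub-table with the
    highest weighted total quality."""
    subs, Dprime = [], []
    for _ in range(d):
        r = rng.choice(D.shape[0], rows, replace=False)
        c = rng.choice(D.shape[1], cols, replace=False)
        sub = D[np.ix_(r, c)]
        subs.append(sub)
        Dprime.append(quality_measures(sub))
    Dprime = np.array(Dprime)                 # the d x k space D'
    # Proxy target for illustration only; the paper regresses against the
    # average clustering/classification accuracy of each sub-table.
    target = Dprime[:, 0]
    w, *_ = np.linalg.lstsq(Dprime, target, rcond=None)
    scores = Dprime @ w                       # weighted total quality
    best = int(np.argmax(scores))
    return subs[best], w, scores

N, M = 200, 10
D = rng.normal(size=(N, M))
D[rng.random(D.shape) < 0.05] = np.nan        # inject missing values
best_sub, weights, scores = dreamer_sketch(D)
print(best_sub.shape, weights.shape, scores.shape)
```

Running this once corresponds to a single local maximum in the figure's panel d; repeating it R times and keeping the overall best sub-table would approximate the multi-run search the caption describes.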
Fig. 2
Architecture of the DREAMER web framework.
a DREAMER comprises three primary components: the front-end, the API connection, and the back-end. In the front-end interface, users register and upload a raw CSV dataset file to the website. The API connection stage generates a JSON configuration file for the uploaded dataset, encompassing the DREAMER parameters; this JSON file, along with the master dataset, is then transmitted to the server. On the back-end, the principal DREAMER process operates on the master dataset, generating a cleansed CSV file accompanied by various reports and statistical analyses. Upon completion of the DREAMER process, users receive an email notification and can access the cleansed dataset and reports in the profile section of the website.
b DREAMER enhances the quality of raw datasets by raising data quality scores and improving the accuracy of classification and clustering algorithms. It selectively removes correlated features and rows from the original dataset to increase the overall quality score of the cleansed dataset.
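The upload flow in panel a can be illustrated with a hypothetical configuration payload. All field names below are invented for illustration; the caption does not specify the actual JSON schema used by the DREAMER web framework.

```python
import json

# Hypothetical parameter names; the real DREAMER JSON schema is not
# given in the figure caption.
config = {
    "dataset_file": "master_dataset.csv",
    "num_sub_tables": 1000,   # d: random sub-tables sampled per run
    "num_runs": 10,           # R: restarts, each yielding a local maximum
    "quality_measures": ["completeness", "redundancy", "outlier_rate"],
    "notify_email": "user@example.com",
}

payload = json.dumps(config, indent=2)
print(payload)
# In the described architecture, the front-end would send this JSON
# together with the uploaded CSV to the back-end, which runs the main
# DREAMER process and emails the user when the cleansed dataset and
# reports are ready.
```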
Fig. 3
Convergence analysis of the DREAMER framework across multiple datasets.
a Clustering and classification accuracy in the FHS dataset as a function of the number of random sub-tables.
b Data quality scores versus the number of random sub-tables in the FHS dataset.
c Data quality weights versus the number of random sub-tables in the FHS dataset.
d Clustering and classification accuracy in the ADNI dataset as a function of the number of random sub-tables.
e Data quality scores versus the number of random sub-tables in the ADNI dataset.
f Data quality weights versus the number of random sub-tables in the ADNI dataset.
g Clustering and classification accuracy in the WDBC dataset as a function of the number of random sub-tables.
h Data quality scores versus the number of random sub-tables in the WDBC dataset.
i Data quality weights versus the number of random sub-tables in the WDBC dataset.
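The kind of convergence these panels track (a quality score that plateaus as the number of random sub-tables grows) can be mimicked with a toy running maximum. This assumes each sub-table yields one scalar quality score; the beta-distributed scores and the tolerance/window stopping rule are invented for illustration and are not DREAMER's actual convergence criterion.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scalar quality score per random sub-table; the running
# maximum over the first d sub-tables is what a score-vs-d convergence
# curve plots.
scores = rng.beta(2, 5, size=2000)
running_best = np.maximum.accumulate(scores)

# Toy stopping rule: declare the curve stable once the running best has
# not improved by more than a small tolerance for a window of additional
# sub-tables.
tol, window = 1e-3, 500
improved = np.diff(running_best) > tol
last_improvement = int(np.max(np.nonzero(improved))) + 1 if improved.any() else 0
stable_after = last_improvement + window
print(f"best score {running_best[-1]:.3f}, stable after ~{stable_after} sub-tables")
```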
Fig. 4
DREAMER framework evaluation across multiple datasets.
a Raw versus cleansed data quality scores for the FHS dataset, illustrating the impact of DREAMER's data cleansing.
b Classification and clustering accuracies for raw versus cleansed FHS data.
c Raw versus cleansed data quality scores for the ADNI dataset.
d Classification and clustering accuracies for raw versus cleansed ADNI data.
e Raw versus cleansed data quality scores for the WDBC dataset.
f Classification and clustering accuracies for raw versus cleansed WDBC data.
