Front Artif Intell. 2024 Apr 24;7:1345179.
doi: 10.3389/frai.2024.1345179. eCollection 2024.

Xputer: bridging data gaps with NMF, XGBoost, and a streamlined GUI experience

Saleena Younus et al. Front Artif Intell. 2024.

Abstract

The rapid proliferation of data across diverse fields has accentuated the importance of accurate imputation for missing values. This task is crucial for ensuring data integrity and deriving meaningful insights. In response to this challenge, we present Xputer, a novel imputation tool that adeptly integrates Non-negative Matrix Factorization (NMF) with the predictive strengths of XGBoost. One of Xputer's standout features is its versatility: it supports zero imputation, enables hyperparameter optimization through Optuna, and allows users to define the number of iterations. For enhanced user experience and accessibility, we have equipped Xputer with an intuitive Graphical User Interface (GUI) ensuring ease of handling, even for those less familiar with computational tools. In performance benchmarks, Xputer often outperforms IterativeImputer in terms of imputation accuracy. Furthermore, Xputer autonomously handles a diverse spectrum of data types, including categorical, continuous, and Boolean, eliminating the need for prior preprocessing. Given its blend of performance, flexibility, and user-friendly design, Xputer emerges as a state-of-the-art solution in the realm of data imputation.

Keywords: ensemble learning; imputation; matrix factorization; mix-type data; tabular data.
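
The approach described in the abstract (matrix-factorization-based pre-imputation refined by per-column XGBoost models over several iterations) can be outlined with a minimal sketch. The code below is not the authors' Xputer implementation; it only illustrates the general NMF-plus-XGBoost idea using scikit-learn and xgboost, and the function name nmf_xgb_impute and its default settings are assumptions chosen for illustration.

# Minimal sketch of NMF-guided pre-imputation followed by per-column XGBoost
# prediction, iterated a few times. Illustrative only; not the Xputer code.
import numpy as np
from sklearn.decomposition import NMF
from xgboost import XGBRegressor

def nmf_xgb_impute(X, n_components=5, n_iter=3, random_state=0):
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)                            # remember where values are missing

    # 1) simple pre-imputation with column means
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(mask, col_means, X)

    # 2) NMF reconstruction (shifted to non-negative) refines the starting values
    shift = X_filled.min(axis=0, keepdims=True)
    nmf = NMF(n_components=n_components, init="nndsvda",
              max_iter=500, random_state=random_state)
    W = nmf.fit_transform(X_filled - shift)
    X_nmf = W @ nmf.components_ + shift
    X_filled[mask] = X_nmf[mask]

    # 3) iterate: each column with missing entries becomes a regression target
    for _ in range(n_iter):
        for j in np.where(mask.any(axis=0))[0]:
            observed = ~mask[:, j]
            features = np.delete(X_filled, j, axis=1)
            model = XGBRegressor(n_estimators=200, random_state=random_state)
            model.fit(features[observed], X_filled[observed, j])
            X_filled[mask[:, j], j] = model.predict(features[mask[:, j]])
    return X_filled

According to the abstract, Xputer additionally handles categorical and Boolean columns, supports zero imputation, and can tune XGBoost with Optuna; those parts are omitted from this sketch for brevity.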


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
The structure of Xputer. The core of Xputer comprises (A) a data preprocessing unit, (B) an adaptive matrix factorization unit, (C) an XGBoost hyperparameter search and model implementation unit, and (D) an iterative imputation unit.
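
As a rough illustration of the preprocessing unit in panel (A), the sketch below encodes mixed-type columns (categorical, Boolean, continuous) into a single numeric matrix while preserving missing values, so that downstream matrix factorization and XGBoost steps can operate on it. This is an assumption about what such a unit might look like, not Xputer's actual preprocessing logic; encode_mixed is a hypothetical helper.

# Encode mixed-type columns to numeric codes, keeping NaNs as NaNs
# (NaN passthrough in OrdinalEncoder requires scikit-learn >= 1.1).
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def encode_mixed(df: pd.DataFrame):
    df = df.copy()
    cat_cols = list(df.select_dtypes(include=["object", "category", "bool"]).columns)
    encoder = None
    if cat_cols:
        encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                                 unknown_value=np.nan)
        encoded = encoder.fit_transform(df[cat_cols])
        for i, col in enumerate(cat_cols):
            df[col] = encoded[:, i]        # replace column with numeric codes
    return df.to_numpy(dtype=float), encoder, cat_cols
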
Figure 2
Comparative evaluation of the algorithms' functionality. (A) A composite dataset of RNAseq and microarray data (1,770 × 400) was randomly masked to introduce missing values ranging from 1% to 25%, creating 25 distinct data matrices. For the unsupervised learning algorithms (autoencoder, NMF, and PCA), the entire dataset was first imputed with the column mean via SimpleImputer before running the algorithm. For the supervised algorithms, columns containing NaNs were identified first; each such column was isolated and designated as the label, while the remaining data were imputed with the column mean via SimpleImputer. After this preliminary imputation, the data were split into training and prediction sets corresponding to the non-NaN and NaN rows of the label column. (B) After imputation with a given algorithm, the imputed values were compared with their original counterparts to compute the mean squared error. Each box summarizes 25 measurements with missing values ranging from 1% to 25% (in increments of 1%), and the bars span the minimum to maximum values. (C) Time required to perform imputation with each algorithm. (D, E) Six imputed datasets were used to assess prediction of the Trametinib response via the XGBoost algorithm. **p < 0.01 and ***p < 0.001.
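
The masking-and-scoring protocol described in panels (A) and (B) can be summarized in a short sketch: a complete matrix is randomly masked at a chosen rate, imputed, and the imputed entries are scored against the held-out originals with the mean squared error. The column-mean baseline (SimpleImputer) mirrors the pre-imputation step in the caption; the seed and the helper name mask_and_score are illustrative assumptions.

# Randomly mask a complete matrix, impute, and score the masked entries.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error

def mask_and_score(X_true, missing_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(X_true.shape) < missing_rate   # hide a fraction of values
    X_missing = X_true.copy()
    X_missing[mask] = np.nan

    # column-mean baseline, as used for pre-imputation in the benchmark
    X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)
    return mean_squared_error(X_true[mask], X_imputed[mask])

# e.g. one score per missing rate from 1% to 25%, mirroring the 25 matrices in (A):
# scores = [mask_and_score(X, r / 100) for r in range(1, 26)]
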
Figure 3
Xputer hyperparameter evaluation. To assess Xputer's individual hyperparameters, six distinct datasets were used, each with a range of artificially introduced missing values; the imputed values were then compared against the originals. The hyperparameters examined were (A) pre-imputation strategies, (B) the number of XGBoost models for ensembling, (C) matrix factorization techniques, (D) XGBoost hyperparameter optimization using Optuna, and (E) the total iteration count. The “Bank marketing” data was from archive.ics.uci.edu/dataset/222/bank+marketing, the “Breast cancer” data was collected from archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic, the “NATICUSdroid” data was from archive.ics.uci.edu/dataset/722/naticusdroid+android+permissions+dataset, the “PIMA Indian” data was downloaded from www.kaggle.com/datasets/uciml/pima-indians-diabetes-database, and the “Student's dropout” data was collected from archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success.
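
Panel (D) refers to Optuna-based optimization of XGBoost hyperparameters. The sketch below shows a generic Optuna study of that kind on synthetic data; the search space, trial count, and scoring are assumptions for illustration, not the ranges used by Xputer.

# Generic Optuna search over a few XGBoost hyperparameters (illustrative only).
import optuna
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBRegressor(**params, random_state=0)
    # maximize negative MSE, i.e. minimize the squared error
    return cross_val_score(model, X, y, cv=3,
                           scoring="neg_mean_squared_error").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
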
Figure 4
Assessment of imputation duration. In this analysis, we used data with 7 to 97 features and sample sizes ranging from 569 to 29,332. The evaluation covered several metrics: (A) imputation time relative to the number of features, (B) time per missing value, (C) time per individual value, and (D) imputation time per feature in relation to the total number of features. Additionally, (E) we compared our results with IterativeImputer's performance. Number of features: Bank Marketing, 7; PIMA Indian, 8; Breast Cancer, 30; Students dropout, 36; NATICUSdroid, 86; Mixed Cancer, 97. Number of samples: Bank Marketing, 4,521; PIMA Indian, 768; Breast Cancer, 569; Students dropout, 3,630; NATICUSdroid, 29,332; Mixed Cancer, 1,770.
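
The timing metrics in panels (A–D) (total time, time per feature, per missing value, and per cell) can be computed with a small helper like the sketch below. scikit-learn's IterativeImputer stands in for whichever method is being timed, and timing_metrics is a hypothetical name.

# Time one imputation run and normalize by features, missing values, and cells.
import time
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def timing_metrics(X_missing):
    start = time.perf_counter()
    IterativeImputer(random_state=0).fit_transform(X_missing)
    elapsed = time.perf_counter() - start

    n_samples, n_features = X_missing.shape
    n_missing = int(np.isnan(X_missing).sum())
    return {
        "total_s": elapsed,
        "s_per_feature": elapsed / n_features,
        "s_per_missing_value": elapsed / max(n_missing, 1),
        "s_per_cell": elapsed / (n_samples * n_features),
    }
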
Figure 5
Comparative analysis of Xputer and IterativeImputer. Using six datasets of continuous data, we assessed the performance of Xputer against IterativeImputer. Performance was compared using both the RMSE metric (A) and density plots (B) across varying percentages of missing values.
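
A minimal version of the RMSE comparison in panel (A) is sketched below: any imputer exposing fit_transform can be scored on the artificially masked entries against the held-out originals. compare_rmse is a hypothetical helper, and Xputer itself is not invoked here.

# RMSE on masked entries for a set of candidate imputers (illustrative only).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

def compare_rmse(X_true, X_missing, imputers):
    """Return {name: RMSE on the masked entries} for each candidate imputer."""
    mask = np.isnan(X_missing)
    out = {}
    for name, imp in imputers.items():
        X_hat = imp.fit_transform(X_missing)
        out[name] = float(np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2)))
    return out

# e.g. compare_rmse(X_true, X_missing,
#                   {"mean": SimpleImputer(),
#                    "iterative": IterativeImputer(random_state=0)})
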

References

    1. Akiba T., Sano S., Yanase T., Ohta T., Koyama M. (2019). “Optuna: a next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '19), 2623–2631. doi: 10.1145/3292500.3330701
    2. Anand V., Mamidi V. (2020). “Multiple imputation of missing data in marketing,” in 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI) (Sakhir), 16. doi: 10.1109/ICDABI51230.2020.9325602
    3. Azur M. J., Stuart E. A., Frangakis C., Leaf P. J. (2011). Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res. 20, 40–49. doi: 10.1002/mpr.329
    4. Bottomly D., Long N., Schultz A. R., Kurtz S. E., Tognon C. E., Johnson K., et al. (2022). Integrative analysis of drug response and clinical outcome in acute myeloid leukemia. Cancer Cell 40, 850–864.
    5. Breiman L. (2001). Random forests. Mach. Learn. 45, 5–32. doi: 10.1023/A:1010933404324
