Front Artif Intell. 2024 Apr 24;7:1345179.
doi: 10.3389/frai.2024.1345179. eCollection 2024.

Xputer: bridging data gaps with NMF, XGBoost, and a streamlined GUI experience

Saleena Younus et al. Front Artif Intell. 2024.

Abstract

The rapid proliferation of data across diverse fields has accentuated the importance of accurate imputation for missing values. This task is crucial for ensuring data integrity and deriving meaningful insights. In response to this challenge, we present Xputer, a novel imputation tool that adeptly integrates Non-negative Matrix Factorization (NMF) with the predictive strengths of XGBoost. One of Xputer's standout features is its versatility: it supports zero imputation, enables hyperparameter optimization through Optuna, and allows users to define the number of iterations. For enhanced user experience and accessibility, we have equipped Xputer with an intuitive Graphical User Interface (GUI) ensuring ease of handling, even for those less familiar with computational tools. In performance benchmarks, Xputer often outperforms IterativeImputer in terms of imputation accuracy. Furthermore, Xputer autonomously handles a diverse spectrum of data types, including categorical, continuous, and Boolean, eliminating the need for prior preprocessing. Given its blend of performance, flexibility, and user-friendly design, Xputer emerges as a state-of-the-art solution in the realm of data imputation.

Keywords: ensemble learning; imputation; matrix factorization; mix-type data; tabular data.
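
The approach described in the abstract (matrix-factorization-based pre-imputation refined by per-column XGBoost models over several iterations) can be outlined with a minimal sketch. The code below is not the authors' Xputer implementation; it only illustrates the general NMF-plus-XGBoost idea using scikit-learn and xgboost, and the function name nmf_xgb_impute and its default settings are assumptions chosen for illustration.

# Minimal sketch of NMF-guided pre-imputation followed by per-column XGBoost
# prediction, iterated a few times. Illustrative only; not the Xputer code.
import numpy as np
from sklearn.decomposition import NMF
from xgboost import XGBRegressor

def nmf_xgb_impute(X, n_components=5, n_iter=3, random_state=0):
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)                            # remember where values are missing

    # 1) simple pre-imputation with column means
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(mask, col_means, X)

    # 2) NMF reconstruction (shifted to non-negative) refines the starting values
    shift = X_filled.min(axis=0, keepdims=True)
    nmf = NMF(n_components=n_components, init="nndsvda",
              max_iter=500, random_state=random_state)
    W = nmf.fit_transform(X_filled - shift)
    X_nmf = W @ nmf.components_ + shift
    X_filled[mask] = X_nmf[mask]

    # 3) iterate: each column with missing entries becomes a regression target
    for _ in range(n_iter):
        for j in np.where(mask.any(axis=0))[0]:
            observed = ~mask[:, j]
            features = np.delete(X_filled, j, axis=1)
            model = XGBRegressor(n_estimators=200, random_state=random_state)
            model.fit(features[observed], X_filled[observed, j])
            X_filled[mask[:, j], j] = model.predict(features[mask[:, j]])
    return X_filled

According to the abstract, Xputer additionally handles categorical and Boolean columns, supports zero imputation, and can tune XGBoost with Optuna; those parts are omitted from this sketch for brevity.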


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
The structure of Xputer. The core of Xputer comprises (A) a data preprocessing unit, (B) an adaptive matrix factorization unit, (C) an XGBoost hyperparameter search and model implementation unit, and (D) an iterative imputation unit.
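
As a rough illustration of the preprocessing unit in panel (A), the sketch below encodes mixed-type columns (categorical, Boolean, continuous) into a single numeric matrix while preserving missing values, so that downstream matrix factorization and XGBoost steps can operate on it. This is an assumption about what such a unit might look like, not Xputer's actual preprocessing logic; encode_mixed is a hypothetical helper.

# Encode mixed-type columns to numeric codes, keeping NaNs as NaNs
# (NaN passthrough in OrdinalEncoder requires scikit-learn >= 1.1).
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def encode_mixed(df: pd.DataFrame):
    df = df.copy()
    cat_cols = list(df.select_dtypes(include=["object", "category", "bool"]).columns)
    encoder = None
    if cat_cols:
        encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                                 unknown_value=np.nan)
        encoded = encoder.fit_transform(df[cat_cols])
        for i, col in enumerate(cat_cols):
            df[col] = encoded[:, i]        # replace column with numeric codes
    return df.to_numpy(dtype=float), encoder, cat_cols
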
Figure 2
Comparative evaluation of the algorithms' functionality. (A) A composite dataset of RNAseq and microarray data (1,770 × 400) was randomly masked to introduce missing values ranging from 1% to 25%, creating 25 distinct data matrices. For the unsupervised learning algorithms (autoencoder, NMF, and PCA), the entire dataset was first imputed with the column mean via SimpleImputer before running the algorithm. For the supervised algorithms, columns containing NaNs were identified first; each such column was isolated and designated as the label, while the remaining data were imputed with the column mean via SimpleImputer. After this preliminary imputation, the data were split into training and prediction sets corresponding to the non-NaN and NaN rows of the label column. (B) After imputation with a given algorithm, the imputed values were compared with their original counterparts to compute the mean squared error. Each box summarizes 25 measurements with missing values ranging from 1% to 25% (in increments of 1%), and the bars span the minimum to maximum values. (C) Time required to perform imputation with each algorithm. (D, E) Six imputed datasets were used to assess prediction of the Trametinib response via the XGBoost algorithm. **p < 0.01 and ***p < 0.001.
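
The masking-and-scoring protocol described in panels (A) and (B) can be summarized in a short sketch: a complete matrix is randomly masked at a chosen rate, imputed, and the imputed entries are scored against the held-out originals with the mean squared error. The column-mean baseline (SimpleImputer) mirrors the pre-imputation step in the caption; the seed and the helper name mask_and_score are illustrative assumptions.

# Randomly mask a complete matrix, impute, and score the masked entries.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error

def mask_and_score(X_true, missing_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(X_true.shape) < missing_rate   # hide a fraction of values
    X_missing = X_true.copy()
    X_missing[mask] = np.nan

    # column-mean baseline, as used for pre-imputation in the benchmark
    X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)
    return mean_squared_error(X_true[mask], X_imputed[mask])

# e.g. one score per missing rate from 1% to 25%, mirroring the 25 matrices in (A):
# scores = [mask_and_score(X, r / 100) for r in range(1, 26)]
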
Figure 3
Xputer hyperparameter evaluation. To assess Xputer's individual hyperparameters, six distinct datasets were used, each with a range of artificially introduced missing values; the imputed values were then compared against the originals. The hyperparameters examined were (A) pre-imputation strategies, (B) the number of XGBoost models for ensembling, (C) matrix factorization techniques, (D) XGBoost hyperparameter optimization using Optuna, and (E) the total iteration count. The “Bank marketing” data was from archive.ics.uci.edu/dataset/222/bank+marketing, the “Breast cancer” data was collected from archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic, the “NATICUSdroid” data was from archive.ics.uci.edu/dataset/722/naticusdroid+android+permissions+dataset, the “PIMA Indian” data was downloaded from www.kaggle.com/datasets/uciml/pima-indians-diabetes-database, and the “Student's dropout” data was collected from archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success.
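
Panel (D) refers to Optuna-based optimization of XGBoost hyperparameters. The sketch below shows a generic Optuna study of that kind on synthetic data; the search space, trial count, and scoring are assumptions for illustration, not the ranges used by Xputer.

# Generic Optuna search over a few XGBoost hyperparameters (illustrative only).
import optuna
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBRegressor(**params, random_state=0)
    # maximize negative MSE, i.e. minimize the squared error
    return cross_val_score(model, X, y, cv=3,
                           scoring="neg_mean_squared_error").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
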
Figure 4
Assessment of imputation duration. In this analysis, we used data with 7 to 97 features and sample sizes ranging from 569 to 29,332. The evaluation covered several metrics: (A) imputation time relative to the number of features, (B) time per missing value, (C) time per individual value, and (D) imputation time per feature in relation to the total number of features. Additionally, (E) we compared our results with IterativeImputer's performance. Number of features: Bank Marketing, 7; PIMA Indian, 8; Breast Cancer, 30; Students dropout, 36; NATICUSdroid, 86; Mixed Cancer, 97. Number of samples: Bank Marketing, 4,521; PIMA Indian, 768; Breast Cancer, 569; Students dropout, 3,630; NATICUSdroid, 29,332; Mixed Cancer, 1,770.
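
The timing metrics in panels (A–D) (total time, time per feature, per missing value, and per cell) can be computed with a small helper like the sketch below. scikit-learn's IterativeImputer stands in for whichever method is being timed, and timing_metrics is a hypothetical name.

# Time one imputation run and normalize by features, missing values, and cells.
import time
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def timing_metrics(X_missing):
    start = time.perf_counter()
    IterativeImputer(random_state=0).fit_transform(X_missing)
    elapsed = time.perf_counter() - start

    n_samples, n_features = X_missing.shape
    n_missing = int(np.isnan(X_missing).sum())
    return {
        "total_s": elapsed,
        "s_per_feature": elapsed / n_features,
        "s_per_missing_value": elapsed / max(n_missing, 1),
        "s_per_cell": elapsed / (n_samples * n_features),
    }
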
Figure 5
Comparative analysis of Xputer and IterativeImputer. Using six datasets of continuous data, we assessed the performance of Xputer against IterativeImputer. Performance was compared using both the RMSE metric (A) and density plots (B) across varying percentages of missing values.
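
A minimal version of the RMSE comparison in panel (A) is sketched below: any imputer exposing fit_transform can be scored on the artificially masked entries against the held-out originals. compare_rmse is a hypothetical helper, and Xputer itself is not invoked here.

# RMSE on masked entries for a set of candidate imputers (illustrative only).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

def compare_rmse(X_true, X_missing, imputers):
    """Return {name: RMSE on the masked entries} for each candidate imputer."""
    mask = np.isnan(X_missing)
    out = {}
    for name, imp in imputers.items():
        X_hat = imp.fit_transform(X_missing)
        out[name] = float(np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2)))
    return out

# e.g. compare_rmse(X_true, X_missing,
#                   {"mean": SimpleImputer(),
#                    "iterative": IterativeImputer(random_state=0)})
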

References

    1. Akiba T., Sano S., Yanase T., Ohta T., Koyama M. (2019). “Optuna: a next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '19), 2623–2631. doi: 10.1145/3292500.3330701
    2. Anand V., Mamidi V. (2020). “Multiple imputation of missing data in marketing,” in 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI) (Sakhir), 16. doi: 10.1109/ICDABI51230.2020.9325602
    3. Azur M. J., Stuart E. A., Frangakis C., Leaf P. J. (2011). Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res. 20, 40–49. doi: 10.1002/mpr.329
    4. Bottomly D., Long N., Schultz A. R., Kurtz S. E., Tognon C. E., Johnson K., et al. (2022). Integrative analysis of drug response and clinical outcome in acute myeloid leukemia. Cancer Cell 40, 850–864.
    5. Breiman L. (2001). Random forests. Mach. Learn. 45, 5–32. doi: 10.1023/A:1010933404324
