Robust imputation method with context-aware voting ensemble model for management of water-quality data
- PMID: 37499538
- DOI: 10.1016/j.watres.2023.120369
Abstract
Water-quality monitoring and management are crucial for ensuring the safety and sustainability of water resources. However, missing data is a frequent problem in water-quality datasets and can bias hydrological modeling and data analysis. While classic statistical methods and emerging machine/deep learning methods have been applied to impute missing values, most existing methods perform well only in specific missingness scenarios and do not generalize across scenarios. As a result, they often fail to impute missing values robustly when the missingness pattern varies. To address this problem, we propose an imputation method that uses a context-aware voting-ensemble model to dynamically select optimal weights for integrating various imputation models across different missingness scenarios. We first identify the attributes of missingness scenarios that influence imputation accuracy. Then, after introducing missing values into the collected data according to these scenarios, we measure the accuracy of various imputation models in each scenario. The weights of the imputation models are optimized by estimating non-linear functions with a regression model that captures the relationship between missingness scenarios and the imputation accuracy of each model. The final imputed value of the ensemble model for a given missingness scenario is obtained by multiplying each imputation model's weight by its imputed value and summing the products. The method inherits the advantages of state-of-the-art imputation models, including the ability to learn long-term dependencies in time series, and adds the flexibility of a dynamic weighting strategy to handle various missingness scenarios. To validate our method, we evaluate it on real-world water-quality data from a river in South Korea. The proposed method achieves higher accuracy and lower variation of imputed values than baseline models across various missingness scenarios. Furthermore, we show the applicability of our method to various hydrological environments by validating it on an industrial water-quality dataset. This study highlights the potential value of an ensemble model with dynamic weighting for robust imputation of water-quality data.
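The weighting scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scenario attributes (`missing_rate`, `gap_length`), the quadratic least-squares regressor standing in for the paper's non-linear regression model, and the inverse-error rule for turning predicted errors into weights are all assumptions made for illustration. Only the final step, a weighted sum of the base models' imputed values, is taken directly from the abstract.

```python
import numpy as np


def scenario_features(missing_rate, gap_length):
    """Hypothetical scenario descriptor, expanded with quadratic terms."""
    m, g = float(missing_rate), float(gap_length)
    return np.array([1.0, m, g, m * g, m * m, g * g])


def fit_error_models(scenarios, errors):
    """Fit one least-squares error predictor per imputation model.

    scenarios: sequence of (missing_rate, gap_length) pairs
    errors:    array-like of shape (n_scenarios, n_models) holding the
               imputation error measured for each model in each scenario
    Returns a coefficient matrix of shape (n_features, n_models).
    """
    X = np.stack([scenario_features(m, g) for m, g in scenarios])
    coef, *_ = np.linalg.lstsq(X, np.asarray(errors, dtype=float), rcond=None)
    return coef


def context_weights(coef, missing_rate, gap_length, eps=1e-6):
    """Convert predicted errors into weights that sum to 1.

    The inverse-error rule is an assumption; the abstract does not say
    how predicted accuracies are mapped to weights.
    """
    predicted = scenario_features(missing_rate, gap_length) @ coef
    inverse = 1.0 / np.maximum(predicted, eps)  # guard against non-positive predictions
    return inverse / inverse.sum()


def ensemble_impute(weights, imputed_values):
    """Weighted sum of the base models' imputed values."""
    return float(np.dot(weights, imputed_values))


if __name__ == "__main__":
    # Made-up calibration data: model 0 degrades with gap length, model 1 does not.
    scenarios = [(m, g) for m in (0.1, 0.3, 0.5) for g in (1, 5, 10)]
    errors = [[0.1 + 0.05 * g, 0.4] for _, g in scenarios]
    coef = fit_error_models(scenarios, errors)

    weights = context_weights(coef, missing_rate=0.2, gap_length=2)
    print(weights, ensemble_impute(weights, np.array([7.1, 7.4])))
```

In this sketch the weights shift toward whichever base model is predicted to be more accurate for the scenario at hand, which is the behaviour the abstract attributes to the context-aware ensemble.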
Keywords: Data imputation; Data management; Data quality; Missing data; Water quality.
Copyright © 2023 Elsevier Ltd. All rights reserved.
Conflict of interest statement
Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Similar articles
- The performance of prognostic models depended on the choice of missing value imputation algorithm: a simulation study. J Clin Epidemiol. 2024 Dec;176:111539. doi: 10.1016/j.jclinepi.2024.111539. Epub 2024 Sep 24. PMID: 39326470
- Outcome-sensitive multiple imputation: a simulation study. BMC Med Res Methodol. 2017 Jan 9;17(1):2. doi: 10.1186/s12874-016-0281-5. PMID: 28068910 Free PMC article.
- A nonparametric multiple imputation approach for missing categorical data. BMC Med Res Methodol. 2017 Jun 6;17(1):87. doi: 10.1186/s12874-017-0360-2. PMID: 28587662 Free PMC article.
- Identify the most appropriate imputation method for handling missing values in clinical structured datasets: a systematic review. BMC Med Res Methodol. 2024 Aug 28;24(1):188. doi: 10.1186/s12874-024-02310-6. PMID: 39198744 Free PMC article.
- Imputation of missing covariate in randomized controlled trials with a continuous outcome: Scoping review and new results. Pharm Stat. 2020 Nov;19(6):840-860. doi: 10.1002/pst.2041. Epub 2020 Jun 8. PMID: 32510791 Free PMC article.
Cited by
- Development of a Predictive Model for N-Dealkylation of Amine Contaminants Based on Machine Learning Methods. Toxics. 2024 Dec 22;12(12):931. doi: 10.3390/toxics12120931. PMID: 39771146 Free PMC article.
- Weighted Domain Adaptation Using the Graph-Structured Dataset Representation for Machinery Fault Diagnosis under Varying Operating Conditions. Sensors (Basel). 2023 Dec 28;24(1):188. doi: 10.3390/s24010188. PMID: 38203050 Free PMC article.
- Xputer: bridging data gaps with NMF, XGBoost, and a streamlined GUI experience. Front Artif Intell. 2024 Apr 24;7:1345179. doi: 10.3389/frai.2024.1345179. eCollection 2024. PMID: 38720912 Free PMC article.