Environ Monit Assess. 2024 Mar 2;196(4):332.
doi: 10.1007/s10661-024-12467-8.

Comparison of individual and ensemble machine learning models for prediction of sulphate levels in untreated and treated Acid Mine Drainage

Taskeen Hasrod et al. Environ Monit Assess.

Abstract

Machine learning (ML) was used to predict sulphate levels in an Acid Mine Drainage (AMD) water quality dataset, providing data for further evaluation of the potential extraction of octathiocane (S₈), a commercially useful by-product, from AMD. Individual ML regressor models, namely Linear Regression (LR), Least Absolute Shrinkage and Selection Operator (LASSO), Ridge (RD), Elastic Net (EN), K-Nearest Neighbours (KNN), Support Vector Regression (SVR), Decision Tree (DT), Extreme Gradient Boosting (XGBoost), Random Forest (RF) and Multi-Layer Perceptron Artificial Neural Network (MLP), as well as Stacking Ensemble (SE-ML) combinations of these models, were successfully used to predict sulphate levels. The best-performing model was an SE-ML regressor trained on untreated AMD that stacked seven of the best-performing individual models and fed their predictions to an LR meta-learner; it achieved a Mean Squared Error (MSE) of 0.000011, a Mean Absolute Error (MAE) of 0.002617 and an R² of 0.9997. Temperature (°C), Total Dissolved Solids (mg/L) and, importantly, iron (mg/L) were highly correlated with sulphate (mg/L); iron showed a strong positive linear correlation with sulphate, indicative of dissolved products from pyrite oxidation. Ensemble learning methods (bagging, boosting and stacking) outperformed individual models because they combine the predictive accuracy of several models. Surprisingly, SE-ML that combined all models and SE-ML that combined only the best-performing models differed only slightly in accuracy, indicating that including poorly performing models in the stack had no adverse effect on its predictive performance.
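The stacking arrangement described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes only two base learners (an ordinary least-squares model and a KNN regressor) rather than the seven used in the study, assumes five-fold out-of-fold predictions as the training input of the LR meta-learner, and uses synthetic data invented for the example. The function names (`fit_linear`, `fit_knn`, `fit_stack`) are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def fit_knn(X, y, k=5):
    """K-nearest-neighbours regressor: average the targets of the k closest rows."""
    def predict(Xn):
        dist = np.linalg.norm(Xn[:, None, :] - X[None, :, :], axis=2)
        nearest = np.argsort(dist, axis=1)[:, :k]
        return y[nearest].mean(axis=1)
    return predict

def fit_stack(X, y, base_fitters, n_folds=5, seed=0):
    """Stack base learners under a linear-regression meta-learner."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    oof = np.zeros((len(X), len(base_fitters)))  # out-of-fold base predictions
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(X)), held_out)
        for j, fit in enumerate(base_fitters):
            oof[held_out, j] = fit(X[train], y[train])(X[held_out])
    meta = fit_linear(oof, y)                    # meta-learner sees only out-of-fold predictions
    bases = [fit(X, y) for fit in base_fitters]  # refit each base model on all training rows
    return lambda Xn: meta(np.column_stack([b(Xn) for b in bases]))

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Synthetic stand-in data (NOT the study's dataset): three scaled features and
# a target with a linear part, a mild non-linear part and a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 0.6 * X[:, 2] + 0.3 * X[:, 1] + 0.1 * np.sin(6.0 * X[:, 0]) + rng.normal(0.0, 0.01, 200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

stack = fit_stack(X_train, y_train, [fit_linear, fit_knn])
stack_mse = mse(y_test, stack(X_test))
linear_mse = mse(y_test, fit_linear(X_train, y_train)(X_test))
knn_mse = mse(y_test, fit_knn(X_train, y_train)(X_test))
print(f"LR: {linear_mse:.6f}  KNN: {knn_mse:.6f}  stack: {stack_mse:.6f}")
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for rows the base models have already seen, keeps it from simply favouring whichever base model overfits the training data most.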

Keywords: Acid Mine Drainage; Environmental chemistry; Machine learning; Regression; Stacking ensemble machine learning; Sulphate.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Scatter matrix (below main diagonal) and Pearson’s correlation matrix (above main diagonal) of Pump A indicate the interrelationships between water quality parameters. Diagonal histogram and density plots indicate the distribution of each parameter. For the correlation matrix, red circles are positive correlations, blue circles are negative correlations and larger circles indicate more strongly correlated variables
Fig. 2
Dimensionality reduction and feature extraction results obtained from PCA for Pump A. (a) Bi-plot indicating the clustering of individual observations and its relation to the loadings plot. (b) Expanded view of the loadings plot indicating the relationship between parameters. (c) Elbow method plot indicating the optimal number of clusters. (d) Scree plot indicating the amount of variance explained by each principal component
Fig. 3
Regression algorithm NMSE comparison for Pump A showing (a) all individual baseline models, (b) all models and a stacking regressor containing all the models and (c) all well-performing models and a stacking regressor containing all the best-performing models
Fig. 4
Testing statistics (MSE, MAE and R²) accuracy comparison of regression models trained on Pump A. (a, b) Stacking regressor using all models. (c, d) Stacking regressor using only the best-performing models
Fig. 5
Scatter matrix (below main diagonal) and Pearson’s correlation matrix (above main diagonal) of Pump B indicate the interrelationships between water quality parameters. Diagonal histogram and density plots indicate the distribution of each parameter. For the correlation matrix, red circles are positive correlations, blue circles are negative correlations and larger circles indicate more strongly correlated variables
Fig. 6
Dimensionality reduction and feature extraction results obtained from PCA for Pump B. (a) Bi-plot indicating the clustering of individual observations and its relation to the loadings plot. (b) Expanded view of the loadings plot indicating the relationship between parameters. (c) Elbow method plot indicating the optimal number of clusters. (d) Scree plot indicating the amount of variance explained by each principal component
Fig. 7
Regression algorithm NMSE comparison for Pump B showing (a) all individual baseline models, (b) all models and a stacking regressor containing all the models and (c) all well-performing models and a stacking regressor containing all the best-performing models
Fig. 8
Testing statistics (MSE, MAE and R²) accuracy comparison of regression models trained on Pump B. (a, b) Stacking regressor using all models. (c, d) Stacking regressor using only the best-performing models
Fig. 9
Scatter matrix (below main diagonal) and Pearson’s correlation matrix (above main diagonal) of Treated Water indicate the interrelationships between water quality parameters. Diagonal histogram and density plots indicate the distribution of each parameter. For the correlation matrix, red circles are positive correlations, blue circles are negative correlations and larger circles indicate more strongly correlated variables
Fig. 10
Dimensionality reduction and feature extraction results obtained from PCA for the Treated Water. (a) Bi-plot indicating the clustering of individual observations and its relation to the loadings plot. (b) Expanded view of the loadings plot indicating the relationship between parameters. (c) Elbow method plot indicating the optimal number of clusters. (d) Scree plot indicating the amount of variance explained by each principal component
Fig. 11
Regression algorithm NMSE comparison for Treated Water showing (a) all individual baseline models, (b) all models and a stacking regressor containing all the models and (c) all well-performing models and a stacking regressor containing all the best-performing models
Fig. 12
Testing statistics (MSE, MAE and R²) accuracy comparison of regression models trained on Treated Water. (a, b) Stacking regressor using all models. (c, d) Stacking regressor using only the best-performing models
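The three testing statistics named in the captions above (MSE, MAE and R²) follow their standard definitions, which can be written out directly. The sketch below is illustrative only: the function name `regression_metrics` is hypothetical, and the observed and predicted values are made-up numbers chosen so the arithmetic is easy to check by hand, not values from the study.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R² for paired observed and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mse": ss_res / n,
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

# Made-up values for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(observed, predicted)
print(metrics)  # MSE = 0.125, MAE = 0.25, R² = 0.9 for these values
```

MSE and MAE are expressed in the (squared or absolute) units of the target, so their magnitude depends on how the target was scaled, whereas R² is unitless.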
