Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model
- PMID: 37828360
- PMCID: PMC10570374
- DOI: 10.1038/s41598-023-44326-w
Abstract
This study compared multiple supervised machine learning models, both regressors and classifiers, for predicting diamond prices. Diamond pricing is difficult to model because key features such as carat, cut, clarity, table, and depth relate to price non-linearly. The data were preprocessed in several steps: outliers were handled using the inter-quartile range method, numerical predictors were standardized, missing values in numerical features were imputed with the median, and correlation-based feature selection removed highly correlated variables to resolve multicollinearity. Equal-width binning was applied to the cut variable to handle class imbalance.
Among the models evaluated, the random forest (RF) regressor performed best. It achieved the lowest root mean squared error (RMSE), 523.50, and the highest R-squared (R²) score, 0.985, meaning it explained about 98.5% of the variance in diamond prices. The area under the curve for the RF classifier on the test set was 1.00 [Formula: see text], indicating perfect classification performance. The MLP regressor followed with an RMSE of 563.74 and an R² of 0.980; although its errors were slightly higher than the RF regressor's, further analysis is needed to determine its suitability and potential advantages. The XGBoost regressor achieved an RMSE of 612.88 and an R² of 0.972, and the boosted decision tree regressor an RMSE of 711.31 and an R² of 0.968. The KNN regressor was markedly less accurate, with an RMSE of 1346.65 and an R² of 0.887, and linear regression performed comparably, with an RMSE of 1395.41 and an R² of 0.876. Support vector regression had the highest RMSE, 3044.49, and the lowest R², 0.421, indicating limited effectiveness in capturing the complex relationships in the data.
Overall, RF outperformed the other models in both regression and classification, as evidenced by its lowest RMSE, highest R² score, and perfect classification performance, which makes it well suited to predicting diamond prices. The study provides a tool for the diamond industry and underlines the value of considering both regression and classification approaches when developing predictive models. The findings offer insights for pricing strategies, market trends, and decision-making in the diamond industry and related fields.
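The preprocessing steps and the two regression metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, the 1.5 multiplier on the inter-quartile range is the conventional default rather than a value stated in the abstract, and the abstract does not say whether flagged outliers were dropped or capped.

```python
import math
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is an assumed, conventional choice.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(statistics.fmean(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy feature column with one missing value and one extreme value.
raw = [1.0, 2.0, None, 4.0, 100.0]
filled = impute_median(raw)
low, high = iqr_bounds(filled)
kept = [v for v in filled if low <= v <= high]  # here outliers are dropped
scaled = standardize(kept)
```

RMSE is expressed in the units of the target (here, price), so lower is better, while R² is unitless and equals 1 for a perfect fit and 0 for a model that always predicts the mean. That is why the abstract reports both for each regressor.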
© 2023. Springer Nature Limited.
Conflict of interest statement
The authors declare no competing interests.