Classification-Based Detection and Quantification of Cross-Domain Data Bias in Materials Discovery

Giovanni Trezza¹, Eliodoro Chiavazzo¹

PMID: 39681303
DOI: 10.1021/acs.jcim.4c01766

Giovanni Trezza et al. J Chem Inf Model. 2025 Feb 24;65(4):1747-1761. doi: 10.1021/acs.jcim.4c01766. Epub 2024 Dec 16.

Affiliation

¹ Department of Energy, Politecnico di Torino, C.so Duca degli Abruzzi 24, Torino 10129, Italy.


Abstract

It stands to reason that the amount and the quality of data are of key importance for setting up accurate artificial intelligence (AI)-driven models. Among others, a fundamental aspect to consider is the bias introduced during sample selection in database generation. This is particularly relevant when a model is trained on a specialized data set to predict a property of interest and then applied to forecast the same property over samples having a completely different genesis. Indeed, the resulting biased model will likely produce unreliable predictions for many of those out-of-the-box samples, i.e., samples out of the training set. Neglecting such an aspect may hinder the AI-based discovery process, even when high-quality, sufficiently large, and highly reputable data sources are available. To address this challenge, we propose a new method that detects and quantifies data bias, reducing its impact on materials discovery. Our approach, aimed at identifying and excluding those out-of-the-box materials for which the predictions of a pretrained model are likely unreliable, leverages a classification strategy and is validated by means of superconductor and thermoelectric materials as two representative case studies. This methodology, designed to be simple, flexible, and easily adaptable to any architecture, including modern graph equivariant neural networks, aims to enhance the reliability of AI models when applied to diverse and previously unseen materials, thereby contributing to more reliable AI-driven materials discovery.
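The abstract does not spell out the classification strategy, so the following is only a minimal sketch of one common way such an idea can be realized, not the authors' method. It assumes a "domain classifier" is trained to tell training-set samples from candidate samples: held-out accuracy near 0.5 suggests the two sets are hard to distinguish (little detectable bias), while accuracy near 1.0 suggests a strong domain shift. Candidates that the classifier confidently assigns to the foreign domain are flagged as likely unreliable for a pretrained property model. The function names (`fit_logistic`, `domain_bias_report`), the plain logistic regression, the 0.9 threshold, and the synthetic Gaussian features are all hypothetical choices made for illustration.

```python
import numpy as np

def _with_bias(X):
    """Append a constant column so the model learns an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Logistic regression fitted by plain gradient descent (illustrative only)."""
    Xb = _with_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))   # predicted P(label = 1)
        w -= lr * Xb.T @ (p - y) / len(y)   # gradient of the mean log-loss
    return w

def predict_proba(w, X):
    """Probability that each row of X belongs to the target domain."""
    return 1.0 / (1.0 + np.exp(-_with_bias(X) @ w))

def domain_bias_report(X_train, X_target, threshold=0.9, seed=0):
    """Hypothetical helper: quantify how separable two sample sets are.

    Returns (held_out_accuracy, flagged), where `flagged` is a boolean mask
    over X_target marking samples the classifier confidently places outside
    the training domain. The threshold is an assumed value, not from the paper.
    """
    rng = np.random.default_rng(seed)
    X = np.vstack([X_train, X_target])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_target))])

    # Standardize features so gradient descent behaves on any feature scale.
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

    # Fit on one random half, measure accuracy on the other half.
    idx = rng.permutation(len(y))
    fit_idx, eval_idx = idx[: len(y) // 2], idx[len(y) // 2 :]
    w = fit_logistic(Xs[fit_idx], y[fit_idx])
    acc = np.mean((predict_proba(w, Xs[eval_idx]) > 0.5) == y[eval_idx])

    # For simplicity, all target samples are scored, including those used for
    # fitting; a stricter variant would use cross-validated probabilities.
    flagged = predict_proba(w, Xs[len(X_train):]) > threshold
    return acc, flagged

# Synthetic demonstration: one candidate set drawn like the training set,
# one drawn from a clearly shifted distribution.
rng = np.random.default_rng(42)
X_a = rng.normal(0.0, 1.0, size=(200, 3))
X_same = rng.normal(0.0, 1.0, size=(200, 3))
X_shifted = rng.normal(4.0, 1.0, size=(200, 3))

acc_same, _ = domain_bias_report(X_a, X_same)
acc_shift, flagged = domain_bias_report(X_a, X_shifted)
print(f"same-domain accuracy:    {acc_same:.2f}")
print(f"shifted-domain accuracy: {acc_shift:.2f}")
print(f"fraction of shifted samples flagged: {flagged.mean():.2f}")
```

In this toy setting the held-out accuracy acts as the bias measure and the flagged mask as the exclusion rule. In practice the feature vectors would be material descriptors, and any classifier that outputs probabilities could stand in for the logistic regression used here.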
