Sci Rep. 2022 Nov 11;12(1):19350.
doi: 10.1038/s41598-022-23327-1.

Benchmarking AutoML for regression tasks on small tabular data in materials design

Felix Conrad et al. Sci Rep.

Abstract

Machine learning has become increasingly important for materials engineering over the last decade. Globally, automated machine learning (AutoML) is growing in popularity with the increasing demand for data analysis solutions, yet it is rarely applied to small tabular data. Comparisons and benchmarks already exist to assess AutoML tools in general, but none of them addresses the conditions under which materials engineers work with experimental data: small datasets with fewer than 1000 samples. This benchmark addresses these conditions and pays special attention to overall competitiveness with manual data analysis. Four representative AutoML frameworks are evaluated on twelve domain-specific datasets to provide orientation on the promise of AutoML in materials engineering. Performance, robustness and usability are discussed in particular. The results lead to two main conclusions. First, AutoML is highly competitive with manual model optimization, even with little training time. Second, the way data are sampled into train and test sets is of crucial importance for reliable results.
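The second conclusion can be illustrated with a small experiment. The sketch below is not the authors' evaluation code: it uses synthetic data (40 samples, one feature), a plain least-squares line instead of an AutoML framework, and hypothetical helper names (`fit_line`, `split_scores`). It only shows how much the test R² can move when the same small dataset is split differently.

```python
import random
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination on a held-out set."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def split_scores(x, y, n_splits=20, test_fraction=0.25, seed=0):
    """Test R² for several random train/test splits of the same data."""
    rng = random.Random(seed)
    n_test = max(2, int(len(x) * test_fraction))
    scores = []
    for _ in range(n_splits):
        idx = list(range(len(x)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        preds = [slope * x[i] + intercept for i in test]
        scores.append(r2_score([y[i] for i in test], preds))
    return scores


# Synthetic stand-in for a small experimental dataset (assumption, not from the paper).
data_rng = random.Random(42)
x = [data_rng.uniform(0, 10) for _ in range(40)]
y = [2.0 * xi + 1.0 + data_rng.gauss(0, 3.0) for xi in x]

scores = split_scores(x, y)
print(f"R² min={min(scores):.3f} max={max(scores):.3f} mean={statistics.fmean(scores):.3f}")
```

Reporting the spread over many splits, rather than a single split, is one way to keep a lucky or unlucky partition from dominating the result.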

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Data processing pipeline for a typical data mining workflow in materials design, starting with a material dataset and ending with model testing. Steps within the highlighted space can be automated by use of AutoML tools.
Figure 2
Design space visualisation for the chosen datasets. (a) Histogram of one target value, normalized via standard scaler. (b) Representation of the standardized input feature space for the selected datasets via the first two principal components; the color of the points represents the target value.
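The projection described in panel (b) can be sketched as follows. This is a minimal illustration assuming NumPy and a synthetic feature matrix; the function name `first_two_components` is hypothetical, and the code is not taken from the paper. It standardizes each feature and projects the samples onto the first two principal components obtained from a singular value decomposition.

```python
import numpy as np


def first_two_components(X):
    """Standardize the columns of X and project onto the first two principal components."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features at zero instead of dividing by zero
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    projected = Z @ Vt[:2].T
    explained = (S ** 2) / np.sum(S ** 2)
    return projected, explained[:2]


# Synthetic stand-in for a materials dataset: 50 samples, 5 features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)  # two strongly correlated features

projected, ratio = first_two_components(X)
print(projected.shape, ratio)
```

Each row of `projected` gives the coordinates of one sample in such a scatter plot; coloring the points by the target value would be a separate plotting step.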
Figure 3
Workflow for the evaluation process.
Figure 4
The mean relative scores of the four tested AutoML frameworks and the best AutoML aggregation per training time. (a) Mean relative score based on R². (b) Mean relative score based on RMSE.
Figure 5
The relative score from the outer splits per task. Relative score means MAE_rel for Matbench-steels, MAPE_rel for Xiong and R²_rel otherwise. For Hu and Koya the literature provides a performance range, represented by a black “error bar”.
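The caption does not spell out how the relative score is computed. The sketch below is an assumption, not the paper's definition: it takes the relative score to be the ratio between the AutoML result and the literature reference, oriented so that values above 1 favor AutoML for both score metrics (R²) and error metrics (MAE, MAPE). The function name and the numbers are hypothetical.

```python
def relative_score(automl_value, reference_value, higher_is_better):
    """Ratio of AutoML result to literature reference; > 1 means AutoML is better.

    Assumed definition for illustration only:
      - score metrics such as R² (higher is better): automl / reference
      - error metrics such as MAE or MAPE (lower is better): reference / automl
    """
    if higher_is_better:
        return automl_value / reference_value
    return reference_value / automl_value


# Hypothetical values, not results from the paper.
r2_rel = relative_score(0.90, 0.80, higher_is_better=True)
mae_rel = relative_score(2.0, 4.0, higher_is_better=False)
print(f"R²_rel = {r2_rel:.3f}, MAE_rel = {mae_rel:.3f}")
```

Orienting both cases the same way lets tasks that report different metrics share one axis, which is presumably why the figure mixes MAE_rel, MAPE_rel and R²_rel.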
Figure 6
The R² performance with respect to dataset size and shape; one box represents all outer-loop runs of one dataset. (a) R² over dataset size; the boxes are slightly shifted to avoid overlapping, without affecting the interpretation of the graphic. (b) R² over dataset size divided by number of features.
