Leakage and the reproducibility crisis in machine-learning-based science

Sayash Kapoor et al. Patterns (N Y). 2023 Aug 4;4(9):100804. doi: 10.1016/j.patter.2023.100804. eCollection 2023 Sep 8.

Abstract

Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we identify 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we introduce. Finally, we conduct a reproducibility study of civil war prediction, where complex ML models are believed to vastly outperform traditional statistical models such as logistic regression (LR). When the errors are corrected, complex ML models do not perform substantively better than decades-old LR models.
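To make the core failure mode concrete, here is a minimal sketch of one textbook form of leakage on synthetic data (an illustration under assumed settings, not code from the paper): fitting a preprocessing step such as imputation on the full dataset before the train/test split, so that statistics of the test set leak into training. The dataset, imputer, and classifier are placeholder choices; with plain mean imputation the score gap is small, but the structural error is the same one that inflates results when preprocessing or feature selection is more aggressive.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.3] = np.nan   # knock out ~30% of entries

# Leaky version: the imputer sees the entire dataset before the split.
X_leaky = SimpleImputer().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Correct version: split first, then fit the imputer on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe = make_pipeline(SimpleImputer(), LogisticRegression(max_iter=1000))
clean = pipe.fit(X_tr, y_tr).score(X_te, y_te)

print(f"with leakage:    test accuracy = {leaky:.3f}")
print(f"without leakage: test accuracy = {clean:.3f}")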

Keywords: leakage; machine learning; reproducibility.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Survey of 22 papers that identify pitfalls in the adoption of ML methods across 17 fields, collectively affecting 294 papers. In each field, papers adopting ML methods suffer from data leakage. The column headings for types of data leakage, shown in bold, are based on our taxonomy of data leakage. We also highlight other issues that are reported in the papers: (1) computational reproducibility (the lack of availability of code, data, and computing environment to reproduce the exact results reported in the paper); (2) data quality (e.g., small size or large amounts of missing data); (3) metric choice (using incorrect metrics for the task at hand, e.g., using accuracy for measuring model performance in the presence of heavy class imbalance; a short sketch of this pitfall appears after the figure captions); and (4) standard dataset use, where issues are found despite the use of standard datasets in a field.
Figure 2
The sharp increase in civil war papers that use ML methods in the last few years. The number of political science papers containing the terms “civil war” and “machine learning” in the Dimensions database of academic research.
Figure 3
A comparison of reported and corrected results in civil war prediction papers published in top political science journals. The main findings of each of these papers are invalid due to various forms of data leakage: Muchlinski et al. impute the training and test data together, Colaresi and Mahmood and Wang incorrectly reuse an imputed dataset, and Kaufman et al. use proxies for the target variable that cause data leakage. When we correct these errors, complex ML models (such as AdaBoost and random forests) do not perform substantively better than decades-old logistic regression models for civil war prediction in each case. Each column in the table outlines the impact of leakage on the results of a paper. The figure above each column shows the difference in performance that results from fixing leakage.

