Leakage and the reproducibility crisis in machine-learning-based science

Sayash Kapoor et al. Patterns (N Y). 2023 Aug 4;4(9):100804. doi: 10.1016/j.patter.2023.100804. eCollection 2023 Sep 8.

Abstract

Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we identify 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we introduce. Finally, we conduct a reproducibility study of civil war prediction, where complex ML models are believed to vastly outperform traditional statistical models such as logistic regression (LR). When the errors are corrected, complex ML models do not perform substantively better than decades-old LR models.
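To make the core failure mode concrete, here is a minimal sketch of one textbook form of leakage on synthetic data (an illustration under assumed settings, not code from the paper): fitting a preprocessing step such as imputation on the full dataset before the train/test split, so that statistics of the test set leak into training. The dataset, imputer, and classifier are placeholder choices; with plain mean imputation the score gap is small, but the structural error is the same one that inflates results when preprocessing or feature selection is more aggressive.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.3] = np.nan   # knock out ~30% of entries

# Leaky version: the imputer sees the entire dataset before the split.
X_leaky = SimpleImputer().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Correct version: split first, then fit the imputer on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe = make_pipeline(SimpleImputer(), LogisticRegression(max_iter=1000))
clean = pipe.fit(X_tr, y_tr).score(X_te, y_te)

print(f"with leakage:    test accuracy = {leaky:.3f}")
print(f"without leakage: test accuracy = {clean:.3f}")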

Keywords: leakage; machine learning; reproducibility.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Survey of 22 papers that identify pitfalls in the adoption of ML methods across 17 fields, collectively affecting 294 papers. In each field, papers adopting ML methods suffer from data leakage. The column headings for types of data leakage, shown in bold, are based on our taxonomy of data leakage. We also highlight other issues that are reported in the papers: (1) computational reproducibility (the lack of availability of code, data, and computing environment to reproduce the exact results reported in the paper); (2) data quality (e.g., small size or large amounts of missing data); (3) metric choice (using incorrect metrics for the task at hand, e.g., using accuracy for measuring model performance in the presence of heavy class imbalance; a short sketch of this pitfall appears after the figure captions); and (4) standard dataset use, where issues are found despite the use of standard datasets in a field.
Figure 2
The sharp increase in civil war papers that use ML methods in the last few years. The number of political science papers containing the terms “civil war” and “machine learning” in the Dimensions database of academic research.
Figure 3
A comparison of reported and corrected results in civil war prediction papers published in top political science journals. The main findings of each of these papers are invalid due to various forms of data leakage: Muchlinski et al. impute the training and test data together, Colaresi and Mahmood and Wang incorrectly reuse an imputed dataset, and Kaufman et al. use proxies for the target variable that cause data leakage. When we correct these errors, complex ML models (such as AdaBoost and random forests) do not perform substantively better than decades-old logistic regression models for civil war prediction in each case. Each column in the table outlines the impact of leakage on the results of a paper. The figure above each column shows the difference in performance that results from fixing leakage.

