Review
Philos Trans A Math Phys Eng Sci. 2016 Nov 13;374(2080):20160153.
doi: 10.1098/rsta.2016.0153.

Big data need big theory too

Peter V Coveney et al. Philos Trans A Math Phys Eng Sci.

Abstract

The current interest in big data, machine learning and data analytics has generated the widespread impression that such methods are capable of solving most problems without the need for conventional scientific methods of inquiry. Interest in these methods is intensifying, accelerated by the ease with which digitized data can be acquired in virtually all fields of endeavour, from science, healthcare and cybersecurity to economics, social sciences and the humanities. In multiscale modelling, machine learning appears to provide a shortcut to reveal correlations of arbitrary complexity between processes at the atomic, molecular, meso- and macroscales. Here, we point out the weaknesses of pure big data approaches with particular focus on biology and medicine, which fail to provide conceptual accounts for the processes to which they are applied. No matter their 'depth' and the sophistication of data-driven methods, such as artificial neural nets, in the end they merely fit curves to existing data. Not only do these methods invariably require far larger quantities of data than anticipated by big data aficionados in order to produce statistically reliable results, but they can also fail in circumstances beyond the range of the data used to train them because they are not designed to model the structural characteristics of the underlying system. We argue that it is vital to use theory as a guide to experimental design for maximal efficiency of data collection and to produce reliable predictive models and conceptual knowledge. Rather than continuing to fund, pursue and promote 'blind' big data projects with massive budgets, we call for more funding to be allocated to the elucidation of the multiscale and stochastic processes controlling the behaviour of complex systems, including those of life, medicine and healthcare.

This article is part of the themed issue 'Multiscale modelling at the physics-chemistry-biology interface'.

Keywords: big data; biomedicine; epistemology; machine learning; personalized medicine.

Figures

Figure 1.
Three-dimensional plots that illustrate the way in which properties of a system (the ‘landscape’), here shown for two variables in order to be able to visualize it, can defeat machine learning approaches in some circumstances. (a) For simple situations where the landscape is relatively smooth and the number of such variables is small (two are shown here for simplicity, though the number is usually far greater), machine learning methods can be expected to do a good job of predicting behaviour over the domain in which they have been trained, just as it is easy to trace a smoothly changing slope. (b) When the landscape has more complex and rapidly varying (including e.g. fractal) character, or (c) has a ‘pathological’ form (essentially featureless except for a few controlling singularities), such learning activities are prone to fail without a level of data coverage which is too dense to be practically feasible. Extrapolation is unreliable in all instances; interpolation is also hazardous in (b,c).
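
The caption's point can be reproduced in one dimension with a minimal sketch. Everything in it is an illustrative assumption rather than something taken from the article: the functions `smooth` and `rough` stand in for landscapes (a) and (b), a piecewise-linear interpolator stands in for a generic data-driven method, and the names `fit_piecewise_linear`, `max_error` and `experiment` are hypothetical.

```python
import bisect
import math


def fit_piecewise_linear(xs, ys):
    """Return a predictor that joins the training points with straight lines.

    Outside the training range the nearest end segment is extended linearly,
    which is this toy model's only way to extrapolate.
    """
    def predict(x):
        # Locate the segment containing x, clamped to the first/last segment.
        i = bisect.bisect_right(xs, x) - 1
        i = max(0, min(i, len(xs) - 2))
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        return ys[i] + slope * (x - xs[i])
    return predict


def max_error(f, predict, lo, hi, n_test=2001):
    """Largest absolute prediction error on an even grid over [lo, hi]."""
    step = (hi - lo) / (n_test - 1)
    return max(abs(f(lo + k * step) - predict(lo + k * step))
               for k in range(n_test))


def smooth(x):
    # Assumed stand-in for panel (a): one slow oscillation.
    return math.sin(2 * math.pi * x)


def rough(x):
    # Assumed stand-in for panel (b): the same trend plus a fast oscillation.
    return math.sin(2 * math.pi * x) + 0.5 * math.sin(2 * math.pi * 37 * x)


def experiment(f, n_train):
    """Train on n_train even samples in [0, 1]; test inside and outside."""
    xs = [k / (n_train - 1) for k in range(n_train)]
    ys = [f(x) for x in xs]
    model = fit_piecewise_linear(xs, ys)
    return {
        "interp": max_error(f, model, 0.0, 1.0),  # inside the training domain
        "extrap": max_error(f, model, 1.0, 2.0),  # beyond the training domain
    }


if __name__ == "__main__":
    cases = [("smooth", smooth, 21), ("rough", rough, 21), ("rough", rough, 1001)]
    for name, f, n in cases:
        r = experiment(f, n)
        print(f"{name:6s} n={n:5d} interp={r['interp']:.3f} extrap={r['extrap']:.3f}")
```

Under these assumptions, 21 training samples are enough to interpolate the smooth function accurately but not the rough one, which needs on the order of a thousand samples for comparable accuracy. The extrapolation error stays large in every case, because the fitted lines carry no information about the structure of the function outside the training domain.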
