Proc Natl Acad Sci U S A. 2020 Sep 8;117(36):21857-21864.
doi: 10.1073/pnas.1919995117. Epub 2020 Aug 25.

Archetypal landscapes for deep neural networks


Philipp C Verpoort et al. Proc Natl Acad Sci U S A. 2020.

Abstract

The predictive capabilities of deep neural networks (DNNs) continue to evolve to increasingly impressive levels. However, it is still unclear how training procedures for DNNs succeed in finding parameters that produce good results for such high-dimensional and nonconvex loss functions. In particular, we wish to understand why simple optimization schemes, such as stochastic gradient descent, do not end up trapped in local minima with high loss values that would not yield useful predictions. We explain the optimizability of DNNs by characterizing the local minima and transition states of the loss-function landscape (LFL) along with their connectivity. We show that the LFL of a DNN in the shallow network or data-abundant limit is funneled, and thus easy to optimize. Crucially, in the opposite low-data/deep limit, although the number of minima increases, the landscape is characterized by many minima with similar loss values separated by low barriers. This organization is different from the hierarchical landscapes of structural glass formers and explains why minimization procedures commonly employed by the machine-learning community can navigate the LFL successfully and reach low-lying solutions.
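The abstract's central question, why simple gradient-based minimization is not trapped in poor local minima, can be made concrete with a toy sketch (not the paper's method): run plain gradient descent from many random starts on a tiny two-parameter network and record the distinct minima reached. Everything below (the dataset, the one-hidden-unit model, the learning rate) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression task: learn y = tanh(2x) with a one-hidden-unit network
# pred(x) = b * tanh(a * x). Data, model, and hyperparameters are made up.
X = np.linspace(-1.0, 1.0, 20)
y = np.tanh(2.0 * X)

def loss(w):
    a, b = w
    return float(np.mean((b * np.tanh(a * X) - y) ** 2))

def grad(w, eps=1e-6):
    # Central-difference gradient keeps the sketch dependency-light.
    g = np.zeros_like(w)
    for i in range(len(w)):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (loss(w + d) - loss(w - d)) / (2 * eps)
    return g

minima = []
for _ in range(50):                      # random restarts
    w = rng.normal(size=2) * 2.0
    for _ in range(5000):                # plain gradient descent
        w = w - 0.2 * grad(w)
    key = tuple(np.round(w, 2))          # group runs ending at the same minimum
    if key not in [k for k, _ in minima]:
        minima.append((key, loss(w)))

losses = sorted(l for _, l in minima)
print(f"{len(minima)} distinct minima; lowest loss = {losses[0]:.4f}")
```

Even this miniature landscape has multiple minima with identical loss: the sign symmetry (a, b) → (-a, -b) produces at least two equivalent global minima, a small-scale version of the "many minima with similar loss values" picture the paper describes.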

Keywords: deep learning; energy landscapes; neural networks; optimization; statistical mechanics.


Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
A DNN. Blue, input; red, output; green, hidden nodes.
Fig. 2.
Disconnectivity graphs for Ndata ∈ {100, 1,000, 2,000} training data (from top to bottom) for the DNNs with H ∈ {1, 2, 3} hidden layers (from left to right), labeled as “DATASET-#HL-#” at the top of each panel, where the two placeholders indicate, respectively, the number of hidden layers, H, and the amount of training data, Ndata. Only the lowest 2,000 minima (or all, if fewer than 2,000 were identified) are shown, and the vertical scale has been adjusted to span the range of loss-function values within this set. Included as Insets below each disconnectivity graph are graphical visualizations of the performance of the global minimum (see SI Appendix, section S2 for details), as well as a plot of the training (horizontal axis) versus testing (vertical axis) loss values of all minima. It is apparent from these graphs that, in each case, the structure of the LFL is either funneled or comprises many minima with similar loss values connected by low barriers.
Fig. 3.
Continuation of Fig. 2 with Ndata ∈ {10,000, 100,000}.
Fig. 4.
Disconnectivity graphs for the training datasets OPTDIG (with Ndata ∈ {1,500, 5,000} and H ∈ {1, 3}) and WINE (with Ndata = 1,500 and H ∈ {1, 3}). Only the lowest 2,000 minima (or all of them, if fewer than 2,000 were found) are shown. The vertical scale is adjusted to span the range of loss-function values within this set.
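The disconnectivity graphs in Figs. 2-4 encode which minima merge into a common superbasin below each loss threshold: two minima belong to the same branch at level E if the transition state connecting them lies below E. A hedged sketch of that bookkeeping (the minima, barrier values, and labels below are made up for illustration):

```python
# Toy superbasin analysis underlying a disconnectivity graph (not the
# authors' code). Minima whose connecting transition state lies below a
# loss threshold merge; scanning thresholds from low to high yields the
# tree structure drawn in the figures.

minima = {"A": 0.10, "B": 0.12, "C": 0.11, "D": 0.45}   # minimum -> loss
# (m1, m2) -> loss of the transition state connecting the pair
barriers = {("A", "B"): 0.15, ("B", "C"): 0.16, ("C", "D"): 0.60}

def superbasins(threshold):
    """Partition minima into groups mutually connected below `threshold`."""
    parent = {m: m for m in minima}     # union-find over minima
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for (m1, m2), ts in barriers.items():
        if ts <= threshold:
            parent[find(m1)] = find(m2)     # merge the two basins
    groups = {}
    for m in minima:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(g) for g in groups.values())

print(superbasins(0.14))  # all four minima still separate
print(superbasins(0.20))  # A, B, C merge over low barriers; D stays apart
print(superbasins(0.70))  # everything joins one superbasin
```

In this toy landscape, A, B, and C have similar loss values and merge at a low threshold, while D joins only at a much higher one; the paper's low-data/deep-limit landscapes resemble the A-B-C pattern, many near-degenerate minima separated by low barriers.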

