A mean field view of the landscape of two-layer neural networks

Song Mei¹, Andrea Montanari²,³, Phan-Minh Nguyen⁴

Affiliations

¹ Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305.
² Department of Electrical Engineering, Stanford University, Stanford, CA 94305; montanari@stanford.edu.
³ Department of Statistics, Stanford University, Stanford, CA 94305.
⁴ Department of Electrical Engineering, Stanford University, Stanford, CA 94305.

PMID: 30054315
PMCID: PMC6099898
DOI: 10.1073/pnas.1806579115

Song Mei et al. Proc Natl Acad Sci U S A. 2018 Aug 14;115(33):E7665-E7671. doi: 10.1073/pnas.1806579115. Epub 2018 Jul 27.

Abstract

Multilayer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires optimizing a nonconvex high-dimensional objective (risk function), a problem that is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the former case, does this happen because local minima are absent or because SGD somehow avoids them? In the latter, why do local minima reached by SGD have good generalization properties? In this paper, we consider a simple case, namely two-layer neural networks, and prove that, in a suitable scaling limit, SGD dynamics is captured by a certain nonlinear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples and show how DD can be used to prove convergence of SGD to networks with nearly ideal generalization error. This description allows for "averaging out" some of the complexities of the landscape of neural networks and can be used to prove a general convergence result for noisy SGD.

Keywords: Wasserstein space; gradient flow; neural networks; partial differential equations; stochastic gradient descent.
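The SGD dynamics that the abstract refers to can be illustrated with a minimal NumPy sketch. Everything beyond the abstract and figure captions is an assumption made for illustration: the network output is taken to be the average of $N$ neuron activations (mean field scaling), the loss is the squared error, the two classes are centered isotropic Gaussians with standard deviations $1 ± Δ$, and the piecewise linear activation parameters `S1`, `S2`, `T1`, `T2` are hypothetical values. The function names are likewise invented for this sketch. Only the initialization $\mathsf{N}(0, 0.8^{2}/d \cdot I_{d})$ and the default sizes are taken from the Fig. 1 caption.

```python
import numpy as np

# Hypothetical activation parameters, chosen only for illustration.
S1, S2, T1, T2 = -2.5, 7.5, 0.5, 1.5


def sigma(t):
    """Piecewise linear activation: S1 for t <= T1, S2 for t >= T2, linear in between."""
    return S1 + (S2 - S1) * np.clip((t - T1) / (T2 - T1), 0.0, 1.0)


def sigma_prime(t):
    """Derivative of sigma: constant slope inside (T1, T2), zero outside."""
    return np.where((t > T1) & (t < T2), (S2 - S1) / (T2 - T1), 0.0)


def sgd_two_layer(d=40, n_neurons=800, n_steps=1000, step=1e-6, delta=0.8, seed=0):
    """One-pass SGD on the squared loss for y_hat(x) = mean_i sigma(<w_i, x>).

    Assumed data model: y is +1 or -1 with equal probability and
    x ~ N(0, (1 + y * delta)^2 I_d). The 1/N factor of the gradient is
    absorbed into the step size (mean field scaling).
    """
    rng = np.random.default_rng(seed)
    # Initial weights w_i ~ N(0, 0.8^2 / d * I_d), as in the Fig. 1 caption.
    w = rng.normal(0.0, 0.8 / np.sqrt(d), size=(n_neurons, d))
    for _ in range(n_steps):
        y = rng.choice([-1.0, 1.0])
        x = (1.0 + y * delta) * rng.normal(size=d)
        pre = w @ x                      # pre-activations <w_i, x>
        y_hat = sigma(pre).mean()        # network output
        # Per-neuron update: w_i += 2 * step * (y - y_hat) * sigma'(<w_i, x>) * x
        w += 2.0 * step * (y - y_hat) * sigma_prime(pre)[:, None] * x[None, :]
    return w


def radial_norms(w):
    """Euclidean norm of each neuron's weight vector (empirical radial distribution)."""
    return np.linalg.norm(w, axis=1)
```

A histogram of `radial_norms(sgd_two_layer())` is the kind of empirical radial distribution that Fig. 1 compares against the DD solution. With the small default number of steps the weights move very little from their initialization, so reproducing the figure would require far longer runs.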

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Fig. 1.**
Evolution of the radial distribution ${\bar{ρ}}_{t}$ for the isotropic Gaussian model, with $Δ = 0.8$. Histograms are obtained from SGD experiments with $d = 40$, $N = 800$, initial weight distribution $ρ_{0} = \mathsf{N}(0, 0.8^{2}/d \cdot I_{d})$, step size $ε = 10^{-6}$, and $ξ(t) = 1$. Continuous lines correspond to a numerical solution of the DD (Eq. 13).

**Fig. 2.**
Population risk in the problem of separating two isotropic Gaussians, as a function of the separation parameter $Δ$. We use a two-layer network with piecewise linear activation, no offset, and output weights equal to 1. Empirical results obtained by SGD (a single run per data point) are marked "+". Continuous lines are theoretical predictions obtained by numerically minimizing $R(ρ)$ (see *SI Appendix* for details). Dashed lines are theoretical predictions from the single-delta ansatz of Eq. 14. Notice that this ansatz is incorrect for $Δ > Δ_{d}^{h}$, which is marked as a solid round dot. Here, $N = 800$.

**Fig. 3.**
Evolution of the population risk for the variable selection problem using a two-layer neural network with ReLU activations. Here $d = 320$, $s_{0} = 60$, and $N = 800$, and we used $ξ(t) = t^{-1/4}$ and $ε = 2 \times 10^{-4}$ to set the step size. Numerical simulations using SGD (one run per data point) are marked "+", and curves are solutions of the reduced PDE with $d = \infty$. (*Inset*) Evolution of three parameters of the reduced distribution ${\bar{ρ}}_{t}$ (average output weights $a$, average offsets $b$, and average $ℓ_{2}$ norm in the relevant subspace $r_{1}$) for the same setting.

**Fig. 4.**
Separating two isotropic Gaussians, with a nonmonotone activation function (see *Predicting Failure* for details). Here $N = 800$, $d = 320$, and $Δ = 0.5$. The main frame presents the evolution of the population risk along the SGD trajectory, starting from two different initializations of ${(w_{i}^{0})}_{i \leq N} \sim_{\mathrm{iid}} \mathsf{N}(0, κ^{2}/d \cdot I_{d})$ for either $κ = 0.1$ or $κ = 0.4$. In *Inset*, we plot the evolution of the average of $‖w‖_{2}$ for the same conditions. Symbols are empirical results. Continuous lines are predictions obtained with the reduced PDE (Eq. 13).
