Proc Natl Acad Sci U S A. 2020 Dec 1;117(48):30039-30045. doi: 10.1073/pnas.1907369117. Epub 2020 Jun 9.

Theoretical issues in deep networks


Tomaso Poggio et al. Proc Natl Acad Sci U S A. 2020.

Abstract

While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about the approximation power of deep networks, the dynamics of optimization, and good out-of-sample performance despite overparameterization and the absence of explicit regularization. We review our recent results toward this goal. In approximation theory, both shallow and deep networks are known to approximate any continuous function, but only at a cost that is exponential in the input dimension. However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality. In characterizing minimization of the empirical exponential loss we consider the gradient flow of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to normalized networks. The dynamics of the normalized weights turn out to be equivalent to those of the constrained problem of minimizing the loss subject to a unit-norm constraint. In particular, the dynamics of typical gradient descent have the same critical points as the constrained problem. Thus there is implicit regularization in training deep networks under exponential-type loss functions during gradient flow. As a consequence, the critical points correspond to minimum-norm infima of the loss. This result is especially relevant because it has recently been shown that, for overparameterized models, selection of a minimum-norm solution optimizes cross-validation leave-one-out stability and thereby the expected error. Thus our results imply that gradient descent in deep networks minimizes the expected error.
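
As a rough illustration of the constrained formulation mentioned above, here is a minimal sketch in assumed notation (the symbols ρ, V_k, and the loss L̂ are our own labels, not quoted from the paper): a ReLU network without biases is positively homogeneous in its layer weights W_1, …, W_K, so it can be rewritten in terms of normalized layers, and gradient flow on the exponential loss can then be compared with the explicitly constrained problem.

    f(W; x) = \rho \, f(V; x), \qquad \rho = \prod_{k=1}^{K} \|W_k\|, \qquad V_k = W_k / \|W_k\|,

    \hat{L}(V) = \sum_{n=1}^{N} e^{-\rho\, y_n f(V; x_n)}, \qquad \min_{V} \hat{L}(V) \ \text{ subject to } \ \|V_k\| = 1,\ k = 1, \dots, K.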

Keywords: approximation; deep learning; generalization; machine learning; optimization.


Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
The top graphs are associated with functions; each bottom diagram (Insets) depicts the ideal network approximating the function above it. (Inset A) A shallow universal network in 8 variables and N units approximates a generic function of 8 variables f(x_1, …, x_8). (Inset B) A hierarchical network in n = 8 variables, which approximates well functions of the form f(x_1, …, x_8) = h_3(h_21(h_11(x_1, x_2), h_12(x_3, x_4)), h_22(h_13(x_5, x_6), h_14(x_7, x_8))), as represented by the binary graph above it. In the approximating network each of the n − 1 nodes in the graph of the function corresponds to a set of ReLU units. Like the shallow network, a hierarchical network is universal; that is, it can approximate any continuous function. The text discusses how it can approximate a subclass of compositional functions exponentially better than a shallow network. Redrawn from ref. .
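
To make the compositional structure of Inset B concrete, the following is a minimal Python sketch (ours, with random untrained weights for illustration; the names relu_block and hierarchical_net are hypothetical, not from the paper) of a binary-tree network in which every node acts on only two inputs:

    import numpy as np

    rng = np.random.default_rng(0)

    def relu_block(x, width=8):
        # A tiny ReLU unit standing in for one node h_ij of the tree;
        # weights are drawn at random here purely for illustration (untrained).
        d = x.shape[-1]                              # d = 2 at every node
        W1 = rng.standard_normal((d, width))
        b1 = rng.standard_normal(width)
        w2 = rng.standard_normal(width)
        return np.maximum(x @ W1 + b1, 0.0) @ w2     # one scalar output per sample

    def hierarchical_net(x):
        # Binary-tree composition for n = 8 inputs, mirroring
        # f = h3(h21(h11, h12), h22(h13, h14)) from the caption above.
        h11 = relu_block(x[:, 0:2]); h12 = relu_block(x[:, 2:4])
        h13 = relu_block(x[:, 4:6]); h14 = relu_block(x[:, 6:8])
        h21 = relu_block(np.stack([h11, h12], axis=1))
        h22 = relu_block(np.stack([h13, h14], axis=1))
        return relu_block(np.stack([h21, h22], axis=1))

    x = rng.standard_normal((5, 8))                  # 5 samples, 8 variables
    print(hierarchical_net(x).shape)                 # -> (5,)

Because each node sees only 2 variables, the number of units needed per node does not grow with the full input dimension n, which is the intuition behind the exponential advantage discussed in the text.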
Fig. 2.
Empirical evidence of a small generalization gap for normalized networks with respect to the cross-entropy loss. The left graph shows testing vs. training cross-entropy loss for networks each trained on the same dataset (CIFAR-10) (46) but with different initializations, yielding zero classification error on the training set but different testing errors. The right graph shows the same data, that is, testing vs. training loss for the same networks, now normalized by dividing each weight by the Frobenius norm of its layer. Note that all points have zero classification error at training. The red circle on the top right refers to a network trained on the same CIFAR-10 dataset but with randomized labels; it shows zero classification error at training and test error at chance level. The top line is a square-loss regression of slope 1 with positive intercept. The bottom line is the diagonal at which training and test loss are equal. The networks are three-layer convolutional networks. The left graph can be considered a visualization of generalization bounds when the Rademacher complexity is not controlled. The right graph is a visualization of the same relation for normalized networks, that is, L(ρ f(V;x)) ≤ L̂(ρ f(V;x)) + c_1 ρ R_N(F) + c_2 √(ln(1/δ)/(2N)) for ρ = 1. Under our conditions on N and on the architecture of the network, the terms c_1 R_N(F) + c_2 √(ln(1/δ)/(2N)) represent a small offset. Note that the exponential loss is not a good proxy for the classification loss for large ρ; empirically for these datasets, the test exponential loss with ρ = 1 provides a ranking that approximately agrees with the test classification ranking. From ref. .
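
For concreteness, here is a minimal Python sketch (ours, not the authors' code; normalize_layers is a hypothetical helper) of the layer-wise normalization used for the right-hand plot: each layer's weight matrix is divided by its Frobenius norm, which for bias-free ReLU networks rescales the output by ρ > 0 and leaves the classification unchanged.

    import numpy as np

    def normalize_layers(weights, eps=1e-12):
        # Return V_k = W_k / ||W_k||_F for every layer, plus rho = prod_k ||W_k||_F.
        norms = [np.linalg.norm(W) for W in weights]          # Frobenius norms
        rho = float(np.prod(norms))
        V = [W / max(n, eps) for W, n in zip(weights, norms)]
        return V, rho

    # Toy three-layer weights standing in for a trained network's parameters.
    rng = np.random.default_rng(1)
    weights = [rng.standard_normal(shape) for shape in [(8, 16), (16, 16), (16, 1)]]
    V, rho = normalize_layers(weights)
    print([round(float(np.linalg.norm(Vk)), 6) for Vk in V], round(rho, 3))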
Fig. 3.
Monotonic increase of the margin: the growth of the margin min_n y_n f(V; x_n) in the binary case. The network converges to 100% training accuracy in epoch 3 and changes its support vectors three times (red circles; the numbers above them correspond to the index of the data point that becomes the support vector) before stabilizing. As predicted by theory, the margin is nondecreasing. Note that the rate of margin growth accelerates as the support vectors change.
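
As an illustration of the quantity plotted here, the following is a minimal Python sketch (ours; margin is a hypothetical helper, and a fixed linear score stands in for the normalized network f(V; x)) that computes min_n y_n f(V; x_n) together with the index of the data point attaining it:

    import numpy as np

    def margin(f_normalized, X, y):
        # X: (N, d) inputs; y: (N,) labels in {-1, +1};
        # f_normalized returns the normalized-network output f(V; x) for each row of X.
        scores = y * f_normalized(X)          # y_n * f(V; x_n) for every n
        n_star = int(np.argmin(scores))       # data point achieving the margin
        return float(scores[n_star]), n_star

    # Toy example with a linear score in place of the normalized network.
    rng = np.random.default_rng(2)
    X = rng.standard_normal((20, 5))
    w = rng.standard_normal(5)
    y = np.sign(X @ w)                        # labels consistent with the score
    m, n_star = margin(lambda Z: Z @ w / np.linalg.norm(w), X, y)
    print(f"margin = {m:.4f}, attained at data point {n_star}")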

References

    1. Niyogi P., Girosi F., On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Comput. 8, 819–842 (1996).
    2. Poggio T., Smale S., The mathematics of learning: Dealing with data. Not. Am. Math. Soc. 50, 537–544 (2003).
    3. Poggio T., Mhaskar H., Rosasco L., Miranda B., Liao Q., Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Int. J. Autom. Comput. 14, 503–519 (2017).
    4. Banburski A., et al., Theory of Deep Learning III: Dynamics and Generalization in Deep Networks (Center for Brains, Minds and Machines [CBMM], Cambridge, MA, 2019).
    5. Poggio T., Liao Q., Banburski A., Complexity control by gradient descent in deep networks. Nat. Commun. 11, 1027 (2020).
