Proc Natl Acad Sci U S A. 2020 Dec 1;117(48):30039-30045. doi: 10.1073/pnas.1907369117. Epub 2020 Jun 9.

Theoretical issues in deep networks


Tomaso Poggio et al. Proc Natl Acad Sci U S A. 2020.

Abstract

While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about the approximation power of deep networks, the dynamics of optimization, and good out-of-sample performance despite overparameterization and the absence of explicit regularization. We review our recent results toward this goal. In approximation theory, both shallow and deep networks are known to approximate any continuous function, but only at a cost that is exponential in the input dimension. However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality. In characterizing minimization of the empirical exponential loss we consider the gradient flow of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to normalized networks. The dynamics of the normalized weights turn out to be equivalent to those of the constrained problem of minimizing the loss subject to a unit-norm constraint. In particular, the dynamics of typical gradient descent have the same critical points as the constrained problem. Thus there is implicit regularization in training deep networks under exponential-type loss functions during gradient flow. As a consequence, the critical points correspond to minimum-norm infima of the loss. This result is especially relevant because it has recently been shown that, for overparameterized models, selection of a minimum-norm solution optimizes cross-validation leave-one-out stability and thereby the expected error. Thus our results imply that gradient descent in deep networks minimizes the expected error.
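
As a rough illustration of the constrained formulation mentioned above, here is a minimal sketch in assumed notation (the symbols ρ, V_k, and the loss L̂ are our own labels, not quoted from the paper): a ReLU network without biases is positively homogeneous in its layer weights W_1, …, W_K, so it can be rewritten in terms of normalized layers, and gradient flow on the exponential loss can then be compared with the explicitly constrained problem.

    f(W; x) = \rho \, f(V; x), \qquad \rho = \prod_{k=1}^{K} \|W_k\|, \qquad V_k = W_k / \|W_k\|,

    \hat{L}(V) = \sum_{n=1}^{N} e^{-\rho\, y_n f(V; x_n)}, \qquad \min_{V} \hat{L}(V) \ \text{ subject to } \ \|V_k\| = 1,\ k = 1, \dots, K.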

Keywords: approximation; deep learning; generalization; machine learning; optimization.


Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
The top graphs are associated with functions; each bottom diagram (Insets) depicts the ideal network approximating the function above it. (Inset A) A shallow universal network in 8 variables and N units approximates a generic function of 8 variables f(x_1, …, x_8). (Inset B) A hierarchical network in n = 8 variables, which approximates well functions of the form f(x_1, …, x_8) = h_3(h_21(h_11(x_1, x_2), h_12(x_3, x_4)), h_22(h_13(x_5, x_6), h_14(x_7, x_8))), as represented by the binary graph above it. In the approximating network each of the n − 1 nodes in the graph of the function corresponds to a set of ReLU units. Like the shallow network, a hierarchical network is universal; that is, it can approximate any continuous function. The text discusses how it can approximate a subclass of compositional functions exponentially better than a shallow network. Redrawn from ref. .
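
To make the compositional structure of Inset B concrete, the following is a minimal Python sketch (ours, with random untrained weights for illustration; the names relu_block and hierarchical_net are hypothetical, not from the paper) of a binary-tree network in which every node acts on only two inputs:

    import numpy as np

    rng = np.random.default_rng(0)

    def relu_block(x, width=8):
        # A tiny ReLU unit standing in for one node h_ij of the tree;
        # weights are drawn at random here purely for illustration (untrained).
        d = x.shape[-1]                              # d = 2 at every node
        W1 = rng.standard_normal((d, width))
        b1 = rng.standard_normal(width)
        w2 = rng.standard_normal(width)
        return np.maximum(x @ W1 + b1, 0.0) @ w2     # one scalar output per sample

    def hierarchical_net(x):
        # Binary-tree composition for n = 8 inputs, mirroring
        # f = h3(h21(h11, h12), h22(h13, h14)) from the caption above.
        h11 = relu_block(x[:, 0:2]); h12 = relu_block(x[:, 2:4])
        h13 = relu_block(x[:, 4:6]); h14 = relu_block(x[:, 6:8])
        h21 = relu_block(np.stack([h11, h12], axis=1))
        h22 = relu_block(np.stack([h13, h14], axis=1))
        return relu_block(np.stack([h21, h22], axis=1))

    x = rng.standard_normal((5, 8))                  # 5 samples, 8 variables
    print(hierarchical_net(x).shape)                 # -> (5,)

Because each node sees only 2 variables, the number of units needed per node does not grow with the full input dimension n, which is the intuition behind the exponential advantage discussed in the text.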
Fig. 2.
Empirical evidence of a small generalization gap for normalized networks with respect to the cross-entropy loss. The left graph shows testing vs. training cross-entropy loss for networks each trained on the same dataset (CIFAR-10) (46) but with different initializations, yielding zero classification error on the training set but different testing errors. The right graph shows the same data, that is, testing vs. training loss for the same networks, now normalized by dividing each weight by the Frobenius norm of its layer. Note that all points have zero classification error at training. The red circle on the top right refers to a network trained on the same CIFAR-10 dataset but with randomized labels; it shows zero classification error at training and test error at chance level. The top line is a square-loss regression of slope 1 with positive intercept. The bottom line is the diagonal at which training and test loss are equal. The networks are three-layer convolutional networks. The left graph can be considered a visualization of generalization bounds when the Rademacher complexity is not controlled. The right graph is a visualization of the same relation for normalized networks, that is, L(ρ f(V;x)) ≤ L̂(ρ f(V;x)) + c_1 ρ R_N(F) + c_2 √(ln(1/δ)/(2N)) for ρ = 1. Under our conditions on N and on the architecture of the network, the terms c_1 R_N(F) + c_2 √(ln(1/δ)/(2N)) represent a small offset. Note that the exponential loss is not a good proxy for the classification loss for large ρ; empirically for these datasets, the test exponential loss with ρ = 1 provides a ranking that approximately agrees with the test classification ranking. From ref. .
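
For concreteness, here is a minimal Python sketch (ours, not the authors' code; normalize_layers is a hypothetical helper) of the layer-wise normalization used for the right-hand plot: each layer's weight matrix is divided by its Frobenius norm, which for bias-free ReLU networks rescales the output by ρ > 0 and leaves the classification unchanged.

    import numpy as np

    def normalize_layers(weights, eps=1e-12):
        # Return V_k = W_k / ||W_k||_F for every layer, plus rho = prod_k ||W_k||_F.
        norms = [np.linalg.norm(W) for W in weights]          # Frobenius norms
        rho = float(np.prod(norms))
        V = [W / max(n, eps) for W, n in zip(weights, norms)]
        return V, rho

    # Toy three-layer weights standing in for a trained network's parameters.
    rng = np.random.default_rng(1)
    weights = [rng.standard_normal(shape) for shape in [(8, 16), (16, 16), (16, 1)]]
    V, rho = normalize_layers(weights)
    print([round(float(np.linalg.norm(Vk)), 6) for Vk in V], round(rho, 3))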
Fig. 3.
Monotonic increase of the margin: the growth of the margin min_n y_n f(V; x_n) in the binary case. The network converges to 100% training accuracy in epoch 3 and changes its support vectors three times (red circles; the numbers above them correspond to the index of the data point that becomes the support vector) before stabilizing. As predicted by theory, the margin is nondecreasing. Note that the rate of margin growth accelerates as the support vectors change.
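
As an illustration of the quantity plotted here, the following is a minimal Python sketch (ours; margin is a hypothetical helper, and a fixed linear score stands in for the normalized network f(V; x)) that computes min_n y_n f(V; x_n) together with the index of the data point attaining it:

    import numpy as np

    def margin(f_normalized, X, y):
        # X: (N, d) inputs; y: (N,) labels in {-1, +1};
        # f_normalized returns the normalized-network output f(V; x) for each row of X.
        scores = y * f_normalized(X)          # y_n * f(V; x_n) for every n
        n_star = int(np.argmin(scores))       # data point achieving the margin
        return float(scores[n_star]), n_star

    # Toy example with a linear score in place of the normalized network.
    rng = np.random.default_rng(2)
    X = rng.standard_normal((20, 5))
    w = rng.standard_normal(5)
    y = np.sign(X @ w)                        # labels consistent with the score
    m, n_star = margin(lambda Z: Z @ w / np.linalg.norm(w), X, y)
    print(f"margin = {m:.4f}, attained at data point {n_star}")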

References

    1. Niyogi P., Girosi F., On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Comput. 8, 819–842 (1996).
    2. Poggio T., Smale S., The mathematics of learning: Dealing with data. Not. Am. Math. Soc. 50, 537–544 (2003).
    3. Poggio T., Mhaskar H., Rosasco L., Miranda B., Liao Q., Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Int. J. Autom. Comput. 14, 503–519 (2017).
    4. Banburski A., et al., Theory of Deep Learning III: Dynamics and Generalization in Deep Networks (Center for Brains, Minds and Machines [CBMM], Cambridge, MA, 2019).
    5. Poggio T., Liao Q., Banburski A., Complexity control by gradient descent in deep networks. Nat. Commun. 11, 1027 (2020).
