Proc Natl Acad Sci U S A. 2021 Mar 2;118(9):e2015617118. doi: 10.1073/pnas.2015617118.

The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima

Yu Feng et al.

Abstract

Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in the weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, which is the opposite of the fluctuation-response relation (aka Einstein relation) in equilibrium statistical physics. To understand the inverse variance-flatness relation, we develop a phenomenological theory of SGD based on statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm. The effective temperature decreases with the landscape flatness, so the system seeks out (prefers) flat minima over sharp ones. Based on these insights, an algorithm with landscape-dependent constraints is developed to mitigate catastrophic forgetting efficiently when learning multiple tasks sequentially. In general, our work provides a theoretical framework to understand learning dynamics, which may eventually lead to better algorithms for different learning tasks.
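
As a concrete illustration of the kind of learning-dynamics data this analysis starts from, the sketch below records the flattened parameter vector after every SGD update of a small network. The tiny network, the random data, and the use of all parameters (rather than only the weights between the two hidden layers) are stand-ins, not the paper's exact setup; only the quoted hyperparameters B = 50 and α = 0.1 are taken from Fig. 1. The snapshot matrix W feeds the PCA sketch given after the Fig. 1 caption below.

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1000, 20)                     # toy inputs (stand-in data)
y = torch.randint(0, 2, (1000,))              # toy binary labels

model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # plain SGD, alpha = 0.1
loss_fn = nn.CrossEntropyLoss()

B = 50                                        # minibatch size
snapshots = []                                # one flattened parameter vector per SGD step
for step in range(500):
    idx = torch.randint(0, len(X), (B,))
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()
    with torch.no_grad():
        snapshots.append(torch.cat([p.flatten() for p in model.parameters()]).clone())

W = torch.stack(snapshots).numpy()            # (steps, N_params): input to the PCA analysis
print(W.shape)
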

Keywords: generalization; loss landscape; machine learning; statistical physics; stochastic gradient descent.

Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1. The PCA results and the drift–diffusion motion in SGD. (A) The rank-ordered variance σ_i² in different principal component (PC) directions i. For i ≥ 20, σ_i² decreases with i as a power law i^(−γ) with γ ≈ 2–3. (B) The normalized accumulative variance of the top (n − 1) PCs excluding i = 1. It reaches 90% at n = 35, which is much smaller than the total number of weights N = 2,500 between the two hidden layers. (C) The SGD weight trajectory projected onto the (θ_1, θ_2) plane. The persistent drift motion in θ_1 and the diffusive random motion in θ_2 are clearly shown. (D) The diffusive motion in the (θ_i, θ_j) plane with j > i (≫ 1) randomly chosen (i = 49 and j = 50 shown here). Unless otherwise stated, the hyperparameters used are B = 50 and α = 0.1.
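
A minimal numpy sketch of the PCA analysis behind Fig. 1: diagonalize the covariance of the recorded weight trajectory, read off the rank-ordered variances σ_i², and project the trajectory onto the PC coordinates θ_i(t). The synthetic trajectory below is only a placeholder; in practice W would be the matrix of recorded SGD weight snapshots (for example, the one assembled in the sketch after the abstract), restricted to the exploration phase.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder trajectory: T recorded weight vectors, one row per SGD step.
T, N = 1000, 2500
W = np.cumsum(rng.normal(scale=0.01, size=(T, N)), axis=0)

# Center the trajectory and diagonalize its covariance via SVD.
Wc = W - W.mean(axis=0)
U, S, Vt = np.linalg.svd(Wc, full_matrices=False)

# Rank-ordered variances sigma_i^2 along the PC directions (Fig. 1A).
sigma2 = S**2 / T

# Normalized accumulative variance of the PCs excluding i = 1 (Fig. 1B).
cum_var = np.cumsum(sigma2[1:]) / sigma2[1:].sum()
n_90 = int(np.argmax(cum_var >= 0.9)) + 2     # n such that the top (n - 1) PCs reach 90%

# PC coordinates theta_i(t) of the trajectory (Fig. 1 C and D).
theta = Wc @ Vt.T                             # column i is theta_{i+1}(t)
print(sigma2[:5], n_90, theta.shape)
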
Fig. 2. The loss function landscape and the inverse variance–flatness relation. (A) The loss function profile L_i along the i-th PCA direction. (B) The loss landscape (in log scale). L_i can be fitted better by an inverse Gaussian (the red dashed line) than by a quadratic function (the green dashed line). The definition of the flatness F_i (≡ θ_i^r − θ_i^l) is also shown (see text for details). (C and D) The flatness F_i for different PCA directions i (C) and the inverse relation between the variance σ_i² and the flatness F_i for different choices of minibatch size B and learning rate α (D).
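
The flatness F_i in Fig. 2 is the width θ_i^r − θ_i^l of the region around the solution where the one-dimensional loss profile along the i-th PCA direction stays low. A hedged sketch is below: the toy quadratic loss and the threshold factor c (a fixed multiple of the minimum loss value) are assumptions, since the paper's exact criterion for θ_i^l and θ_i^r is given in its text.

import numpy as np

def loss(w):
    # Toy anisotropic quadratic loss standing in for the training loss L(w);
    # the small offset keeps the minimum value nonzero.
    curv = np.array([10.0, 1.0, 0.1])
    return 0.5 * np.sum(curv * w**2) + 1e-3

def flatness(loss_fn, w_star, v_i, c=np.e, span=5.0, num=2001):
    """Width theta_i^r - theta_i^l of the interval around w_star (along the
    unit direction v_i) where the 1D loss profile stays below c times its
    minimum value.  The factor c is an assumed stand-in for the paper's
    exact criterion."""
    deltas = np.linspace(-span, span, num)
    profile = np.array([loss_fn(w_star + d * v_i) for d in deltas])
    inside = deltas[profile <= c * profile.min()]
    return inside.max() - inside.min()

w_star = np.zeros(3)
for i, v_i in enumerate(np.eye(3)):           # stand-ins for PCA directions p_i
    print(f"F_{i + 1} = {flatness(loss, w_star, v_i):.3f}")

Flatter directions (smaller curvature in the toy loss) come out with larger F_i, which is the ordering the figure relies on.
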
Fig. 3. Statistical properties of the MLF (minibatch loss function) ensemble. (A) Profiles of the overall loss function ln(L_i) (red line) and a set of randomly chosen MLFs ln(L_i^μ) (blue dashed lines) in a given PCA direction i. (B) The inverse dependence of D_i and τ_i on the flatness F_i.
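
A sketch of how the MLF profiles in Fig. 3A can be sampled numerically: evaluate the loss of each minibatch, and of the full data set, along a fixed PCA direction through the solution. The toy least-squares problem, the scan range, and the number of sampled minibatches below are stand-ins for the paper's network and data.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))                         # toy data set
y = X @ rng.normal(size=20)                            # toy targets
w_star = np.linalg.lstsq(X, y, rcond=None)[0]          # stand-in for an SGD solution
v_i = np.eye(20)[0]                                    # stand-in for the i-th PCA direction

def batch_loss(w, idx):
    r = X[idx] @ w - y[idx]
    return 0.5 * np.mean(r**2)

deltas = np.linspace(-1.0, 1.0, 101)                   # assumed scan range
batches = [rng.choice(len(X), 50, replace=False) for _ in range(10)]   # B = 50

# One loss profile per minibatch, L_i^mu (blue dashed lines in Fig. 3A),
# plus the overall loss profile L_i (red line).
mlf_profiles = np.array([[batch_loss(w_star + d * v_i, b) for d in deltas]
                         for b in batches])
full_profile = np.array([batch_loss(w_star + d * v_i, np.arange(len(X)))
                         for d in deltas])
print(mlf_profiles.shape, full_profile.shape)
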
Fig. 4. The landscape-dependent constraints for avoiding catastrophic forgetting. (A) The test errors for task 1 (ε_1) and task 2 (ε_2) versus training time for task 2 in the absence of the constraints (λ = 0). (B) The weight displacements q_i in different PCA directions p_i^1 from task 1 in the absence of the constraints (λ = 0). (C and D) The same as A and B but in the presence of the constraints with λ = 10 and N_c = 200. The red dashed line in D shows the upper bound q_i ≤ 0.008 F_i^1 for the modes (i ≤ N_c) that are under constraint. (E) The tradeoff between the saturated test errors (ε_1 and ε_2) when varying λ for the LDC (landscape-dependent constraint; blue circles) and EWC (elastic weight consolidation; red squares) algorithms. (F) The overall performance (ε_1 + ε_2) versus the number of constraints N_c for the LDC (blue circles) and EWC (red squares) algorithms. The two tasks are classifying two separate digit pairs [(0,1) for task 1 and (2,3) for task 2] in MNIST.
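
The LDC idea in Fig. 4 is to restrict, while training on task 2, the weight displacement q_i along each of the top N_c PCA directions p_i^1 of task 1, with the allowed displacement set by the task-1 flatness F_i^1. The sketch below uses a quadratic penalty whose strength scales as 1/(F_i^1)²; this specific functional form, and the defaults λ = 10 and N_c = 200 taken from the figure, are assumptions rather than the paper's exact constraint, which is specified in its text.

import numpy as np

def ldc_penalty_grad(w, w1_star, pc_dirs, flatness, lam=10.0, Nc=200):
    """Gradient of an assumed landscape-dependent penalty
    lam * sum_{i<=Nc} (q_i / F_i^1)^2, with q_i = (w - w1_star) . p_i^1.
    Sharper task-1 directions (smaller F_i^1) are penalized more strongly."""
    P = np.asarray(pc_dirs)[:Nc]              # (Nc, N) unit directions p_i^1 from task 1
    F = np.asarray(flatness)[:Nc]             # (Nc,) task-1 flatness values F_i^1
    q = P @ (w - w1_star)                     # displacements q_i along p_i^1
    return 2.0 * lam * P.T @ (q / F**2)

# Toy usage with random stand-ins.  Inside an SGD step on task 2 one would use
#   w -= alpha * (grad_task2 + ldc_penalty_grad(w, w1_star, pc_dirs, flatness))
rng = np.random.default_rng(0)
N = 400
w, w1_star = rng.normal(size=N), rng.normal(size=N)
pc_dirs = np.linalg.qr(rng.normal(size=(N, N)))[0].T[:200]    # orthonormal rows
flatness = np.sort(rng.uniform(0.1, 5.0, size=200))
print(ldc_penalty_grad(w, w1_star, pc_dirs, flatness).shape)
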
Fig. 5. Profiles and dynamics of the anisotropic active temperature. (A) The active temperature profile T_i(δθ, t) in the i-th PCA direction at t = 200. (B) The minimum active temperature T_i(0) in different PCA directions i. Inset shows the inverse dependence of T_i on the flatness F_i. (C) The active temperature profiles T_i(δθ, t) at different times for i = 10. (D) The active temperature T_i for all directions decreases with time in sync with the loss function (red line) dynamics. The shaded region highlights the transition between the fast-learning phase and the exploration phase. Inset shows the correlation between T_i and L.
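
The anisotropic "active temperature" in Fig. 5 comes from the paper's MLF-based theory, but its qualitative ingredient, a direction-dependent SGD noise strength, can be probed numerically. The sketch below simply measures the variance of minibatch gradients projected onto the PCA directions at a fixed weight vector; it is a rough proxy for that ingredient, not the paper's definition of T_i(δθ, t).

import numpy as np

def noise_strength_per_direction(minibatch_grads, pc_dirs):
    """minibatch_grads: (M, N) gradients from M minibatches at a fixed weight vector.
    pc_dirs: (K, N) unit PCA directions.
    Returns the variance of the projected gradient along each direction."""
    proj = np.asarray(minibatch_grads) @ np.asarray(pc_dirs).T   # (M, K)
    return proj.var(axis=0)

# Toy usage with random stand-in gradients whose fluctuations differ by direction:
rng = np.random.default_rng(4)
grads = rng.normal(size=(200, 100)) * np.linspace(2.0, 0.1, 100)
dirs = np.eye(100)
print(noise_strength_per_direction(grads, dirs)[:5])
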
Fig. 6. The flatness spectrum and the effective dimension (D_s) of the solution. (A) The flatness spectra (rank-ordered flatness) for networks with different widths (H). (B) The effective dimension of the solution D_s, defined as the number of directions whose flatness is below a threshold set to be roughly half of the L2 norm of the weights (the dashed line in A), increases only weakly as the number of parameters (weights) N_p (∝ H²) increases. The error bars are obtained from 10 different solutions generated by 10 random initializations with the same norm for each network size.
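
A short sketch of the effective-dimension count D_s used in Fig. 6: given a flatness spectrum and a solution weight vector, count the directions whose flatness falls below a threshold of roughly half the L2 norm of the weights, as stated in the caption. The toy numbers below are placeholders.

import numpy as np

def effective_dimension(flatness_spectrum, w_star):
    """Count the directions whose flatness is below half the L2 norm of the
    solution weights (the threshold quoted in the Fig. 6 caption)."""
    threshold = 0.5 * np.linalg.norm(w_star)
    return int(np.sum(np.asarray(flatness_spectrum) < threshold))

# Toy usage with made-up numbers:
rng = np.random.default_rng(2)
F = np.sort(rng.lognormal(mean=0.0, sigma=1.0, size=500))     # stand-in flatness spectrum
w_star = 0.05 * rng.normal(size=2500)                         # stand-in solution weights
print(effective_dimension(F, w_star))
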
