Implicit bias of optimization algorithms and generalization of over-parameterized neural networks

Job Candidate Talk
Monday, February 6, 2023 - 2:00pm for 1 hour (actually 50 minutes)
Skiles 005, and
Chao Ma – Stanford University
Molei Tao

Please Note: Speaker will be in person, but also livestreamed but not recorded at

Modern neural networks are usually over-parameterized—the number of parameters exceeds the number of training data. In this case the loss function tends to have many (or even infinite) global minima, which imposes a challenge of minima selection on optimization algorithms besides the convergence. Specifically, when training a neural network, the algorithm not only has to find a global minimum, but also needs to select minima with good generalization among many others. We study the mechanisms that facilitate global minima selection of optimization algorithms, as well as its connection with good generalization performance. First, with a linear stability theory, we show that stochastic gradient descent (SGD) favors global minima with flat and uniform landscape. Then, we build a theoretical connection of flatness and generalization performance based on a special multiplicative structure of neural networks. Connecting the two results, we develop generalization bounds for neural networks trained by SGD. Our bounds take the optimization process into consideration. Furthermore, we study the behavior of optimization algorithms around manifold of minima and reveal the exploration of algorithms from one minimum to another.