Despite the development of many new optimization techniques, constant step-size stochastic gradient descent (SGD) remains the tool of choice in modern machine learning. It performs unreasonably well in practice, both in convex settings where there are poor guarantees, and also in non-convex settings, where there are often no guarantees. In this talk, we will give an argument as to why it works for low-rank models, illustrating the argument for the particular case of phase retrieval.
Phase retrieval is the problem of solving systems of rank-1 quadratic equations, and can be solved by optimizing a non-convex, non-smooth objective. In order to obtain guarantees, however, a spectral initialization method is first used in order to get a constant error estimate. We will show how SGD works even from arbitrary initialization. The key insight is that the low-rank nature of the signal allows us to study the dynamics as a Markov process with a two-dimensional phase space, and in this phase space, the effective step size decreases as the dimension increases. Our analysis also introduces and makes use of several new ideas from stochastic process theory, which we believe will be of use much more broadly.