CS 331 Stochastic Gradient Descent Methods

Stochastic gradient descent (SGD), in one or another of its many variants, is the workhorse method for training modern supervised machine learning models. However, the world of SGD methods is vast and expanding, which makes it hard for practitioners and even experts to understand its landscape and inhabitants. This course is a mathematically rigorous and comprehensive introduction to the field, based on the latest results and insights. The course develops a convergence and complexity theory for serial, parallel, and distributed variants of SGD in the strongly convex, convex, and nonconvex settings, with randomness coming from sources such as subsampling and compression. Additional topics, such as acceleration via Nesterov momentum or curvature information, will be covered as well. A substantial part of the course offers a unified analysis of a large family of SGD variants that have so far required different intuitions and convergence analyses, have different applications, and have been developed separately in various communities. This framework includes methods with and without the following tricks, and their combinations: variance reduction, data sampling, coordinate sampling, arbitrary sampling, importance sampling, mini-batching, quantization, sketching, dithering, and sparsification.

Credits

Prerequisites

1. Strong experience with at least one high-level computing language (e.g., Python, Julia, C, MATLAB)
2. Mathematical maturity (i.e., ability to comprehend and generate proofs)
3. Linear algebra (abstract vector spaces, linear independence, basis, linear operators, quadratic forms, Euclidean spaces, inner product, norm, ...)
4. Matrix theory (matrices, determinants, singular values, eigenvalues, matrix decompositions, ...)
5. Multivariate calculus (gradient, Hessian, Taylor approximation, chain rule, ...)
6. Probability theory (probability spaces, expectation, law of large numbers, tower property of expectation, ...)