In recent years, non-convex optimization problems have increasingly been described by the generalized $(L_0,L_1)$-smoothness assumption rather than the standard one.
Meanwhile, the severely corrupted data used in these problems has increased the demand for methods capable of handling heavy-tailed noise, i.e., noise with a bounded $\kappa$-th moment. Motivated by these real-world trends and challenges, we explore sign-based methods in this setup and demonstrate their effectiveness in comparison with other popular solutions such as clipping and normalization.
In theory, we prove the first-known high-probability convergence bounds under $(L_0,L_1)$-smoothness and heavy-tailed noise with mild parameter dependencies. In the case of standard smoothness, these bounds are novel for sign-based methods as well. In particular, SignSGD with batching achieves sample complexity $\tilde{O}\left(\left(\frac{\Delta L_0 d}{\varepsilon^2} + \frac{\Delta L_1 d^{\frac{3}{2}}}{\varepsilon}\right)\left[1 + \left(\frac{\sigma}{\varepsilon}\right)^{\frac{\kappa}{\kappa-1}}\right]\right)$ for $\kappa \in (1,2]$. Under the assumption of symmetric noise, SignSGD with Majority Voting works robustly over the whole range $\kappa \in (0,2]$ with complexity $\tilde{O}\left(\left(\frac{\Delta L_0 d}{\varepsilon^2} + \frac{\Delta L_1 d^{\frac{3}{2}}}{\varepsilon}\right)\left[\frac{1}{\kappa^2} + \frac{\sigma^2}{\varepsilon^2}\right]\right)$. We also obtain results for
parameter-agnostic setups, Polyak-Lojasiewicz functions, and momentum-based methods (in expectation). Our theoretical findings are supported by the superior performance of sign-based methods, compared to clipping and normalization, in training Large Language Models.
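
For intuition, here is a minimal NumPy sketch of the two sign-based updates mentioned above. The objective, batch size, worker count, and step size are hypothetical placeholders for illustration only, not the configuration used in the paper's experiments.

```python
import numpy as np

def sign_sgd_batching(grad_fn, x0, batch_size=64, lr=1e-2, steps=1000, rng=None):
    """SignSGD with batching: step along the sign of a mini-batch gradient estimate.

    grad_fn(x, rng) returns one stochastic gradient sample; averaging batch_size
    samples mitigates heavy-tailed noise before taking the sign."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = np.mean([grad_fn(x, rng) for _ in range(batch_size)], axis=0)
        x -= lr * np.sign(g)
    return x

def sign_sgd_majority_vote(grad_fn, x0, n_workers=16, lr=1e-2, steps=1000, rng=None):
    """SignSGD with Majority Voting: each worker contributes the sign of its own
    stochastic gradient; the update follows the coordinate-wise majority sign."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        votes = np.sum([np.sign(grad_fn(x, rng)) for _ in range(n_workers)], axis=0)
        x -= lr * np.sign(votes)
    return x

# Toy usage: quadratic objective with heavy-tailed (Student-t) gradient noise,
# i.e., noise whose kappa-th moment is bounded only for small kappa.
if __name__ == "__main__":
    def noisy_grad(x, rng):
        return 2 * x + rng.standard_t(df=1.5, size=x.shape)

    print(sign_sgd_batching(noisy_grad, np.ones(10)))
    print(sign_sgd_majority_vote(noisy_grad, np.ones(10)))
```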