Bits and Bytes

Shreyas Srivastava

19 June 2023

Deep Learning Book (Goodfellow), Chapter 8: Optimization


Empirical risk minimization

Optimization problem formulation: minimize the empirical risk, the expected loss under the empirical distribution defined by the training set,

\[\mathbb{E}_{(\mathbf{x}, \mathrm{y}) \sim \hat{p}_{\text{data}}} L(f(\boldsymbol{x} ; \boldsymbol{\theta}), y)=\frac{1}{m} \sum_{i=1}^{m} L\left(f\left(\boldsymbol{x}^{(i)} ; \boldsymbol{\theta}\right), y^{(i)}\right)\]

where $L$ is the per-example loss, $f(\boldsymbol{x} ; \boldsymbol{\theta})$ is the predicted output for input $\boldsymbol{x}$, and $m$ is the training set size.

Gradient of the log-likelihood

For maximum likelihood, the objective is an expectation over the training set, and so is its gradient:

\[\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})=\mathbb{E}_{(\mathbf{x}, \mathrm{y}) \sim \hat{p}_{\text{data}}} \nabla_{\boldsymbol{\theta}} \log p_{\text{model}}(\boldsymbol{x}, y ; \boldsymbol{\theta})\]

Computing this expectation over the full training set is expensive, so minibatch algorithms estimate it from a small random sample of examples.
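
As a minimal sketch (not from the notes: the logistic-regression model, the synthetic data, and the batch size of 128 are all illustrative assumptions), the minibatch estimator of this gradient looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logistic-regression data (illustrative only).
m, d = 10_000, 5
X = rng.normal(size=(m, d))
theta_true = rng.normal(size=d)
y = (rng.random(m) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

def log_likelihood_grad(theta, X, y):
    """Average gradient of log p_model(y | x; theta) for logistic regression."""
    p = 1 / (1 + np.exp(-X @ theta))
    return X.T @ (y - p) / len(y)

theta = np.zeros(d)

# Full-batch gradient: the exact expectation over the empirical distribution.
g_full = log_likelihood_grad(theta, X, y)

# Minibatch estimator: the same expectation, estimated from a random sample.
idx = rng.choice(m, size=128, replace=False)
g_mini = log_likelihood_grad(theta, X[idx], y[idx])

print("full-batch:", g_full)
print("minibatch :", g_mini)  # unbiased but noisy estimate of g_full
```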

Practical considerations of batch size: larger batches give a more accurate gradient estimate, but with less-than-linear returns, since the standard error of the estimate falls only as $1/\sqrt{m}$; very small batches can have a regularizing effect but under-utilize multicore hardware; and memory usage typically scales with batch size.
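
A small sketch of the $1/\sqrt{m}$ effect (the per-example "gradients" below are synthetic scalars standing in for a real model): quadrupling the batch size only halves the estimate's standard error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-example gradients: noisy scalar observations of a true mean gradient.
true_grad = 2.0
per_example = true_grad + rng.normal(scale=4.0, size=1_000_000)

for m in [16, 64, 256, 1024]:
    # Standard deviation of minibatch gradient estimates of size m.
    batches = per_example[: (len(per_example) // m) * m].reshape(-1, m)
    estimates = batches.mean(axis=1)
    print(f"batch size {m:4d}: std of estimate = {estimates.std():.4f}")
# Each 4x increase in batch size halves the error: less-than-linear returns.
```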

Challenges in Neural Network Optimization

Ill-Conditioning

When the Hessian $\boldsymbol{H}$ of the cost is ill-conditioned, gradient descent struggles: a second-order Taylor expansion predicts that a gradient step of $-\epsilon \boldsymbol{g}$ moves the cost to approximately

$$f\left(\boldsymbol{x}^{(0)}-\epsilon \boldsymbol{g}\right) \approx f\left(\boldsymbol{x}^{(0)}\right)-\epsilon \boldsymbol{g}^{\top} \boldsymbol{g}+\frac{1}{2} \epsilon^{2} \boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g}$$

Ill-conditioning becomes a problem when $\frac{1}{2} \epsilon^{2} \boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g}$ exceeds $\epsilon \boldsymbol{g}^{\top} \boldsymbol{g}$: the step then increases the cost, forcing the learning rate to shrink even while the gradient is still large.
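
A minimal NumPy sketch (the quadratic, its condition number of 1000, and the learning rate are made-up illustrations) comparing the two Taylor terms:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T H x, condition number 1000.
H = np.diag([1.0, 1000.0])
x = np.array([1.0, 1.0])
g = H @ x                                # gradient of f at x
eps = 0.01                               # learning rate

first_order = eps * g @ g                # predicted decrease in cost
curvature = 0.5 * eps**2 * g @ H @ g     # curvature correction

print(f"eps g^T g         = {first_order:.1f}")
print(f"0.5 eps^2 g^T H g = {curvature:.1f}")
# When the curvature term dominates, this gradient step *increases* the cost.
```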

Local minima

Neural network cost functions are non-convex and contain many local minima, partly due to model non-identifiability: weight-space symmetry means permuting a layer's hidden units yields a different parameter vector computing exactly the same function, so any minimum comes with many equivalent copies. In practice, most local minima appear to have low cost, making them less of a problem than once feared.
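
A quick sketch of weight-space symmetry (toy one-hidden-layer network with made-up sizes): permuting hidden units changes the parameters but not the function.

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny one-hidden-layer network: y = W2 @ tanh(W1 @ x).
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=3)

def forward(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

# Permute the hidden units: reorder rows of W1 and the matching columns of W2.
perm = [2, 0, 3, 1]
print(forward(W1, W2, x))                 # original parameters
print(forward(W1[perm], W2[:, perm], x))  # permuted parameters, same output
```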

Plateaus, Saddle Points and Other Flat Regions

In high-dimensional problems, saddle points, where the gradient is zero but the Hessian has both positive and negative eigenvalues, are far more common than local minima. Gradient descent empirically escapes them, while methods that jump to points of zero gradient, such as Newton's method, can be attracted to them. Wide, flat plateaus with near-zero gradient are also costly to traverse.
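
A minimal sketch (hand-picked saddle $f(x, y) = x^2 - y^2$, made-up start point and step size) showing gradient descent drifting away from a saddle while a Newton step lands exactly on it:

```python
import numpy as np

# f(x) = x0^2 - x1^2 has a saddle point at the origin.
grad = lambda x: np.array([2 * x[0], -2 * x[1]])
H_inv = np.diag([0.5, -0.5])             # inverse Hessian (constant here)

x0 = np.array([1.0, 1e-3])               # start slightly off the saddle's axis

x_gd = x0.copy()
for _ in range(50):
    x_gd = x_gd - 0.1 * grad(x_gd)       # gradient descent steps

x_newton = x0 - H_inv @ grad(x0)         # one Newton step

print("gradient descent:", x_gd)         # second coordinate grows: escapes
print("newton step:     ", x_newton)     # lands exactly on the saddle (0, 0)
```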

Cliffs and exploding gradients

Highly nonlinear cost functions can contain cliff structures where the gradient becomes extremely large, so a single gradient step can catapult the parameters far from the current solution. Gradient clipping caps the gradient's norm before the update, keeping the descent direction while limiting the step size.
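
A minimal norm-clipping sketch (the `max_norm` threshold is a made-up hyperparameter; frameworks ship equivalents, e.g. PyTorch's `clip_grad_norm_`):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([300.0, -400.0])           # exploding gradient near a cliff (norm 500)
print(clip_by_norm(g, max_norm=5.0))    # [ 3. -4.], norm 5, same direction
```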

Long-term dependencies

Very deep computational graphs that repeatedly apply the same operation, as recurrent networks apply the same weight matrix $\boldsymbol{W}$ at every time step, face vanishing and exploding gradients: after $t$ steps the computation involves $\boldsymbol{W}^{t}$, so its effect scales with the eigenvalues of $\boldsymbol{W}$ raised to the power $t$, and gradients either shrink to zero or blow up exponentially.
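
A minimal sketch (random matrices rescaled to chosen spectral radii, made-up sizes) of how repeated multiplication by the same matrix drives vector norms to zero or infinity:

```python
import numpy as np

rng = np.random.default_rng(3)

def norm_after_repeats(spectral_radius, steps=100):
    """Norm of v after `steps` multiplications by a W with this spectral radius."""
    W = rng.normal(size=(10, 10))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))  # rescale eigenvalues
    v = np.ones(10)
    for _ in range(steps):
        v = W @ v
    return np.linalg.norm(v)

print("spectral radius 0.9:", norm_after_repeats(0.9))  # vanishes toward zero
print("spectral radius 1.1:", norm_after_repeats(1.1))  # explodes exponentially
```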

Inexact gradients

Minibatch sampling gives only a noisy, and in some cases biased, estimate of the true gradient, and some objectives are themselves intractable and approximated by surrogates; optimization algorithms for deep learning must therefore tolerate inexact gradients, for example by decaying the learning rate over time.
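
A minimal sketch (toy 1-D quadratic, made-up noise level, a $1/k$ learning-rate decay) of SGD converging despite noisy gradient estimates:

```python
import numpy as np

rng = np.random.default_rng(4)

theta = 10.0                              # minimize f(theta) = 0.5 * theta^2
for k in range(1, 10_001):
    g = theta + rng.normal(scale=2.0)     # exact gradient plus sampling noise
    eps_k = 1.0 / k                       # decaying learning rate
    theta -= eps_k * g

print(theta)  # close to the true minimum at 0 despite the inexact gradients
```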
