Bits and Bytes

Shreyas Srivastava

4 February 2023

Work-in-progress notes: CIFAR10 ResNet exploration


These study notes are based on the How to Train Your ResNet blog post series.

Learning rate schedule

[Figure: learn-rate]

When we try to increase the learning rate, it can degrade training in two ways.
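For reference, a minimal sketch of a piecewise-linear ("triangular") learning rate schedule in PyTorch is shown below. The knot positions, the 0.4 peak, and the stand-in linear model are illustrative assumptions, not values taken from the series.

```python
import numpy as np
import torch

# Piecewise-linear schedule: ramp up to a peak learning rate, then decay to zero.
# Knots, peak value, and the stand-in model are illustrative, not the series' values.
model = torch.nn.Linear(3 * 32 * 32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.9, weight_decay=5e-4)

epochs = 24
knot_epochs, knot_lrs = [0, 5, epochs], [0.0, 0.4, 0.0]

# With base lr=1.0, LambdaLR's multiplier is the learning rate itself.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: np.interp(e, knot_epochs, knot_lrs))

for epoch in range(epochs):
    # ... one epoch of training at optimizer.param_groups[0]["lr"] ...
    scheduler.step()
```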

Tuning learning rate and batch size

[Figure: batch-size]

The above graph shows:

We can see from the graph below that the learning rate has a drastic effect on the forgetting phenomenon. When the learning rate is very high, we get comparable validation loss with 1) the half-dataset configuration with no augmentations and 2) the full dataset with augmentations, which is unexpected.

[Figure: learn-rate-half]
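For concreteness, the two configurations being compared can be set up roughly as below. The transform choices and the "first 25k images" split are my assumptions for illustration, not the exact setup behind the plots.

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import Subset

augmented = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
plain = T.ToTensor()

# 1) half of the training set (first 25k images), no augmentations
half_no_aug = Subset(
    torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=plain),
    range(25_000))

# 2) full training set with augmentations
full_with_aug = torchvision.datasets.CIFAR10(
    "./data", train=True, download=True, transform=augmented)
```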

Weight decay in the presence of batch normalization

How should we interpret weight decay in the presence of batch normalization?

Weight decay is used as a regularization technique to lower the weight norm and prevent overfitting. However, when a convolution layer is followed by a batch normalization layer, its weights are rescaled by the batch norm layer, and the loss function becomes independent of the weight norm. The blog post series linked above nevertheless shows why weight decay serves an important function in the optimization process.
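The rescaling claim is easy to check numerically. The toy sketch below (my own, not code from the post) scales a convolution's weights by 2x and shows that the batch-norm output, and hence the loss, is unchanged:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 32, 32)

conv = nn.Conv2d(3, 16, 3, padding=1, bias=False)
bn = nn.BatchNorm2d(16)   # train mode: normalizes with the batch statistics

out_before = bn(conv(x))

with torch.no_grad():
    conv.weight.mul_(2.0)  # rescale the convolution weights by 2x

out_after = bn(conv(x))

# True: the 2x scale cancels in the normalization (up to BN's epsilon and float error)
print(torch.allclose(out_before, out_after, atol=1e-3))
```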

Essentially, the weight update can be split into a weight decay portion and a gradient descent step. Imagine rescaling the weights by 2x: the effective gradient becomes 2x smaller, yet in the 2x-scaled regime the update would need to be 2x larger to maintain a similar optimization trajectory.
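Written out (a minimal version, with learning rate $\eta$ and weight decay coefficient $\lambda$; the notation is mine, not the post's):

$$
w_{t+1} \;=\; w_t - \eta\bigl(\nabla L(w_t) + \lambda w_t\bigr) \;=\; \underbrace{(1-\eta\lambda)\,w_t}_{\text{weight decay}} \;-\; \underbrace{\eta\,\nabla L(w_t)}_{\text{gradient step}}
$$

and, since $L(\alpha w) = L(w)$ for any $\alpha > 0$ when the layer feeds into batch norm, differentiating both sides with respect to $w$ gives $\nabla L(2w) = \tfrac{1}{2}\,\nabla L(w)$.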

In the presence of batch normalization, the loss function is scale invariant with respect to the weights. As a qualitative argument, weight decay acts as a control mechanism and maintains the ratio between the weight norm and the gradient norm.
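To make the ratio argument concrete, the toy check below (again an illustrative sketch, not the post's code) doubles the convolution weights of a tiny conv → batch norm → linear head and measures the gradient norm on the convolution weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(8, 3, 32, 32)
target = torch.randint(0, 10, (8,))

def conv_grad_norm(scale):
    torch.manual_seed(0)  # identical initialization on every call
    conv = nn.Conv2d(3, 16, 3, padding=1, bias=False)
    bn = nn.BatchNorm2d(16)
    head = nn.Linear(16, 10)
    with torch.no_grad():
        conv.weight.mul_(scale)               # rescale the conv weights
    feats = bn(conv(x)).mean(dim=(2, 3))      # BN output is essentially unchanged by the rescaling
    loss = F.cross_entropy(head(feats), target)
    loss.backward()
    return conv.weight.grad.norm().item()

# Doubling the weights halves the gradient, so the weight-to-gradient ratio grows ~4x
# unless weight decay pulls the weight norm back down.
print(conv_grad_norm(1.0) / conv_grad_norm(2.0))  # ~2.0
```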
