Learning rate schedules and decay using Keras

The learning rate is an essential ingredient of an efficient model and has to be chosen wisely. In other words, it is one of the most important hyperparameters to tune when training a neural network. As a quick introduction, the learning rate is the value that scales the weight updates made during backpropagation with gradient descent; the weights are updated in order to minimize the loss function and obtain a better hypothesis, the details of which are beyond the scope of this article. The learning rate should be chosen so that training reaches a minimum (local or global) swiftly: too small a learning rate delays training by requiring many tiny weight updates, whereas too large a learning rate can step right over the lowest point of the loss function.

When a neural network is being trained, the goal is to obtain the best accuracy the model can reach, and selecting the right learning rate is crucial for that. As training progresses, it is usually beneficial to reduce the learning rate, and this can be done using pre-defined learning rate schedules or adaptive learning rate methods.

Learning rate schedules are methods that help improve model accuracy by letting the optimizer descend into areas of lower loss. To see why they are used to reduce the learning rate over time, let’s consider the standard weight update formula used in neural networks.
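The original equation image is not reproduced here, but the standard gradient descent update it refers to has the form

W = W - alpha * dL/dW

where W are the network weights and dL/dW is the gradient of the loss with respect to them.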

Alpha in the above equation is the learning rate, which controls the size of the step taken along the gradient. The learning rate is usually initialized to a value from a small set of standard candidates, such as the set below.
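The original figure listing that set is not reproduced here; typical candidate values are small powers of ten, for example:

alpha ∈ {0.1, 0.01, 0.001}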

A neural network is then trained for a fixed number of epochs with one of these initial values, without changing the learning rate.

The above method is feasible, but reducing the learning rate over time is usually beneficial. When training the network, the aim is to find an optimal point along the loss surface where the network obtains reasonable accuracy. If the learning rate is kept constantly high, the optimizer can easily overshoot the areas of low loss because the steps it takes are too large.

Instead, what can be done is to decrease the learning rate over time, allowing the network to take smaller steps. This decreased learning rate helps the network settle into the areas of low loss and reach an optimal point that a constant learning rate would have missed.

In this article, I will cover Keras’ standard learning rate decay along with three other learning rate schedules: step-based, linear, and polynomial.

The CIFAR-10 dataset contains 60,000 32x32 color images in 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The dataset is already split into 50,000 training images and 10,000 validation images.
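For reference, a minimal way to load and preprocess this dataset with Keras (the variable names here are illustrative):

from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

# load the 50,000 training and 10,000 validation images
(trainX, trainY), (testX, testY) = cifar10.load_data()

# scale pixel intensities to [0, 1] and one-hot encode the 10 class labels
trainX = trainX.astype("float32") / 255.0
testX = testX.astype("float32") / 255.0
trainY = to_categorical(trainY, 10)
testY = to_categorical(testY, 10)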

The following are the hyperparameters used for the training. An initial learning rate has been assigned, which is then modified by the learning rate schedules for better accuracy.
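The exact values are not reproduced here, but based on the numbers quoted later in the article (initial rate 0.01, 100 epochs, roughly 782 batch updates per epoch) they would look something like:

INIT_LR = 0.01   # initial learning rate, later modified by the schedules
EPOCHS = 100     # total number of training epochs
BATCH_SIZE = 64  # assumed batch size; 50,000 / 64 gives roughly the 782 updates per epoch used below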

To start with, the ResNet model is trained on CIFAR-10 with no learning rate schedule or decay (a constant learning rate throughout).
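A sketch of this baseline, assuming a hypothetical helper build_resnet() that constructs the architecture (only the optimizer setup matters here):

from tensorflow.keras.optimizers import SGD

# constant learning rate: no decay argument and no schedule callbacks
opt = SGD(learning_rate=INIT_LR)
model = build_resnet(input_shape=(32, 32, 3), classes=10)  # hypothetical model builder
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
H = model.fit(trainX, trainY, validation_data=(testX, testY),
              batch_size=BATCH_SIZE, epochs=EPOCHS)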

Here the accuracy obtained is around 85%, but the validation loss and accuracy remain constant after epoch 15 and do not improve for the rest of the 100 epochs. The objective here is to improve the accuracy by modifying the learning rate over time using learning rate schedules.

The Keras library provides a time-based learning rate schedule, which is controlled by the decay parameter of the Keras optimizer classes (SGD, Adam, etc.).

Below is the initialization of the ResNet architecture and the SGD optimizer with the decay parameter.

The decay in the SGD optimizer is set to be the learning rate divided by the number of epochs used to train the network.
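A sketch of that initialization, using the legacy Keras SGD signature in which decay is a constructor argument (newer Keras versions replace this with explicit schedule objects); build_resnet is the same hypothetical helper as above:

from tensorflow.keras.optimizers import SGD

# time-based decay: decay = initial learning rate / number of training epochs
opt = SGD(lr=INIT_LR, decay=INIT_LR / EPOCHS)
model = build_resnet(input_shape=(32, 32, 3), classes=10)
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])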

Keras applies the following learning rate schedule internally, which updates the learning rate after every batch update.

The update formula for the learning rate is
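lr = init_lr * 1.0 / (1.0 + decay * iterations)

where iterations is the total number of batch updates performed so far (not the epoch number). With the settings above, decay = 0.01 / 100 = 0.0001, and there are 782 batch updates per epoch.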

After the first epoch (782 batch updates), the learning rate will have dropped to

lr = 0.01 * 1.0/(1.0+0.0001 * (1*782)) = 0.00927

Similarly, after the second epoch it would be

lr = 0.01 * 1.0/(1.0+0.0001 * (2*782)) = 0.00864

And so on; over the 100 epochs, the learning rate gradually decreases, helping the network reach a lower loss and better accuracy.

Below is a depiction of the accuracy/loss curves of the ResNet model trained on the CIFAR-10 dataset with Keras’ standard learning rate decay.

The accuracy in the above curve is around 82%, which shows that a learning rate schedule does not always improve accuracy, so an appropriate scheduler/decay has to be chosen.

One such learning rate schedule is step-based decay, where the learning rate is systematically dropped after a set number of epochs during training. This decay can be understood as a piecewise function: the learning rate is constant for a number of epochs, then drops, then is constant again, and so on.

The learning rate in step decay can drop by half, or by an order of magnitude, after every fixed number of epochs. Since we are training with an initial learning rate of 0.01, let’s see how step decay works.

After ten epochs, if the learning rate is dropped by a factor of 0.5, the new learning rate would be 0.005. Similarly, after another ten epochs (20 epochs in total), the learning rate would drop to 0.0025.

The above learning rate schedule can be depicted in the following figure.

The above figure illustrates two different drops in the learning rate with respective drop factors. They are dropped by a factor of 0.5 (red line) and a factor of 0.25 (blue line).

The step-based decay equation can be defined as:
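One common way to write it (the original equation image is not reproduced here) is

new_lr = init_lr * F ^ floor(E / D)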

where F is the factor that controls how quickly the learning rate drops, D is the “drop every” epochs value, and E is the current epoch number. The larger the factor F, the slower the learning rate decays; conversely, the smaller F is, the faster the learning rate decays.
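As a sketch, such a schedule can be supplied to Keras through the LearningRateScheduler callback; the function name step_decay and the constants below are illustrative, matching the drop-by-0.5-every-10-epochs example above:

import numpy as np
from tensorflow.keras.callbacks import LearningRateScheduler

FACTOR = 0.5     # F: factor by which the learning rate is dropped
DROP_EVERY = 10  # D: drop the learning rate every 10 epochs

def step_decay(epoch):
    # piecewise-constant learning rate for the current epoch E
    return float(INIT_LR * (FACTOR ** np.floor(epoch / DROP_EVERY)))

callbacks = [LearningRateScheduler(step_decay)]
# pass callbacks=callbacks to model.fit() to activate the schedule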

Following are the accuracy/loss curves when training with a step-based decay learning rate schedule.

As you can see, the training/validation loss decreases and the training/validation accuracy increases whenever the learning rate is dropped, which demonstrates that the step-based learning rate scheduler is working as intended. The step-like pattern in the figure is the signature of a step-based learning rate schedule being used. The accuracy is around 87%, a clear improvement over the previous runs.

The remaining two schedules, linear and polynomial decay, reduce the learning rate according to a polynomial function of the epoch, so the rate at which the learning rate decays is controlled purely by the parameters of that function. Using these schedules, the learning rate can be decayed all the way to zero. A smaller exponent causes the learning rate to decay more slowly; conversely, a larger exponent causes it to decay faster.

The difference between the linear and polynomial schedules lies in the value of the exponent: if the exponent is 1.0, the schedule reduces to linear learning rate decay.
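A minimal sketch of such a polynomial schedule, assuming it decays the initial rate to zero over the full training run (the function name poly_decay is illustrative), using the same LearningRateScheduler callback as above:

def poly_decay(epoch, power=1.0):
    # fraction of training remaining, raised to the chosen exponent
    decay = (1.0 - (epoch / float(EPOCHS))) ** power
    return float(INIT_LR * decay)

# power=1.0 gives the linear schedule discussed next
linear_schedule = LearningRateScheduler(lambda epoch: poly_decay(epoch, power=1.0))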

Let’s visualize the output curves of the model when trained with a linear learning rate schedule. For the linear learning rate, the exponent of the polynomial function would be 1.0.

As seen in the above figure, there is a heavy drop in the training and validation loss; in fact, the training loss drops more significantly than the validation loss, which may be a sign of overfitting. The accuracy with the linear learning rate schedule is around 88%, which is better than the previous runs.

Now let’s visualize the training history of the network when trained with a polynomial learning rate schedule. For the polynomial learning rate, let’s set the exponent of the polynomial function to 5.0.
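With the same sketch as before, only the exponent changes:

poly_schedule = LearningRateScheduler(lambda epoch: poly_decay(epoch, power=5.0))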

Here the learning rate decays to its lowest value according to the polynomial function, and as a result the loss also drops significantly. The accuracy with the polynomial learning rate schedule is around 86%.

In the above experiments, when training the ResNet model on the CIFAR-10 dataset, the best accuracy of 88% was obtained with the linear learning rate schedule. This does not mean that a linear schedule always gives the best accuracy; if the hyperparameters used in these experiments are changed or fine-tuned, one of the other schedulers might perform better.
