Introduction
It is commonly observed that the loss fluctuates a lot while training a model, especially a CNN. In this article we will look at the reasons for this fluctuation and how you can minimize it.
Reasons for Fluctuations in Loss During Training
Several factors can cause fluctuations in training loss over epochs.
The main one is that almost all neural networks are trained with some variant of mini-batch gradient descent, such as SGD or Adam, which inherently oscillates during descent.
If you used all the samples for every update, you would see the loss decrease steadily and finally settle at a limit, but this takes a lot of time and computational power. This is why the batch_size parameter exists: it determines how many samples are used for one update to the model parameters. Mini-batching, however, brings fluctuation in loss and accuracy along with it.
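As a concrete illustration, here is a minimal Keras sketch (assuming TensorFlow/Keras and synthetic random data, neither of which comes from the article) where batch_size controls how many samples contribute to each parameter update:

```python
import numpy as np
from tensorflow import keras

# Hypothetical synthetic data: 1000 RGB images of size 32x32, 10 classes.
x_train = np.random.rand(1000, 32, 32, 3).astype("float32")
y_train = np.random.randint(0, 10, size=(1000,))

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# batch_size sets how many samples feed each update; smaller batches give
# noisier gradient estimates and therefore a more jagged loss curve.
history = model.fit(x_train, y_train, batch_size=32, epochs=5, verbose=0)
print(history.history["loss"])  # per-epoch loss; the jitter is mini-batch noise
```

With batch_size=1000 (the whole dataset) every update uses the exact gradient and the loss curve is smooth; with batch_size=4 each update is a noisy estimate and the curve fluctuates.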
Very small batch_size
A second reason for fluctuation is using a very small batch_size. In effect, you are trusting each small portion of the data points to drive an entire update.
Now assume that your data contains a mislabelled sample. When this sample is batched with only 2-3 properly labelled samples, it can produce an update that does not decrease the global loss but instead increases it, pushing the descent away from a local minimum.
Solution – With a larger batch_size, such effects are averaged out. For this and other reasons, it is good to keep batch_size above some minimum. Making it too large, however, slows training and increases memory requirements. Therefore, batch_size is treated as a hyper-parameter whose optimal value must be tuned. A small numerical sketch of this averaging effect follows below.
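The averaging effect can be shown with a short NumPy sketch. The per-sample losses and the single outlier below are made-up numbers for illustration only: the spread of the batch-mean loss shrinks as batch_size grows, so one bad sample has far less influence on any single update.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-sample losses, with one mislabelled outlier.
losses = rng.normal(loc=1.0, scale=0.1, size=1000)
losses[0] = 10.0  # the mislabelled sample contributes a huge loss

for batch_size in (4, 32, 256):
    # Estimate the batch-mean loss many times to measure how noisy it is.
    estimates = [rng.choice(losses, size=batch_size, replace=False).mean()
                 for _ in range(2000)]
    print(f"batch_size={batch_size:4d}  std of batch loss = {np.std(estimates):.4f}")
```

Running this prints a steadily decreasing standard deviation, which is exactly the reduced fluctuation you see in the training curve at larger batch sizes.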
Large network, small dataset
Another reason is training a relatively large network, with 100K+ parameters, on a very small number of samples, say just 100.
In other words, if you try to learn 100K parameters, i.e. find a good local minimum in a 100K-dimensional space, using only 100 samples, it will be very difficult, and you will end up with lots of fluctuation in loss or accuracy rather than converging to a good local minimum.
Solution – We should use a network with fewer parameters (i.e., a lighter network) when the sample size is small, or increase the sample size.
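As a rough sketch of what "lighter" means here, the hypothetical helper below (again assuming TensorFlow/Keras; the layer sizes are illustrative, not from the article) builds two CNNs and compares their parameter counts:

```python
from tensorflow import keras

def build_model(filters, dense_units):
    # Small CNN whose capacity is set by the filter count and dense width.
    return keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        keras.layers.Conv2D(filters, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Flatten(),
        keras.layers.Dense(dense_units, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])

heavy = build_model(filters=64, dense_units=256)  # millions of parameters
light = build_model(filters=8, dense_units=16)    # tens of thousands
print("heavy:", heavy.count_params(), "parameters")
print("light:", light.count_params(), "parameters")
```

If you only have ~100 samples, the light model is the far better match: its parameter count is closer to the amount of information in the data, so training is more stable.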
If you want to see a code example of how to train a model for object detection and image classification, you can go to Training a CNN model from scratch using custom dataset.