Mini Batch Gradient Descent
Deep learning works well when there is a lot of training data. However, training on a large data set is slow, so below is one of the ways to optimize the training of a deep learning network.
Batch gradient descent
Batch gradient descent uses vectorization to process the whole data set without an explicit for loop, so we usually stack the training data into a matrix and process it in one go. However, batch gradient descent is slow when the data set is huge. For example, if the number of examples is m = 5 million, we would need to process all 5 million examples before taking one small step of gradient descent.
In this case, it turns out that we can make progress faster if we split the data into mini-batches and iterate over those mini-batches instead of the whole data set. So for our 5 million training examples, if we split them into mini-batches of 1,000 examples each, we get 5,000 mini-batches.
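To make the split concrete, here is a minimal sketch in NumPy of partitioning a shuffled training set into mini-batches. The helper name make_mini_batches and the column-per-example layout (shape (n_x, m)) are assumptions for illustration, not something fixed by these notes.

```python
import numpy as np

def make_mini_batches(X, Y, mini_batch_size=1000, seed=0):
    """Shuffle the m examples, then slice them into consecutive mini-batches."""
    m = X.shape[1]                      # one column per training example
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]

    mini_batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size   # the last mini-batch may be smaller
        mini_batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return mini_batches

# Toy data: 5,000 examples split into mini-batches of 1,000 gives 5 mini-batches,
# the same way 5M examples split into batches of 1,000 would give 5,000.
X = np.random.randn(10, 5000)
Y = np.random.randn(1, 5000)
batches = make_mini_batches(X, Y, mini_batch_size=1000)
print(len(batches))  # 5
```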
What's the right size of the mini-batches?
- Make sure each mini-batch fits in your CPU / GPU memory.
- Because of how computer memory works, the size is usually a power of two, e.g. 64, 128, ..., 512.
- 64, 128, 256, and 512 are all common sizes. 1024 is also possible, yet less common.
- If the size is 1, it becomes stochastic gradient descent.
- If the size is the number of training examples, it becomes batch gradient descent (the training loop sketched after this list covers the whole range).
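As a rough illustration of how the mini-batch size plugs into training, here is a sketch of one epoch of mini-batch gradient descent on a simple linear model with a squared-error cost, reusing the hypothetical make_mini_batches helper above. Setting mini_batch_size=1 gives stochastic gradient descent, and setting it to m gives batch gradient descent.

```python
def train_one_epoch(X, Y, W, b, mini_batch_size=64, learning_rate=0.01):
    """One pass over the data, taking one gradient step per mini-batch."""
    for X_mb, Y_mb in make_mini_batches(X, Y, mini_batch_size):
        m_mb = X_mb.shape[1]
        # Forward pass on the mini-batch only, still fully vectorized.
        Y_hat = W @ X_mb + b
        # Gradients of the mean squared error cost on this mini-batch.
        dW = (Y_hat - Y_mb) @ X_mb.T / m_mb
        db = np.sum(Y_hat - Y_mb, axis=1, keepdims=True) / m_mb
        # One gradient step per mini-batch, so many steps per pass over the data.
        W -= learning_rate * dW
        b -= learning_rate * db
    return W, b

# Toy usage with the X, Y from the earlier sketch.
W = np.zeros((1, X.shape[0]))
b = np.zeros((1, 1))
W, b = train_one_epoch(X, Y, W, b, mini_batch_size=256)
```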
How about the cost function?
The cost of batch gradient descent should decrease on every iteration. For mini-batch gradient descent, however, the cost is noisier: it may increase a bit on some mini-batches, but it should still have the tendency to decrease.
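One way to see this is to record the cost on each mini-batch and look at the overall trend rather than individual points. The sketch below reuses the toy model and helper above; in practice you would append the cost inside the training loop, right after computing Y_hat for each mini-batch.

```python
costs = []
for X_mb, Y_mb in make_mini_batches(X, Y, mini_batch_size=256):
    m_mb = X_mb.shape[1]
    Y_hat = W @ X_mb + b
    # Mean squared error cost on this mini-batch; individual values jump
    # around, but during training the curve should trend downward.
    costs.append(float(np.sum((Y_hat - Y_mb) ** 2) / (2 * m_mb)))
```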
What's the benefit of this optimization algorithm?
It helps us make progress faster: batch gradient descent is slow on a large training set, and stochastic gradient descent is slow because it loses the benefit of vectorization. Mini-batch gradient descent lets us take gradient steps without processing the whole data set while still enjoying the speed-up from vectorization.