RMS Prop & Adam Optimization
RMS Prop
RMS Prop stands for Root Mean Square Prop, another algorithm that can speed up gradient descent.
Implementation
On iteration t:
    Compute dW, db on the current mini-batch
    Sdw = beta * Sdw + (1 - beta) * dW^2
    Sdb = beta * Sdb + (1 - beta) * db^2
    W = W - alpha * dW / sqrt(Sdw)
    b = b - alpha * db / sqrt(Sdb)
*To avoid division by zero (and numerical instability when Sdw or Sdb becomes very small), we usually replace sqrt(Sdw) and sqrt(Sdb) with sqrt(Sdw + epsilon) and sqrt(Sdb + epsilon), respectively.
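As a concrete illustration, here is a minimal NumPy sketch of one RMS Prop step for a single layer's parameters. The function name `rmsprop_update` and the per-layer variables (`W`, `b`, `dW`, `db`, `Sdw`, `Sdb`) are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def rmsprop_update(W, b, dW, db, Sdw, Sdb,
                   alpha=0.001, beta=0.9, epsilon=1e-8):
    """One RMS Prop step on a single layer's parameters (illustrative sketch)."""
    # Exponentially weighted averages of the element-wise squared gradients
    Sdw = beta * Sdw + (1 - beta) * dW ** 2
    Sdb = beta * Sdb + (1 - beta) * db ** 2
    # Divide the gradient by the root mean square; epsilon guards against division by zero
    W = W - alpha * dW / np.sqrt(Sdw + epsilon)
    b = b - alpha * db / np.sqrt(Sdb + epsilon)
    return W, b, Sdw, Sdb
```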
Adam Optimization
Many optimization algorithms fail to generalize beyond the problems they were designed for; RMS Prop and Adam are among the few that have been shown to work well across a wide range of architectures.
The Adam optimization algorithm essentially combines momentum and RMS Prop.
Adam: Adaptive Moment Estimation
Implementation
Vdw = 0, Sdw = 0, Vdb = 0, Sdb = 0
On iteration t:
    Compute dW, db using the current mini-batch
    Vdw = beta1 * Vdw + (1 - beta1) * dW
    Vdb = beta1 * Vdb + (1 - beta1) * db
    Sdw = beta2 * Sdw + (1 - beta2) * dW^2
    Sdb = beta2 * Sdb + (1 - beta2) * db^2
    Vdw_corrected = Vdw / (1 - beta1^t)
    Vdb_corrected = Vdb / (1 - beta1^t)
    Sdw_corrected = Sdw / (1 - beta2^t)
    Sdb_corrected = Sdb / (1 - beta2^t)
    W = W - alpha * Vdw_corrected / sqrt(Sdw_corrected + epsilon)
    b = b - alpha * Vdb_corrected / sqrt(Sdb_corrected + epsilon)
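The same update can be written as a NumPy sketch; again, `adam_update` and the per-layer parameter names are assumptions made for illustration, not a reference implementation.

```python
import numpy as np

def adam_update(W, b, dW, db, Vdw, Vdb, Sdw, Sdb, t,
                alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step on a single layer's parameters (illustrative sketch)."""
    # Momentum-style exponentially weighted averages of the gradients
    Vdw = beta1 * Vdw + (1 - beta1) * dW
    Vdb = beta1 * Vdb + (1 - beta1) * db
    # RMS Prop-style averages of the element-wise squared gradients
    Sdw = beta2 * Sdw + (1 - beta2) * dW ** 2
    Sdb = beta2 * Sdb + (1 - beta2) * db ** 2
    # Bias correction (t is the 1-based iteration count)
    Vdw_c = Vdw / (1 - beta1 ** t)
    Vdb_c = Vdb / (1 - beta1 ** t)
    Sdw_c = Sdw / (1 - beta2 ** t)
    Sdb_c = Sdb / (1 - beta2 ** t)
    # Combined update: momentum direction scaled by the RMS term
    W = W - alpha * Vdw_c / np.sqrt(Sdw_c + epsilon)
    b = b - alpha * Vdb_c / np.sqrt(Sdb_c + epsilon)
    return W, b, Vdw, Vdb, Sdw, Sdb
```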
Hyperparameter choices
alpha : needs to be tuned
beta1 : 0.9
beta2 : 0.999
epsilon : 10^-8
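These are the same default values used in the sketches above. A hedged example of wiring them into a mini-batch training loop follows; `mini_batches`, `compute_gradients`, `W`, and `b` are placeholders assumed to exist elsewhere.

```python
# Illustrative loop around the adam_update sketch above (placeholders assumed).
Vdw, Vdb = np.zeros_like(W), np.zeros_like(b)
Sdw, Sdb = np.zeros_like(W), np.zeros_like(b)
for t, (X_batch, Y_batch) in enumerate(mini_batches, start=1):
    dW, db = compute_gradients(W, b, X_batch, Y_batch)
    W, b, Vdw, Vdb, Sdw, Sdb = adam_update(
        W, b, dW, db, Vdw, Vdb, Sdw, Sdb, t,
        alpha=0.001,                            # alpha typically needs tuning
        beta1=0.9, beta2=0.999, epsilon=1e-8)   # commonly used defaults
```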