RMS Prop & Adam Optimization

RMS Prop

RMS Prop stands for Root Mean Square Propagation; like momentum, it is another algorithm that can speed up gradient descent.

Implementation

Initialize Sdw = 0, Sdb = 0
On iteration t:
    Compute dW, db on the current mini-batch
    Sdw = beta * Sdw + (1 - beta) * dW^2      (squaring is element-wise)
    Sdb = beta * Sdb + (1 - beta) * db^2
    W = W - alpha * dW / sqrt(Sdw)
    b = b - alpha * db / sqrt(Sdb)

*To avoid numerical problems (division by zero or by a very small number), we usually replace sqrt(Sdw) and sqrt(Sdb) with sqrt(Sdw + epsilon) and sqrt(Sdb + epsilon) respectively.
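
As a concrete illustration, here is a minimal NumPy sketch of a single RMS Prop update step for one layer's parameters, following the pseudocode above. The function name rmsprop_update and the default values for alpha and beta are illustrative choices, not part of the notes.

import numpy as np

def rmsprop_update(W, b, dW, db, Sdw, Sdb, alpha=0.001, beta=0.9, epsilon=1e-8):
    # Exponentially weighted average of the element-wise squared gradients
    Sdw = beta * Sdw + (1 - beta) * dW ** 2
    Sdb = beta * Sdb + (1 - beta) * db ** 2
    # Scale the step by the root mean square; epsilon guards against division by zero
    W = W - alpha * dW / np.sqrt(Sdw + epsilon)
    b = b - alpha * db / np.sqrt(Sdb + epsilon)
    return W, b, Sdw, Sdb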

Adam Optimization

Many optimization algorithms fail to generalize across problems; RMS Prop and Adam are among the few that have been shown to work well on a wide range of deep learning architectures.

The Adam optimization algorithm essentially combines momentum and RMS Prop.

Adam: Adaptive Moment Estimation

Implementation

Vdw = 0, Sdw = 0, Vdb = 0, Sdb = 0
On iteration t:
    Compute dW, db on the current mini-batch
    Vdw = beta1 * Vdw + (1 - beta1) * dW
    Vdb = beta1 * Vdb + (1 - beta1) * db
    Sdw = beta2 * Sdw + (1 - beta2) * dW^2
    Sdb = beta2 * Sdb + (1 - beta2) * db^2
    Vdw_corrected = Vdw / (1 - beta1^t)
    Vdb_corrected = Vdb / (1 - beta1^t)
    Sdw_corrected = Sdw / (1 - beta2^t)
    Sdb_corrected = Sdb / (1 - beta2^t)
    W = W - alpha * Vdw_corrected / sqrt(Sdw_corrected + epsilon)
    b = b - alpha * Vdb_corrected / sqrt(Sdb_corrected + epsilon)
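
Below is a corresponding NumPy sketch of one Adam update step for a single layer, mirroring the pseudocode above. The function name adam_update is an illustrative choice; t is the 1-indexed iteration counter used in the bias correction.

import numpy as np

def adam_update(W, b, dW, db, Vdw, Vdb, Sdw, Sdb, t,
                alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Momentum-style first-moment estimates of the gradients
    Vdw = beta1 * Vdw + (1 - beta1) * dW
    Vdb = beta1 * Vdb + (1 - beta1) * db
    # RMS Prop-style second-moment estimates (element-wise squared gradients)
    Sdw = beta2 * Sdw + (1 - beta2) * dW ** 2
    Sdb = beta2 * Sdb + (1 - beta2) * db ** 2
    # Bias correction compensates for the zero initialization of V and S
    Vdw_corrected = Vdw / (1 - beta1 ** t)
    Vdb_corrected = Vdb / (1 - beta1 ** t)
    Sdw_corrected = Sdw / (1 - beta2 ** t)
    Sdb_corrected = Sdb / (1 - beta2 ** t)
    # Combined update: momentum direction scaled by the RMS of the gradients
    W = W - alpha * Vdw_corrected / np.sqrt(Sdw_corrected + epsilon)
    b = b - alpha * Vdb_corrected / np.sqrt(Sdb_corrected + epsilon)
    return W, b, Vdw, Vdb, Sdw, Sdb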

Hyperparameter choices

alpha : needs to be tuned
beta1 : 0.9 (recommended default for the first-moment term)
beta2 : 0.999 (recommended default for the second-moment term)
epsilon : 10^-8 (rarely needs to be tuned)
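
For example, these defaults could be passed to the adam_update sketch above. The loop, parameter shapes, and random stand-in gradients below are purely illustrative and not taken from the notes.

import numpy as np

# Toy parameters for one layer; shapes and gradients are placeholders
W, b = np.random.randn(3, 2), np.zeros((3, 1))
Vdw, Vdb = np.zeros_like(W), np.zeros_like(b)
Sdw, Sdb = np.zeros_like(W), np.zeros_like(b)

for t in range(1, 101):  # t starts at 1 so the bias correction is well defined
    dW, db = np.random.randn(*W.shape), np.random.randn(*b.shape)  # stand-in gradients
    W, b, Vdw, Vdb, Sdw, Sdb = adam_update(W, b, dW, db, Vdw, Vdb, Sdw, Sdb, t,
                                           alpha=0.001,  # alpha still needs tuning in practice
                                           beta1=0.9, beta2=0.999, epsilon=1e-8)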