A paper came out recently (2 days ago at the time of writing) titled ‘Symbolic Discovery of Optimization Algorithms’, which presents a discovered optimizer, Lion, found through a genetic-algorithm-style search over symbolic programs. I trained a small toy language model to do a quick evaluation of the optimizer. Because of my hardware limitations, the evaluation uses a very small model with sub-optimal hyperparameters so that it fits on my device and runs in a reasonable time.
I trained a 124M parameter GPT-2 model from scratch on 16384 examples from the “stanford-crfm/DSIR-filtered-pile-50M” dataset, with a relatively small batch size of 32 and a sequence length of 256. The control model was optimized with Adam using a linearly decaying learning rate starting at 1e-3. The paper suggests using roughly 1/10th of the learning rate you would use with Adam, so the experimental model was trained with Lion at a learning rate of 1e-4. The implementation of Lion I used is located at https://github.com/lucidrains/lion-pytorch
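For concreteness, here is a minimal sketch of how the two optimizers can be set up with the lion-pytorch package. The Hugging Face GPT-2 classes, the `LinearLR` schedule, and Lion's default betas/weight decay are my assumptions for illustration, not necessarily the exact configuration of the run:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel
from lion_pytorch import Lion

# Default GPT-2 small config (~124M parameters); assumed for illustration
model = GPT2LMHeadModel(GPT2Config())

TOTAL_STEPS = 2048  # length of the Adam control run described below

# Control: Adam with a linearly decaying learning rate starting at 1e-3
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
adam_sched = torch.optim.lr_scheduler.LinearLR(
    adam, start_factor=1.0, end_factor=0.0, total_iters=TOTAL_STEPS
)

# Experimental: Lion at 1/10th the Adam learning rate, same linear decay
lion = Lion(model.parameters(), lr=1e-4)
lion_sched = torch.optim.lr_scheduler.LinearLR(
    lion, start_factor=1.0, end_factor=0.0, total_iters=TOTAL_STEPS
)

# In each training step: optimizer.step() followed by scheduler.step()
```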
The paper suggests more gains are made with batch sizes >64.
When plotting the EMA of the loss, the Adam model ends its run at 2048 steps with a training loss of 5.93, while the Lion model hits a similar training loss at 1280 steps, using only 62.5% of the steps that the Adam model used. The Lion model ended with a training loss of 5.60 and an evaluation loss of 5.44, while the Adam model ended with an evaluation loss of 5.77.
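(The EMA here is just a standard exponential moving average over the per-step training losses; the smoothing factor below is my own choice for illustration, not necessarily the one used for the plot.)

```python
def ema(losses, alpha=0.99):
    """Smooth a per-step loss curve with an exponential moving average."""
    smoothed, avg = [], losses[0]
    for loss in losses:
        avg = alpha * avg + (1 - alpha) * loss
        smoothed.append(avg)
    return smoothed
```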
Update: I should have done *some* hparam searching, because at batch size 64 the Lion optimizer converges to the same loss as Adam at batch size 32 in only ~650 steps. I’d like to do a larger-scale run of both, but the ETA is showing ~26 hours for one run with the parameters I’m trying now.
Update 2: at batch size 64, Lion took 51.02% of the steps that Adam did to reach the same loss, suggesting the paper was right that the gains are larger at higher batch sizes.