StableMax
After a network has memorized the training data, cross-entropy still rewards making the logits larger: the argmax (and hence the training accuracy) stays the same, but the loss keeps decreasing, so the logits keep growing. Eventually the softmax saturates at the limit of floating-point precision, the predicted probabilities round to exactly 0 or 1, the gradient vanishes, and learning stalls; the paper calls this Softmax Collapse. The usual fix is regularization (e.g. weight decay), but replacing softmax with StableMax produces grokking without any regularization. [https://arxiv.org/abs/2501.04697 Grokking at the Edge of Numerical Stability]
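
A minimal PyTorch sketch, assuming the piecewise replacement for exp described in the paper (s(x) = x + 1 for x ≥ 0, 1/(1 − x) for x < 0); the function names here are illustrative, not the authors' reference implementation:

```python
import torch

def s(x: torch.Tensor) -> torch.Tensor:
    # Replacement for exp(x): linear growth for x >= 0, 1/(1 - x) for x < 0.
    # Unlike exp, it neither overflows for large positive logits nor
    # underflows to exactly 0 for large negative ones.
    neg = torch.clamp(x, max=0.0)  # x where x < 0, else 0
    pos = torch.clamp(x, min=0.0)  # x where x >= 0, else 0
    return torch.where(x >= 0, pos + 1.0, 1.0 / (1.0 - neg))

def stablemax(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Softmax with exp replaced by s: normalize s(logits) into a distribution.
    sx = s(logits)
    return sx / sx.sum(dim=dim, keepdim=True)

def stablemax_cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Cross-entropy computed on StableMax probabilities instead of softmax ones.
    probs = stablemax(logits, dim=-1)
    return -torch.log(probs.gather(1, targets.unsqueeze(1))).mean()

# Large logits of the kind produced long after memorization:
big = torch.tensor([[200.0, 80.0, -80.0]])
print(torch.softmax(big, dim=-1))  # tensor([[1., 0., 0.]]): saturated, cross-entropy gradient is zero
print(stablemax(big))              # still a graded distribution, gradient survives
```

The clamps keep the unused branch of torch.where finite, which avoids inf/NaN leaking into the backward pass.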