Neural Scaling Laws
Paper Notes - non-exhaustive
January 2025
Key Findings
(1)
Performance depends strongly on model scale and only weakly on the model architecture/shape.
If performance is measured by the (cross-entropy) test loss L, then L is determined primarily by the number of model parameters N, the size of the dataset D, and the amount of compute C used for training.
(2)
As long as performance is not bottlenecked by the other two of N, D, C, increasing the third variable improves the loss predictably, following a power law.
(3)
Performance improves as both N and D increase, in a predictable manner -- but there are diminishing returns (a penalty) if one of N or D is fixed while the other increases.
The penalty depends on the ratio $N^{0.74} / D$: if we increase $N$ by 8x, then we must increase $D$ by roughly 5x to avoid paying the penalty.
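A quick arithmetic check of that claim, just plugging the reported exponent into the ratio (nothing here is fitted):

```python
# If the penalty scales with N**0.74 / D, then keeping that ratio fixed while
# N grows 8x requires D to grow by 8**0.74.
growth_in_N = 8
required_growth_in_D = growth_in_N ** 0.74
print(required_growth_in_D)  # ~4.66, i.e. roughly 5x more data
```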
(4)
Training curves follow predictable power laws -- by fitting the beginning of the loss curve, you can roughly predict the loss that would be achieved by training for longer.
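A minimal sketch of the idea: fit a power-law-plus-constant form to the early portion of a loss curve and extrapolate. The functional form, constants, and "observed" losses below are illustrative assumptions, not the paper's actual fits.

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed power-law form for loss vs. training step S: L(S) = L_inf + a * S**(-b)
def loss_curve(S, L_inf, a, b):
    return L_inf + a * S ** (-b)

# Synthetic "observed" early-training losses (made-up numbers for illustration).
np.random.seed(0)
steps = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
losses = loss_curve(steps, L_inf=2.5, a=30.0, b=0.4) + np.random.normal(0, 0.01, steps.size)

# Fit on the early part of the curve...
params, _ = curve_fit(loss_curve, steps, losses, p0=(1.0, 10.0, 0.5))

# ...then extrapolate to a much longer run.
print("predicted loss at step 100k:", loss_curve(1e5, *params))
```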
(5)
Loss curves on the training validation set are strongly correlated with those on held-out test data -- offset by an approximately constant amount of loss.
(6)
Large models are more sample-efficient than smaller models -- they are able to reach the same level of performance with fewer optimization steps.
(7)
Training fully to convergence is compute-inefficient -- it's best to stop significantly short of full convergence.
(8)
The ideal batch size for training these models is roughly a power of the loss alone, and can be determined by measuring the gradient noise scale.
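A rough sketch of measuring a "simple" gradient noise scale, $B_{\mathrm{simple}} = \mathrm{tr}(\Sigma) / |G|^2$ (per-example gradient covariance over squared mean gradient), on a toy linear-regression problem. The toy model and data are assumptions for illustration only; the paper itself leans on prior gradient-noise-scale work rather than this exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: per-example gradient of 0.5 * (x.w - y)**2 w.r.t. w.
n, d = 4096, 16
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

residual = X @ w - y                        # (n,)
per_example_grads = residual[:, None] * X   # (n, d), one gradient per example

G = per_example_grads.mean(axis=0)                  # full-batch ("true") gradient
sigma_trace = per_example_grads.var(axis=0).sum()   # trace of per-example gradient covariance

# Simple noise scale: roughly the batch size beyond which averaging more
# examples stops reducing gradient noise in a useful way.
B_simple = sigma_trace / np.dot(G, G)
print("estimated noise scale ~", B_simple)
```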
Laws
For models with a limited number of parameters, trained to convergence on a sufficiently large dataset:
$L(N) = (N_c / N)^{\alpha_N}$, where $\alpha_N \approx 0.076$, $N_c \approx 8.8 \times 10^{13}$, and $N$ is the number of (non-embedding) parameters of the model.
For large models trained on a limited dataset, with early stopping:
$L(D) = (D_c / D)^{\alpha_D}$, where $\alpha_D \approx 0.095$, $D_c \approx 5.4 \times 10^{13}$, and $D$ is the number of tokens.
When training with limited compute, a sufficiently large dataset, an optimally sized model, and a sufficiently small batch size (making optimal use of the compute):
$L(C_{\min}) = (C_c^{\min} / C_{\min})^{\alpha_C^{\min}}$, where $\alpha_C^{\min} \approx 0.050$ and $C_c^{\min} \approx 3.1 \times 10^{8}$ (PF-days).
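The three laws above as code, using the approximate exponents and constants reported in the paper ($N$ in non-embedding parameters, $D$ in tokens, $C_{\min}$ in PF-days):

```python
# Approximate constants reported in the paper.
ALPHA_N, N_C = 0.076, 8.8e13   # N_c in non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # D_c in tokens
ALPHA_C, C_C = 0.050, 3.1e8    # C_c^min in PF-days

def L_of_N(N):
    """Loss for a parameter-limited model trained on effectively unlimited data."""
    return (N_C / N) ** ALPHA_N

def L_of_D(D):
    """Loss for a large model trained on a limited dataset with early stopping."""
    return (D_C / D) ** ALPHA_D

def L_of_Cmin(C_min):
    """Loss when compute-limited but with optimal model size and batch size."""
    return (C_C / C_min) ** ALPHA_C

# Example: predicted loss for a 1B-parameter model with plenty of data
# (roughly 2.4 nats/token under these constants).
print(L_of_N(1e9))
```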
These laws hold over eight orders of magnitude in compute ($C_{\min}$), six orders of magnitude in the number of parameters ($N$), and two orders of magnitude in dataset size ($D$).
They don't depend strongly on the specific architecture or other hyperparameters.
The exponents $\alpha_N$, $\alpha_D$, $\alpha_C^{\min}$ denote the degree of performance improvement expected as $N$, $D$, $C_{\min}$ are scaled up.
You can combine the $L(N)$ and $L(D)$ laws into a single scaling law, $L(N, D) = \left[ (N_c / N)^{\alpha_N / \alpha_D} + D_c / D \right]^{\alpha_D}$ (Eq. 1.5 in the paper).
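A sketch of the combined law with the same constants; the specific $N$ and $D$ values plugged in below are just illustrative:

```python
# Same approximate constants as above.
ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_D, D_C = 0.095, 5.4e13

def L_of_N_and_D(N, D):
    """Eq. 1.5: reduces to L(N) as D -> infinity and to L(D) as N -> infinity."""
    return ((N_C / N) ** (ALPHA_N / ALPHA_D) + D_C / D) ** ALPHA_D

# Diminishing returns: with D held fixed, growing N eventually stops helping.
for N in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {N:.0e}, D = 1e10 tokens -> L = {L_of_N_and_D(N, 1e10):.3f}")
```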
The critical batch size, which defines the tradeoff between speed and efficiency for data-parallel training, obeys a power law in the loss:
$B_{\mathrm{crit}}(L) = B_* / L^{1 / \alpha_B}$, where $B_* \approx 2 \times 10^{8}$ tokens and $\alpha_B \approx 0.21$.
$B_{\mathrm{crit}}$ diverges to infinity as $L \to 0$, i.e. as the network approaches the theoretical optimum.
Intuitively, with a smaller batch size your gradients are averaged over fewer samples and are therefore noisier; since the loss surface gets more "bumpy" the closer you are to the minimum, it only makes sense to increase your batch size as the loss decreases.
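As a sketch, with the reported $B_* \approx 2 \times 10^8$ tokens and $\alpha_B \approx 0.21$:

```python
B_STAR = 2e8     # tokens
ALPHA_B = 0.21

def B_crit(L):
    """Critical batch size (in tokens) at loss L; grows without bound as L -> 0."""
    return B_STAR / L ** (1.0 / ALPHA_B)

# The better the model (lower loss), the larger the batch it can productively use.
for L in (4.0, 3.0, 2.5, 2.0):
    print(f"L = {L}: B_crit ~ {B_crit(L):.2e} tokens")
```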