Neural Scaling Laws
Paper Notes - non-exhaustive
January 2025
Abstract
Non-exhaustive notes on "Scaling Laws for Neural Language Models" (Kaplan et al., 2020).
Key Findings
(1)
Performance depends most strongly on model scale (non-embedding parameters N, dataset size D, and training compute C), and only weakly on architectural shape such as depth vs. width.
(2)
As long as performance isn't bottlenecked by the other two factors, it has a smooth power-law relationship with each of N, D, and C.
(3)
Performance improves predictably as long as N and D are scaled up in tandem, but hits diminishing returns if either is held fixed while the other grows. The overfitting penalty depends on the ratio $N^{0.74}/D$.
(4)
Training curves follow predictable power laws -- by fitting the early part of the loss curve, you can roughly predict the loss that would be achieved by training for longer (see the extrapolation sketch after this list).
(5)
Loss on text drawn from a distribution different from the training data is strongly correlated with loss on the training validation set, offset by an approximately constant penalty.
(6)
Large models are more sample-efficient than smaller models -- they reach the same level of performance with fewer optimization steps and fewer data points.
(7)
Training fully to convergence is compute-inefficient -- it's best to stop significantly short of full convergence.
(8)
The ideal training batch size is roughly a power of the loss and can be determined by measuring the gradient noise scale (see the noise-scale sketch after this list).
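As an illustration of (4), here is a minimal sketch that fits an offset power law to the early portion of a (synthetic) loss curve and extrapolates it to a longer run. The functional form, constants, and data are illustrative assumptions, not the paper's exact fitting procedure.

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical early training curve: loss recorded every 100 steps up to step 2,000.
    steps = np.arange(100, 2100, 100, dtype=float)
    loss = 6.0 * steps ** -0.08 + 2.0 + np.random.default_rng(0).normal(0, 0.005, steps.size)

    # Assumed functional form: an offset power law in the step count S.
    def offset_power_law(S, c, alpha, L_inf):
        return c * S ** -alpha + L_inf

    params, _ = curve_fit(offset_power_law, steps, loss, p0=(5.0, 0.1, 1.0))

    # Extrapolate the fit to a 10x longer run.
    print("fitted (c, alpha, L_inf):", params)
    print("predicted loss at step 20,000:", offset_power_law(20_000, *params))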
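For (8), the companion work the paper builds on (McCandlish et al., 2018) estimates the "simple" gradient noise scale $B_{\mathrm{simple}} = \mathrm{tr}(\Sigma)/|G|^2$ by comparing squared gradient norms measured at two batch sizes. Below is a minimal sketch of that estimator; the batch sizes and measured norms are made-up example numbers.

    # Squared gradient-norm estimates measured with a small and a large batch (hypothetical values).
    B_small, B_big = 32, 1024
    g2_small = 1.9e-2   # |G_est|^2 at batch size B_small
    g2_big = 1.2e-3     # |G_est|^2 at batch size B_big

    # Unbiased estimates of the true squared gradient norm and of the trace of the
    # per-example gradient covariance, via the two-batch-size trick.
    G2 = (B_big * g2_big - B_small * g2_small) / (B_big - B_small)
    tr_sigma = (g2_small - g2_big) / (1.0 / B_small - 1.0 / B_big)

    # Simple noise scale: roughly the batch size beyond which data parallelism stops paying off.
    B_simple = tr_sigma / G2
    print("estimated gradient noise scale:", B_simple)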
Laws
For models with a limited number of parameters, trained to convergence on a sufficiently large dataset:
$L(N) = (N_c / N)^{\alpha_N}$
where $\alpha_N \approx 0.076$ and $N_c \approx 8.8 \times 10^{13}$ (non-embedding parameters).
For large models trained on a limited dataset, with early stopping:
$L(D) = (D_c / D)^{\alpha_D}$
where $\alpha_D \approx 0.095$ and $D_c \approx 5.4 \times 10^{13}$ (tokens).
When training with limited compute, a sufficiently large dataset, an optimally sized model, and a batch size small enough to make optimal use of the compute:
$L(C_{\min}) = (C_c^{\min} / C_{\min})^{\alpha_C^{\min}}$
where $\alpha_C^{\min} \approx 0.050$ and $C_c^{\min} \approx 3.1 \times 10^{8}$ (PF-days).
These laws hold over roughly 8 orders of magnitude of compute.
They don't depend strongly on the specific architecture or other hyperparameters.
The exponents $\alpha_N$, $\alpha_D$, and $\alpha_C^{\min}$ set how much performance improves as $N$, $D$, or $C_{\min}$ is scaled up.
You can combine the $L(N)$ and $L(D)$ equations into a single scaling law (Eq. 1.5 in the paper): $L(N, D) = \left[ (N_c/N)^{\alpha_N/\alpha_D} + D_c/D \right]^{\alpha_D}$. A numerical sketch of these laws follows below.
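A minimal numerical sketch of the laws above, plugging in the fitted constants reported in the paper; the function names and the example model/dataset sizes are my own choices for illustration.

    # Approximate fitted constants from the paper.
    ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameters
    ALPHA_D, D_C = 0.095, 5.4e13   # tokens
    ALPHA_C, C_C = 0.050, 3.1e8    # PF-days

    def L_of_N(N):        # parameter-limited, trained to convergence
        return (N_C / N) ** ALPHA_N

    def L_of_D(D):        # data-limited, large model with early stopping
        return (D_C / D) ** ALPHA_D

    def L_of_Cmin(C):     # compute-limited, optimally sized model
        return (C_C / C) ** ALPHA_C

    def L_of_ND(N, D):    # combined law (Eq. 1.5)
        return ((N_C / N) ** (ALPHA_N / ALPHA_D) + D_C / D) ** ALPHA_D

    # Example: a hypothetical 1B-parameter model trained on 100B tokens.
    print(L_of_N(1e9), L_of_D(1e11), L_of_ND(1e9, 1e11), L_of_Cmin(1.0))

Note that $L(N, D)$ reduces to $L(N)$ as $D \to \infty$ and to $L(D)$ as $N \to \infty$.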
The critical batch size, which defines the tradeoff between speed and efficiency for data parallelism, obeys a power law in the loss: $B_{\mathrm{crit}}(L) \approx B_* / L^{1/\alpha_B}$, with $B_* \approx 2 \times 10^{8}$ tokens and $\alpha_B \approx 0.21$.
$B_{\mathrm{crit}}$ diverges to infinity as $L \to 0$, i.e. as the network approaches the theoretical optimum. Intuitively, with a smaller batch size your gradients are averaged over fewer samples, and since the surface of the loss gets more "bumpy" the closer you are to the minimum, it makes sense to increase the batch size as the loss decreases (see the sketch below).
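A quick sketch of the critical-batch-size relation using the paper's approximate fitted constants; the loss values plugged in are arbitrary examples (in nats).

    # Critical batch size as a power law in the loss (approximate constants from the paper).
    B_STAR = 2e8     # tokens
    ALPHA_B = 0.21

    def critical_batch_size(loss):
        # Grows without bound as the loss approaches zero.
        return B_STAR / loss ** (1.0 / ALPHA_B)

    for L in (5.0, 3.0, 2.0):   # example loss values
        print(f"L = {L}: B_crit ~ {critical_batch_size(L):.3g} tokens")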