Neural Scaling Laws

Paper Notes - non-exhaustive

January 2025

Abstract

Paper Notes

Key Findings

(1)

Performance depends more on model scale than on the specific architecture.

If performance $P$ is a function of scale $S$, and $S$ is determined by the number of parameters $N$, the size of the dataset $D$, and the amount of compute $C$, then:

$$P \sim P(N, D, C)$$

(2)

As long as $P$ is not bottlenecked by two of $(N, D, C)$, scaling up the third variable improves $P$ as a power law.

(3)

Performance improves predictably as both $N$ and $D$ increase, but there are diminishing returns (a penalty) if one of $N$ or $D$ is held fixed while the other increases.

The penalty depends on the ratio $\frac{N^{0.74}}{D}$: if we increase $N$ by $8\times$, we must increase $D$ by roughly $5\times$ to avoid a penalty.
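As a quick arithmetic check: holding $\frac{N^{0.74}}{D}$ fixed, scaling $N$ by $8\times$ scales $N^{0.74}$ by $8^{0.74} \approx 4.7$, so $D$ must grow by roughly $5\times$ to keep the ratio constant.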

(4)

Training curves follow predictable power laws -- by analyzing the beginning of the loss curve, you can roughly predict the loss that would be reached if training continued for longer.
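A minimal sketch of what that kind of extrapolation could look like, assuming a pure power law in the number of steps fitted in log-log space; the step counts and loss values below are synthetic, and this is not the paper's exact fitting procedure:

```python
import numpy as np

# Hypothetical early training curve: loss vs. optimization step
# (synthetic numbers standing in for real training logs).
steps = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
losses = np.array([5.10, 4.55, 4.05, 3.60, 3.20, 2.85])

# Assume a pure power law L(s) ~ a * s^(-alpha) and fit it as a line in log-log space.
slope, intercept = np.polyfit(np.log(steps), np.log(losses), deg=1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate to a much longer run.
s_future = 100_000.0
predicted = a * s_future ** (-alpha)
print(f"fitted alpha ~ {alpha:.3f}, predicted loss at {s_future:.0f} steps ~ {predicted:.2f}")
```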

(5)

Loss curves on held-out test sets are strongly correlated with those on the training validation set, offset by an approximately constant loss.

(6)

Large models are more sample-efficient than smaller models -- they reach the same level of performance with fewer optimization steps and fewer data points.

(7)

Training to full convergence is compute-inefficient -- it's best to stop significantly short of full convergence.

(8)

The ideal batch size for training these models is roughly a power of the loss, and can be determined by measuring the gradient noise scale.
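A naive sketch of the "simple" gradient noise scale, $\mathrm{tr}(\Sigma)/|G|^2$, estimated from a batch of per-example gradients; the data is synthetic, the estimator skips the bias corrections used in practice, and the variable names are made up for illustration:

```python
import numpy as np

# Naive sketch (synthetic data): "simple" gradient noise scale ~ tr(Sigma) / |G|^2,
# where Sigma is the per-example gradient covariance and G the true gradient.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=1000)                                        # stand-in "true" gradient
per_example_grads = true_grad + rng.normal(scale=5.0, size=(256, 1000))  # noisy per-example grads

g_hat = per_example_grads.mean(axis=0)                     # batch estimate of the gradient
trace_sigma = per_example_grads.var(axis=0, ddof=1).sum()  # estimate of tr(Sigma)
b_simple = trace_sigma / np.dot(g_hat, g_hat)              # biased but simple estimate
print(f"estimated gradient noise scale ~ {b_simple:.0f} examples")
```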

Laws

For models with a limited number of parameters, trained to full convergence on a sufficiently large dataset $D$:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

where $N_c \sim 8.8 \times 10^{13}$, $\alpha_N \sim 0.076$, and $N$ is the number of model parameters (excluding embeddings).

For large models trained on a limited dataset, with early stopping:

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$

where $\alpha_D \sim 0.095$, $D_c \sim 5.4 \times 10^{13}$ tokens, and $D$ is the number of tokens in the dataset.

When training with a limited amount of compute, a sufficiently large dataset, an optimally sized model, and a sufficiently small batch size (to make optimal use of the compute):

$$L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}$$

where $\alpha_C^{\min} \sim 0.050$ and $C_c^{\min} \sim 3.1 \times 10^8$ PF-days.
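A quick numerical sketch that plugs example values into these three fits; the constants are the ones quoted above, while the function names and example inputs are made up for illustration:

```python
# Illustrative evaluation of the three scaling-law fits quoted above.
# Units follow the notes: N in parameters, D in tokens, C_min in PF-days.

def loss_from_params(n: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """L(N) = (N_c / N)^alpha_N -- parameter-limited, trained to convergence."""
    return (n_c / n) ** alpha_n

def loss_from_data(d: float, d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """L(D) = (D_c / D)^alpha_D -- data-limited, with early stopping."""
    return (d_c / d) ** alpha_d

def loss_from_compute(c_min: float, c_c: float = 3.1e8, alpha_c: float = 0.050) -> float:
    """L(C_min) = (C_c_min / C_min)^alpha_C_min -- compute-limited, optimally sized model."""
    return (c_c / c_min) ** alpha_c

print(loss_from_params(1e9))    # ~1B-parameter model
print(loss_from_data(1e10))     # ~10B training tokens
print(loss_from_compute(1.0))   # 1 PF-day of compute
```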

These laws hold over eight orders of magnitude of compute ($C_{\min}$), six orders of magnitude of parameters ($N$), and two orders of magnitude of dataset size in tokens ($D$).

They don't depend strongly on the specific architecture or other hyperparameters.

The exponents $\alpha_X$, where $X \in \{N, D, C\}$, denote the degree of improvement as the corresponding quantity is scaled up.

The $N$ and $D$ laws can be combined into a single scaling law (see Eq. 1.5 in the paper).
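For reference, my recollection of that combined form (the paper's Eq. 1.5); worth double-checking against the paper itself:

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D}\right]^{\alpha_D}$$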

The critical batch size, which defines the tradeoff between speed and efficiency for data parallelism, obeys a power law in $L$:

$$B_{\text{crit}}(L) = \frac{B_*}{L^{1/\alpha_B}}, \qquad B_* \sim 2 \times 10^8 \text{ tokens}, \quad \alpha_B \sim 0.21$$

$B_{\text{crit}}$ diverges to infinity as $L \rightarrow 0$, i.e. as the neural network approaches the theoretical optimum.

Intuitively, with a smaller batch size your gradients are averaged over a smaller set of samples; given that the loss surface gets more "bumpy" the closer you are to the minimum, it only makes sense to increase the batch size as $L \rightarrow 0$.
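A final numerical sketch of this batch-size law; the loss values are made up and the constants are the ones quoted above:

```python
# Illustrative: critical batch size B_crit(L) = B_* / L^(1/alpha_B),
# with the constants quoted above (B_* ~ 2e8 tokens, alpha_B ~ 0.21).
# As the loss L falls toward 0, B_crit blows up.

B_STAR = 2e8       # tokens
ALPHA_B = 0.21

def critical_batch_size(loss: float) -> float:
    """Critical batch size (in tokens) at a given loss value."""
    return B_STAR / loss ** (1.0 / ALPHA_B)

for loss in (4.0, 3.0, 2.0, 1.5):
    print(f"L = {loss:.1f} -> B_crit ~ {critical_batch_size(loss):.2e} tokens")
```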