Neural Scaling Laws

Paper Notes - non-exhaustive

January 2025

Abstract

Paper Notes

Key Findings

(1)

Performance depends more on model scale than on the specific architecture.

If performance $P$ is a function of scale $S$, and $S$ is determined by the number of parameters $N$, the size of the dataset $D$, and the amount of compute $C$, then:

$$P \sim P(N, D, C)$$

(2)

As long as $P$ is not bottlenecked by two of $(N, D, C)$, scaling up the third variable improves $P$ as a power law.

(3)

Performance improves predictably as both $N$ and $D$ increase, but there are diminishing returns (a penalty) if one of $N$ or $D$ is held fixed while the other increases.

The penalty depends on the ratio $\frac{N^{0.74}}{D}$: if we increase $N$ by $8\times$, we must increase $D$ by roughly $5\times$ to avoid a penalty.
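As a quick arithmetic check: holding $\frac{N^{0.74}}{D}$ fixed, scaling $N$ by $8\times$ scales $N^{0.74}$ by $8^{0.74} \approx 4.7$, so $D$ must grow by roughly $5\times$ to keep the ratio constant.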

(4)

Training curves follow predictable power laws -- by analyzing the beginning of the loss curve, you can roughly predict the loss that would be reached if training continued for longer.
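A minimal sketch of what that kind of extrapolation could look like, assuming a pure power law in the number of steps fitted in log-log space; the step counts and loss values below are synthetic, and this is not the paper's exact fitting procedure:

```python
import numpy as np

# Hypothetical early training curve: loss vs. optimization step
# (synthetic numbers standing in for real training logs).
steps = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
losses = np.array([5.10, 4.55, 4.05, 3.60, 3.20, 2.85])

# Assume a pure power law L(s) ~ a * s^(-alpha) and fit it as a line in log-log space.
slope, intercept = np.polyfit(np.log(steps), np.log(losses), deg=1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate to a much longer run.
s_future = 100_000.0
predicted = a * s_future ** (-alpha)
print(f"fitted alpha ~ {alpha:.3f}, predicted loss at {s_future:.0f} steps ~ {predicted:.2f}")
```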

(5)

Loss curves on held-out test sets are strongly correlated with those on the training validation set, offset by an approximately constant loss.

(6)

Large models are more sample-efficient than smaller models -- they reach the same level of performance with fewer optimization steps and fewer data points.

(7)

Training to full convergence is compute-inefficient -- it's best to stop significantly short of full convergence.

(8)

The ideal batch size for training these models is roughly a power of the loss, and can be determined by measuring the gradient noise scale.
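A naive sketch of the "simple" gradient noise scale, $\mathrm{tr}(\Sigma)/|G|^2$, estimated from a batch of per-example gradients; the data is synthetic, the estimator skips the bias corrections used in practice, and the variable names are made up for illustration:

```python
import numpy as np

# Naive sketch (synthetic data): "simple" gradient noise scale ~ tr(Sigma) / |G|^2,
# where Sigma is the per-example gradient covariance and G the true gradient.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=1000)                                        # stand-in "true" gradient
per_example_grads = true_grad + rng.normal(scale=5.0, size=(256, 1000))  # noisy per-example grads

g_hat = per_example_grads.mean(axis=0)                     # batch estimate of the gradient
trace_sigma = per_example_grads.var(axis=0, ddof=1).sum()  # estimate of tr(Sigma)
b_simple = trace_sigma / np.dot(g_hat, g_hat)              # biased but simple estimate
print(f"estimated gradient noise scale ~ {b_simple:.0f} examples")
```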

Laws

For models with a limited number of parameters, trained to full convergence on a sufficiently large dataset $D$:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

where $N_c \sim 8.8 \times 10^{13}$, $\alpha_N \sim 0.076$, and $N$ is the number of model parameters (excluding embeddings).

For large models trained on a limited dataset, with early stopping:

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$

where $\alpha_D \sim 0.095$, $D_c \sim 5.4 \times 10^{13}$ tokens, and $D$ is the number of tokens in the dataset.

When training with a limited amount of compute, a sufficiently large dataset, an optimally sized model, and a sufficiently small batch size (to make optimal use of the compute):

$$L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}$$

where $\alpha_C^{\min} \sim 0.050$ and $C_c^{\min} \sim 3.1 \times 10^8$ PF-days.
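A quick numerical sketch that plugs example values into these three fits; the constants are the ones quoted above, while the function names and example inputs are made up for illustration:

```python
# Illustrative evaluation of the three scaling-law fits quoted above.
# Units follow the notes: N in parameters, D in tokens, C_min in PF-days.

def loss_from_params(n: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """L(N) = (N_c / N)^alpha_N -- parameter-limited, trained to convergence."""
    return (n_c / n) ** alpha_n

def loss_from_data(d: float, d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """L(D) = (D_c / D)^alpha_D -- data-limited, with early stopping."""
    return (d_c / d) ** alpha_d

def loss_from_compute(c_min: float, c_c: float = 3.1e8, alpha_c: float = 0.050) -> float:
    """L(C_min) = (C_c_min / C_min)^alpha_C_min -- compute-limited, optimally sized model."""
    return (c_c / c_min) ** alpha_c

print(loss_from_params(1e9))    # ~1B-parameter model
print(loss_from_data(1e10))     # ~10B training tokens
print(loss_from_compute(1.0))   # 1 PF-day of compute
```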

These laws hold over eight orders of magnitude of compute ($C_{\min}$), six orders of magnitude of parameters ($N$), and two orders of magnitude of dataset size in tokens ($D$).

They don't depend strongly on the specific architecture or other hyperparameters.

The exponents $\alpha_X$, where $X \in \{N, D, C\}$, denote the degree of improvement as the corresponding quantity is scaled up.

The $N$ and $D$ laws can be combined into a single scaling law (see Eq. 1.5 in the paper).
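For reference, my recollection of that combined form (the paper's Eq. 1.5); worth double-checking against the paper itself:

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D}\right]^{\alpha_D}$$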

The critical batch size, which defines the tradeoff between speed and efficiency for data parallelism, obeys a power law in $L$:

$$B_{\text{crit}}(L) = \frac{B_*}{L^{1/\alpha_B}}, \qquad B_* \sim 2 \times 10^8 \text{ tokens}, \quad \alpha_B \sim 0.21$$

$B_{\text{crit}}$ diverges to infinity as $L \rightarrow 0$, i.e. as the neural network approaches the theoretical optimum.

Intuitively, with a smaller batch size your gradients are averaged over a smaller set of samples; given that the loss surface gets more "bumpy" the closer you are to the minimum, it only makes sense to increase the batch size as $L \rightarrow 0$.
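A final numerical sketch of this batch-size law; the loss values are made up and the constants are the ones quoted above:

```python
# Illustrative: critical batch size B_crit(L) = B_* / L^(1/alpha_B),
# with the constants quoted above (B_* ~ 2e8 tokens, alpha_B ~ 0.21).
# As the loss L falls toward 0, B_crit blows up.

B_STAR = 2e8       # tokens
ALPHA_B = 0.21

def critical_batch_size(loss: float) -> float:
    """Critical batch size (in tokens) at a given loss value."""
    return B_STAR / loss ** (1.0 / ALPHA_B)

for loss in (4.0, 3.0, 2.0, 1.5):
    print(f"L = {loss:.1f} -> B_crit ~ {critical_batch_size(loss):.2e} tokens")
```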