Quantization (INCOMPLETE)

Juan Vera

July 2025

Abstract

Mechanics of Quantization

Layer-Wise Quantization

The idea of layer-wise quantization is to find, for each layer $\ell$, a quantized weight matrix $\widehat{W}_\ell$ that optimizes the objective

$$\argmin_{\widehat{W}_{\ell}} \|W_\ell X_\ell - \widehat{W}_\ell X_{\ell}\|_2^2$$

or in other words, finds the set of quantized weights that minimizes the squared difference between the output of layer $\ell$ and its quantized counterpart.
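For intuition, here is a minimal NumPy sketch (the helper name quantize_rtn, the symmetric grid, and the shapes are illustrative assumptions, not from the text) that quantizes a layer's weights with plain round-to-nearest and evaluates the layer-wise objective on a batch of calibration inputs; the methods discussed below aim to do better than this baseline under the same objective.

```python
import numpy as np

def quantize_rtn(W, n_bits=4):
    """Symmetric, per-tensor round-to-nearest quantization (illustrative)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max() / q_max
    W_int = np.clip(np.round(W / scale), -q_max, q_max)
    return W_int * scale  # de-quantized back to float for comparison

rng = np.random.default_rng(0)
d_out, d_in, m = 8, 16, 32
W = rng.standard_normal((d_out, d_in))   # layer weights W_l
X = rng.standard_normal((d_in, m))       # calibration activations X_l

W_hat = quantize_rtn(W)
layer_error = np.sum((W @ X - W_hat @ X) ** 2)   # the layer-wise objective
print(layer_error)
```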

Optimal Brain Quantization

OBQ starts from the observation that the prior objective can be rewritten as the sum of squared errors over the rows of $W$, as shown below. OBQ then handles each row $w$ independently, quantizing one weight at a time while updating the remaining not-yet-quantized weights in that row to compensate for the quantization error.
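Explicitly, writing $W_{i,:}$ for the $i$-th row of $W_\ell$, the objective separates row by row,

$$\|W_\ell X_\ell - \widehat{W}_\ell X_\ell\|_2^2 = \sum_{i} \|W_{i,:} X_\ell - \widehat{W}_{i,:} X_\ell\|_2^2$$

so each row $w = W_{i,:}$ can be quantized as its own independent problem, with all rows sharing the same inputs $X_\ell$.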

Recall that the Hessian is a matrix of second-order partial derivatives of a scalar-valued function with respect to a vector of variables.

Suppose you have a scalar function $f(x_1, x_2, \dots, x_n)$.

The gradient vector of $f$ is the vector of first-order partial derivatives,

$$\nabla f = \left\langle \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right\rangle \in \mathbb{R}^n$$

The Hessian is the matrix of second-order partial derivatives of $f$,

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \dots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

where $H \in \mathbb{R}^{n \times n}$, since for every $\frac{\partial f}{\partial x_i} \in \nabla f$ we have $n$ second-order partial derivatives.
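As a quick concrete example (added here for illustration), take $f(x_1, x_2) = x_1^2 x_2$; then

$$\nabla f = \left\langle 2x_1 x_2,\; x_1^2 \right\rangle, \hspace{3mm} H = \begin{bmatrix} 2x_2 & 2x_1 \\ 2x_1 & 0 \end{bmatrix}$$

and, as expected, $H$ is symmetric since the mixed partials agree.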

The Hessian of the squared-residual objective is $H_F = 2X_FX_F^\top$, where $F$ denotes the set of remaining full-precision weights.
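As a sanity check (an illustrative sketch, not part of the original derivation), the Hessian of the row-wise objective $E(\hat{w}) = \|wX - \hat{w}X\|_2^2$ with respect to $\hat{w}$ can be verified numerically to equal $2XX^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, m = 4, 16
X = rng.standard_normal((d_in, m))
w = rng.standard_normal(d_in)

def E(w_hat):
    """Row-wise squared residual ||wX - w_hat X||^2."""
    return np.sum((w @ X - w_hat @ X) ** 2)

# Central finite differences for the Hessian at an arbitrary point.
w0 = rng.standard_normal(d_in)
eps = 1e-2
I = np.eye(d_in)
H_fd = np.zeros((d_in, d_in))
for i in range(d_in):
    for j in range(d_in):
        H_fd[i, j] = (E(w0 + eps * I[i] + eps * I[j]) - E(w0 + eps * I[i] - eps * I[j])
                      - E(w0 - eps * I[i] + eps * I[j]) + E(w0 - eps * I[i] - eps * I[j])) / (4 * eps ** 2)

print(np.allclose(H_fd, 2 * X @ X.T))  # True: the objective is quadratic in w_hat
```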

Then the greedy-optimal weight to quantize next in a given row, denoted $w_q$, and the optimal update of all other weights in $F$, denoted $\delta_F$, are given by the following, where $\text{quant}(w)$ rounds $w$ to the nearest value on the quantization grid.

$$w_q = \argmin_{w_q} \frac{(\text{quant}(w_q) - w_q)^2}{[H_F^{-1}]_{qq}}, \hspace{3mm} \delta_F = - \frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:, q}$$

The denominator, $[H_F^{-1}]_{qq}$, is the $(q, q)$ entry of the inverse Hessian over the full-precision weights, i.e., the diagonal entry corresponding to $w_q$ within a given $W_\ell$. We normalize by this term because it captures the curvature of the objective around $w_q$, which effectively acts as a form of adaptive normalization.
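A minimal sketch of one such greedy step on a single row (the helper names quant and obq_step, and the symmetric grid, are assumptions for illustration). Here H_inv is the inverse Hessian over the still-free weights, kept at full size so that H_inv[q, q] plays the role of $[H_F^{-1}]_{qq}$; it must be downdated after each step, which is exactly the row/column-removal update described next.

```python
import numpy as np

def quant(w, scale, q_max=7):
    """Round to the nearest point of a symmetric (2*q_max + 1)-level grid (illustrative)."""
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

def obq_step(w, H_inv, free, scale):
    """One greedy OBQ step on row w (modified in place).

    H_inv: inverse Hessian over the still-free weights, stored full-size;
    free:  boolean mask of not-yet-quantized positions.
    Returns the index q that was quantized; H_inv must be downdated afterwards.
    """
    idx = np.where(free)[0]
    # Loss increase of quantizing each remaining weight: (quant(w_q) - w_q)^2 / [H_F^{-1}]_{qq}
    errs = (quant(w[idx], scale) - w[idx]) ** 2 / np.diag(H_inv)[idx]
    q = idx[np.argmin(errs)]                 # greedy-optimal weight to quantize next
    wq, wq_quant = w[q], quant(w[q], scale)
    # delta_F: compensating update of the remaining weights; at index q it lands on wq_quant
    delta = -(wq - wq_quant) / H_inv[q, q] * H_inv[:, q]
    w[idx] += delta[idx]
    w[q] = wq_quant                          # pin exactly to the grid value
    free[q] = False
    return q
```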

Each time we quantize a given $w_q$, we must update the Hessian (and its inverse) over the remaining full-precision weights by removing the contribution of $w_q$, since that weight is now fixed at $\text{quant}(w_q)$ and is no longer a free variable. This needs to be done because the Hessian carries second-order information about the objective over the remaining weights, and that information changes once $w_q$ leaves the set $F$.

We can do so by removing the $q$th row and column of $H$ through the following update, applied directly to the inverse,

$$H_{-q}^{-1} = \left(H^{-1} - \frac{1}{[H^{-1}]_{qq}}H^{-1}_{:, q}H^{-1}_{q, :}\right)_{-q}$$

We compute the updated inverse Hessian through this equation rather than naively. Slicing $H^{-1}$ directly does not work: each entry of the inverse depends on the full $n \times n$ matrix, so the $(n-1) \times (n-1)$ submatrix of $H^{-1}$ is not the inverse of the reduced Hessian, due to the coupling terms between the variables in the full matrix. Slicing $H$ itself and re-inverting from scratch at every step would give the right answer, but is far more expensive than the rank-one update above.
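To see this concretely, here is a small numerical check (illustrative, using NumPy) that the rank-one downdate reproduces the inverse of the reduced Hessian, whereas simply deleting the $q$th row and column of $H^{-1}$ does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 64
X = rng.standard_normal((n, m))
H = 2 * X @ X.T                 # Hessian of the squared-residual objective
H_inv = np.linalg.inv(H)
q = 2

# Rank-one downdate of the inverse, then drop the (now zeroed) q-th row and column.
H_inv_down = H_inv - np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
H_inv_down = np.delete(np.delete(H_inv_down, q, axis=0), q, axis=1)

# Reference: remove the q-th row and column of H itself, then invert from scratch.
H_reduced = np.delete(np.delete(H, q, axis=0), q, axis=1)
print(np.allclose(H_inv_down, np.linalg.inv(H_reduced)))       # True

# Naively slicing the inverse instead does NOT match, because of the coupling terms.
H_inv_sliced = np.delete(np.delete(H_inv, q, axis=0), q, axis=1)
print(np.allclose(H_inv_sliced, np.linalg.inv(H_reduced)))     # False
```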