Quantization (INCOMPLETE)

Juan Vera

July 2025

Abstract

Mechanics of Quantization

Layer-Wise Quantization

The idea of layer-wise quantization is to find, for each layer $\ell$, a quantized weight matrix $\widehat{W}_\ell$ that optimizes the objective

$$\argmin_{\widehat{W}_{\ell}} \|W_\ell X_\ell - \widehat{W}_\ell X_{\ell}\|_2^2$$

or in other words, finds the set of quantized weights that minimizes the squared difference between the output of layer $\ell$ and its quantized counterpart.
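For intuition, here is a minimal NumPy sketch (the helper name quantize_rtn, the symmetric grid, and the shapes are illustrative assumptions, not from the text) that quantizes a layer's weights with plain round-to-nearest and evaluates the layer-wise objective on a batch of calibration inputs; the methods discussed below aim to do better than this baseline under the same objective.

```python
import numpy as np

def quantize_rtn(W, n_bits=4):
    """Symmetric, per-tensor round-to-nearest quantization (illustrative)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max() / q_max
    W_int = np.clip(np.round(W / scale), -q_max, q_max)
    return W_int * scale  # de-quantized back to float for comparison

rng = np.random.default_rng(0)
d_out, d_in, m = 8, 16, 32
W = rng.standard_normal((d_out, d_in))   # layer weights W_l
X = rng.standard_normal((d_in, m))       # calibration activations X_l

W_hat = quantize_rtn(W)
layer_error = np.sum((W @ X - W_hat @ X) ** 2)   # the layer-wise objective
print(layer_error)
```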

Optimal Brain Quantization

OBQ starts from the observation that the prior objective can be rewritten as the sum of squared errors over the rows of $W$, as shown below. OBQ then handles each row $w$ independently, quantizing one weight at a time while updating the remaining not-yet-quantized weights in that row to compensate for the quantization error.
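Explicitly, writing $W_{i,:}$ for the $i$-th row of $W_\ell$, the objective separates row by row,

$$\|W_\ell X_\ell - \widehat{W}_\ell X_\ell\|_2^2 = \sum_{i} \|W_{i,:} X_\ell - \widehat{W}_{i,:} X_\ell\|_2^2$$

so each row $w = W_{i,:}$ can be quantized as its own independent problem, with all rows sharing the same inputs $X_\ell$.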

Recall that the Hessian is a matrix of second-order partial derivatives of a scalar-valued function with respect to a vector of variables.

Suppose you have a scalar function $f(x_1, x_2, \dots, x_n)$.

The gradient vector of $f$ is the vector of first-order partial derivatives,

$$\nabla f = \left\langle \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right\rangle \in \mathbb{R}^n$$

The Hessian is the matrix of second-order partial derivatives of $f$,

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \dots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

where $H \in \mathbb{R}^{n \times n}$, since for every $\frac{\partial f}{\partial x_i} \in \nabla f$ we have $n$ second-order partial derivatives.
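As a quick concrete example (added here for illustration), take $f(x_1, x_2) = x_1^2 x_2$; then

$$\nabla f = \left\langle 2x_1 x_2,\; x_1^2 \right\rangle, \hspace{3mm} H = \begin{bmatrix} 2x_2 & 2x_1 \\ 2x_1 & 0 \end{bmatrix}$$

and, as expected, $H$ is symmetric since the mixed partials agree.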

The Hessian of the squared-residual objective is $H_F = 2X_FX_F^\top$, where $F$ denotes the set of remaining full-precision weights.
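As a sanity check (an illustrative sketch, not part of the original derivation), the Hessian of the row-wise objective $E(\hat{w}) = \|wX - \hat{w}X\|_2^2$ with respect to $\hat{w}$ can be verified numerically to equal $2XX^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, m = 4, 16
X = rng.standard_normal((d_in, m))
w = rng.standard_normal(d_in)

def E(w_hat):
    """Row-wise squared residual ||wX - w_hat X||^2."""
    return np.sum((w @ X - w_hat @ X) ** 2)

# Central finite differences for the Hessian at an arbitrary point.
w0 = rng.standard_normal(d_in)
eps = 1e-2
I = np.eye(d_in)
H_fd = np.zeros((d_in, d_in))
for i in range(d_in):
    for j in range(d_in):
        H_fd[i, j] = (E(w0 + eps * I[i] + eps * I[j]) - E(w0 + eps * I[i] - eps * I[j])
                      - E(w0 - eps * I[i] + eps * I[j]) + E(w0 - eps * I[i] - eps * I[j])) / (4 * eps ** 2)

print(np.allclose(H_fd, 2 * X @ X.T))  # True: the objective is quadratic in w_hat
```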

Then the greedy-optimal weight to quantize next in a given row, denoted $w_q$, and the optimal update of all other weights in $F$, denoted $\delta_F$, are given by the following, where $\text{quant}(w)$ rounds $w$ to the nearest value on the quantization grid.

$$w_q = \argmin_{w_q} \frac{(\text{quant}(w_q) - w_q)^2}{[H_F^{-1}]_{qq}}, \hspace{3mm} \delta_F = - \frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:, q}$$

The denominator, $[H_F^{-1}]_{qq}$, is the $(q, q)$ entry of the inverse Hessian over the full-precision weights, i.e., the diagonal entry corresponding to $w_q$ within a given $W_\ell$. We normalize by this term because it captures the curvature of the objective around $w_q$, which effectively acts as a form of adaptive normalization.
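A minimal sketch of one such greedy step on a single row (the helper names quant and obq_step, and the symmetric grid, are assumptions for illustration). Here H_inv is the inverse Hessian over the still-free weights, kept at full size so that H_inv[q, q] plays the role of $[H_F^{-1}]_{qq}$; it must be downdated after each step, which is exactly the row/column-removal update described next.

```python
import numpy as np

def quant(w, scale, q_max=7):
    """Round to the nearest point of a symmetric (2*q_max + 1)-level grid (illustrative)."""
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

def obq_step(w, H_inv, free, scale):
    """One greedy OBQ step on row w (modified in place).

    H_inv: inverse Hessian over the still-free weights, stored full-size;
    free:  boolean mask of not-yet-quantized positions.
    Returns the index q that was quantized; H_inv must be downdated afterwards.
    """
    idx = np.where(free)[0]
    # Loss increase of quantizing each remaining weight: (quant(w_q) - w_q)^2 / [H_F^{-1}]_{qq}
    errs = (quant(w[idx], scale) - w[idx]) ** 2 / np.diag(H_inv)[idx]
    q = idx[np.argmin(errs)]                 # greedy-optimal weight to quantize next
    wq, wq_quant = w[q], quant(w[q], scale)
    # delta_F: compensating update of the remaining weights; at index q it lands on wq_quant
    delta = -(wq - wq_quant) / H_inv[q, q] * H_inv[:, q]
    w[idx] += delta[idx]
    w[q] = wq_quant                          # pin exactly to the grid value
    free[q] = False
    return q
```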

Each time we quantize a given $w_q$, we must update the Hessian (and its inverse) over the remaining full-precision weights by removing the contribution of $w_q$, since that weight is now fixed at $\text{quant}(w_q)$ and is no longer a free variable. This needs to be done because the Hessian carries second-order information about the objective over the remaining weights, and that information changes once $w_q$ leaves the set $F$.

We can do so by removing the $q$th row and column of $H$ through the following update, applied directly to the inverse,

$$H_{-q}^{-1} = \left(H^{-1} - \frac{1}{[H^{-1}]_{qq}}H^{-1}_{:, q}H^{-1}_{q, :}\right)_{-q}$$

We compute the updated inverse Hessian through this equation rather than naively. Slicing $H^{-1}$ directly does not work: each entry of the inverse depends on the full $n \times n$ matrix, so the $(n-1) \times (n-1)$ submatrix of $H^{-1}$ is not the inverse of the reduced Hessian, due to the coupling terms between the variables in the full matrix. Slicing $H$ itself and re-inverting from scratch at every step would give the right answer, but is far more expensive than the rank-one update above.
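To see this concretely, here is a small numerical check (illustrative, using NumPy) that the rank-one downdate reproduces the inverse of the reduced Hessian, whereas simply deleting the $q$th row and column of $H^{-1}$ does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 64
X = rng.standard_normal((n, m))
H = 2 * X @ X.T                 # Hessian of the squared-residual objective
H_inv = np.linalg.inv(H)
q = 2

# Rank-one downdate of the inverse, then drop the (now zeroed) q-th row and column.
H_inv_down = H_inv - np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
H_inv_down = np.delete(np.delete(H_inv_down, q, axis=0), q, axis=1)

# Reference: remove the q-th row and column of H itself, then invert from scratch.
H_reduced = np.delete(np.delete(H, q, axis=0), q, axis=1)
print(np.allclose(H_inv_down, np.linalg.inv(H_reduced)))       # True

# Naively slicing the inverse instead does NOT match, because of the coupling terms.
H_inv_sliced = np.delete(np.delete(H_inv, q, axis=0), q, axis=1)
print(np.allclose(H_inv_sliced, np.linalg.inv(H_reduced)))     # False
```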