Quantization (INCOMPLETE)
Juan Vera
July 2025
Abstract
Mechanics of Quantization
Layer-Wise Quantization
The idea of layer-wise quantization is to find a quantized weight matrix $\hat{\mathbf{W}}$ for a layer $\ell$, specifically a $\hat{\mathbf{W}}$ that optimizes the objective

$$\hat{\mathbf{W}} = \arg\min_{\hat{\mathbf{W}}} \left\| \mathbf{W}\mathbf{X} - \hat{\mathbf{W}}\mathbf{X} \right\|_2^2,$$

or in other words, finds the set of quantized weights that minimizes the squared difference between the output of layer $\ell$ and its quantized output, where $\mathbf{X}$ denotes the layer's inputs.
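To make the objective concrete, here is a minimal sketch that measures this layer-wise error for a naive round-to-nearest quantizer. The layer sizes, the 4-bit uniform grid, and the helper name `quantize_rtn` are illustrative assumptions, not part of any particular method.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))   # full-precision weights (d_out x d_in), illustrative sizes
X = rng.standard_normal((128, 256))  # calibration inputs (d_in x n_samples)

def quantize_rtn(w, n_bits=4):
    """Naive symmetric round-to-nearest quantization onto a uniform grid."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale

W_hat = quantize_rtn(W)

# Layer-wise objective: squared difference between the layer's output
# and the output produced by the quantized weights.
error = np.linalg.norm(W @ X - W_hat @ X) ** 2
print(f"layer-wise squared error: {error:.4f}")
```

Methods like OBQ aim to find a $\hat{\mathbf{W}}$ with a much smaller value of this error than plain rounding achieves.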
Optimal Brain Quantization
OBQ starts from the observation that the prior objective can be rewritten as a sum of squared errors, one for each row of $\mathbf{W}$. OBQ then handles each row independently, quantizing one weight at a time while still updating the not-yet-quantized weights in that row, as sketched below.
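Written out with the notation from the layer-wise objective above, the decomposition that licenses this row-by-row treatment is

$$\left\| \mathbf{W}\mathbf{X} - \hat{\mathbf{W}}\mathbf{X} \right\|_2^2 = \sum_{i} \left\| \mathbf{W}_{i,:}\mathbf{X} - \hat{\mathbf{W}}_{i,:}\mathbf{X} \right\|_2^2,$$

so the error contributed by row $i$ does not depend on how any other row is quantized, and each row can be processed independently.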
Recall that the Hessian is a matrix of second-order partial derivatives of a scalar-valued function with respect to a vector of variables.
Suppose you have a scalar function $f : \mathbb{R}^n \to \mathbb{R}$ of a vector $\mathbf{x} = (x_1, \dots, x_n)$.
The gradient of $f$ is the vector of first-order partial derivatives,

$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]^\top.$$

The Hessian is the matrix of second-order partial derivatives of $f$,

$$\mathbf{H}(f) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix},$$

where $\mathbf{H} \in \mathbb{R}^{n \times n}$, as for every variable $x_i$ we have $n$ second-order partial derivatives ($n^2$ in total).
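As a small worked example (chosen purely for illustration), take $f(x_1, x_2) = x_1^2 + 3x_1x_2 + 2x_2^2$. Then

$$\nabla f(\mathbf{x}) = \begin{bmatrix} 2x_1 + 3x_2 \\ 3x_1 + 4x_2 \end{bmatrix}, \qquad \mathbf{H} = \begin{bmatrix} 2 & 3 \\ 3 & 4 \end{bmatrix},$$

and, as expected for a twice continuously differentiable $f$, the Hessian is symmetric.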
The Hessian of the squared-residual objective is $\mathbf{H}_F = 2\mathbf{X}_F\mathbf{X}_F^\top$, where $F$ denotes the set of weights that still remain in full precision and $\mathbf{X}_F$ the corresponding layer inputs.
Then, the greedy-optimal weight to quantize next in a given row of weights, denoted by $w_q$, and the optimal update of all other full-precision weights in $F$, denoted by $\boldsymbol{\delta}_F$, are given by the following, where $\mathrm{quant}(\cdot)$ rounds to the nearest value on a quantization grid:

$$w_q = \arg\min_{w_q} \frac{\left(\mathrm{quant}(w_q) - w_q\right)^2}{[\mathbf{H}_F^{-1}]_{qq}}, \qquad \boldsymbol{\delta}_F = -\frac{w_q - \mathrm{quant}(w_q)}{[\mathbf{H}_F^{-1}]_{qq}} \cdot (\mathbf{H}_F^{-1})_{:,q}.$$
The denominator, $[\mathbf{H}_F^{-1}]_{qq}$, is the diagonal entry of the inverse full-precision Hessian at index $q$, i.e., the entry corresponding to the candidate weight $w_q$. We normalize by this term because it captures the curvature of the objective function at $w_q$, which effectively turns the squared quantization error into a form of adaptive, curvature-aware normalization.
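The following is a minimal sketch of one such greedy step on a single row, assuming calibration inputs $\mathbf{X}$ and a simple uniform grid with a fixed step size; the helper names (`quantize_rtn`, `obq_step`) and the toy sizes are illustrative, not part of any library.

```python
import numpy as np

def quantize_rtn(w, scale):
    """Round a scalar (or array) to the nearest point on a uniform grid with step `scale`."""
    return np.round(w / scale) * scale

def obq_step(w, H_inv, quantized_mask, scale):
    """One greedy OBQ-style step on a single row `w` (modified in place).

    Picks the not-yet-quantized weight with the smallest curvature-normalized
    quantization error, fixes it to its grid value, and updates the remaining
    full-precision weights to compensate.
    """
    q_err = (quantize_rtn(w, scale) - w) ** 2 / np.diag(H_inv)
    q_err[quantized_mask] = np.inf           # only consider remaining full-precision weights
    q = int(np.argmin(q_err))                # greedy-optimal index to quantize next

    w_q_old = w[q]
    w[q] = quantize_rtn(w_q_old, scale)      # fix w_q on the grid
    quantized_mask[q] = True

    # delta_F = -(w_q - quant(w_q)) / [H^-1]_qq * (H^-1)[:, q]
    delta = -(w_q_old - w[q]) / H_inv[q, q] * H_inv[:, q]
    delta[quantized_mask] = 0.0              # already-quantized weights stay fixed
    w += delta
    return q

# Toy usage: one row of weights, Hessian H = 2 X X^T from calibration inputs.
rng = np.random.default_rng(0)
d_in, n_samples = 16, 64
X = rng.standard_normal((d_in, n_samples))
w = rng.standard_normal(d_in)
H_inv = np.linalg.inv(2.0 * X @ X.T)
obq_step(w, H_inv, np.zeros(d_in, dtype=bool), scale=0.1)
```

In the full algorithm, $\mathbf{H}_F^{-1}$ would also be updated after each step, as described next; this sketch leaves it fixed for brevity.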
Each time we quantize a given $w_q$, we update the full-precision Hessian $\mathbf{H}_F$ (and its inverse) by removing the contribution of $w_q$, since the quantized weight is now fixed and no longer belongs to $F$. This needs to be done because the Hessian contains second-order information about the objective over the remaining full-precision weights, and once $w_q$ is fixed that information changes, so we need to update the Hessian accordingly.
We can do so by removing the $q$th row and column of $\mathbf{H}^{-1}$ through a generalized update equation,

$$\mathbf{H}_{-q}^{-1} = \left( \mathbf{H}^{-1} - \frac{1}{[\mathbf{H}^{-1}]_{qq}} \, \mathbf{H}^{-1}_{:,q} \mathbf{H}^{-1}_{q,:} \right)_{-q},$$

where $(\cdot)_{-q}$ denotes dropping the $q$th row and column.
We compute the reduced inverse Hessian through this equation rather than by naively slicing the existing inverse, because the inverse Hessian depends on the full matrix: due to coupling terms between the variables, simply dropping the $q$th row and column of $\mathbf{H}^{-1}$ does not yield the inverse of the reduced Hessian. (Re-inverting the sliced Hessian from scratch would be correct but needlessly expensive; this update reuses the already-computed inverse.)
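A quick numerical check of this point (a throwaway sketch; the matrix and the removed index are arbitrary): the rank-one downdate reproduces the inverse of the reduced Hessian exactly, whereas simply deleting the $q$th row and column of $\mathbf{H}^{-1}$ does not.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = A @ A.T + 6 * np.eye(6)         # a symmetric positive-definite stand-in "Hessian"
H_inv = np.linalg.inv(H)
q = 2                                # arbitrary index to remove
keep = [i for i in range(6) if i != q]

# Ground truth: remove row/column q from H, then invert the reduced matrix.
H_reduced_inv = np.linalg.inv(H[np.ix_(keep, keep)])

# Downdate: subtract the rank-one coupling term, then drop row/column q.
downdate = H_inv - np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
downdate = downdate[np.ix_(keep, keep)]

# Naive (incorrect) approach: just delete row/column q of the inverse.
naive = H_inv[np.ix_(keep, keep)]

print(np.allclose(downdate, H_reduced_inv))   # True: the update is exact
print(np.allclose(naive, H_reduced_inv))      # False: coupling terms are lost
```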