where $l$ is the sequence length, $t$ is the index of the current token, and $j$ indexes the tokens up to $t$.
Assuming $q, v_t, k_t \in \mathbb{R}^{d_{\text{model}}}$, where $d_{\text{model}}$ is the dimensionality of the attention space.
The dot product $qk_t^\top$ is a similarity score, since $qk_t^\top = \|q\|\,\|k_t\|\cos(\theta)$: the more similar in direction $q$ and $k_t$ are, the larger $\cos(\theta)$ will be.
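To make that concrete, here's a quick sketch (the vectors below are made up purely for illustration): a key pointing in roughly the same direction as the query yields a larger dot product and a cosine similarity near 1.

```python
import torch
import torch.nn.functional as F

q = torch.tensor([1.0, 2.0, 0.5])
k_similar = torch.tensor([0.9, 2.1, 0.4])       # roughly the same direction as q
k_dissimilar = torch.tensor([-1.0, 0.1, -2.0])  # points elsewhere

print(torch.dot(q, k_similar))      # large positive score
print(torch.dot(q, k_dissimilar))   # negative score
print(F.cosine_similarity(q, k_similar, dim=0))     # ~0.99
print(F.cosine_similarity(q, k_dissimilar, dim=0))  # ~-0.35
```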
The normalization by $\sqrt{d_{\text{model}}}$ is there to avoid saturation in the $\text{softmax}(\cdot)$. Without it, the raw scores $qk_j^\top$ can differ wildly in magnitude, so among the attention weights

$$\alpha_t = \frac{\exp(qk_t^\top)}{\sum_{j=1}^{l}\exp(qk_j^\top)}$$

one $\alpha_t$ ends up extremely large relative to the others; the corresponding $v_t$ then dominates the output, and the model attends to it far more than to the values with small weights.
While this behavior is desirable to some degree, the normalization yields a much more evenly distributed attention-score matrix, where surrounding tokens play a larger role in the next-token prediction than they would without it.
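As a rough illustration (the numbers here are arbitrary), compare the softmax of the same scores with and without dividing by $\sqrt{d_{\text{model}}}$:

```python
import torch
import torch.nn.functional as F

d_model = 256
scores = torch.randn(10) * d_model ** 0.5  # raw q·k scores grow with dimensionality

print(F.softmax(scores, dim=0))                   # nearly one-hot: the softmax saturates
print(F.softmax(scores / d_model ** 0.5, dim=0))  # much smoother distribution
```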
After computing all $l$ weights $\alpha_t$ via the softmax, we multiply each $v_t$ by its weight and sum, to get $\hat{v}_t$.
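In loop form (a sketch using random tensors), that weighted sum looks like this; the matrix form below collapses it into a single matmul.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 10, 256
q = torch.randn(d_model)
K = torch.randn(seq_len, d_model)
V = torch.randn(seq_len, d_model)

# one attention weight per token in the sequence
alpha = F.softmax(K @ q / d_model ** 0.5, dim=0)      # (seq_len,)

# weighted sum: scale each value vector by its weight, then add them up
v_hat = sum(alpha[j] * V[j] for j in range(seq_len))  # (d_model,)
```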
We can define this same operation as a matrix multiplication:
$$\text{Attention}(q, K, V) = \text{softmax}\!\left(\frac{qK^\top}{\sqrt{d_{\text{model}}}}\right)V \tag{2}$$
where
$K, V \in \mathbb{R}^{l \times d_{\text{model}}}$, with $l$ the sequence length and $d_{\text{model}}$ the dimensionality of the attention space,

$\alpha = \text{softmax}\!\left(\frac{qK^\top}{\sqrt{d_{\text{model}}}}\right) \in \mathbb{R}^{l}$,

$q \in \mathbb{R}^{d_{\text{model}}}$.
I won't waste my time trying to write mathematical notation for this, but essentially, you can define it as:
import torch
import torch.nn.functional as F

seq_len = 10
d_model = 256  # a common embedding size

q = torch.randn(size=(d_model,))
K = torch.randn(size=(seq_len, d_model))
V = torch.randn(size=(seq_len, d_model))

# scale by sqrt(d_model) as in (2), and softmax over the sequence dimension
attn_probs = F.softmax(torch.matmul(q, K.transpose(0, 1)) / d_model ** 0.5, dim=-1)
print(attn_probs.shape)  # torch.Size([10]) -- one weight per token
We can simply matmul $\alpha$ and $V$, as the operation is equivalent to the summation over all $l$ products $\alpha_t v_t$: when we multiply $\alpha$ with the $j$th column of $V$, we're equivalently computing the multiply-then-sum in (1), to get the output $\hat{v}_t$, the vector which captures how much "attention" the model should pay to the $t$th token.
If you ran:

out = torch.matmul(attn_probs, V)
print(out.shape)  # torch.Size([256]), i.e. d_model

you'd get `out` as a vector in $\mathbb{R}^{d_{\text{model}}}$.
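If you want to convince yourself that the matmul really is that weighted sum, here's a quick check (reusing `attn_probs`, `V`, and `out` from above):

```python
# loop version: scale each value vector by its attention weight, then sum
out_loop = sum(attn_probs[j] * V[j] for j in range(seq_len))

print(torch.allclose(out, out_loop, atol=1e-5))  # True
```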
Of course, to compute all attention scores, and correspondingly the full result $\hat{V} \in \mathbb{R}^{l \times d_{\text{model}}}$, we can define $q$ as a matrix as well:
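For instance, continuing with the same shapes as before (a sketch, with $Q$ simply drawn at random), stacking all $l$ queries into $Q \in \mathbb{R}^{l \times d_{\text{model}}}$ gives the full attention matrix and the full output in one shot:

```python
Q = torch.randn(seq_len, d_model)

# A[i, j] is how much the i-th query attends to the j-th key
A = F.softmax(Q @ K.transpose(0, 1) / d_model ** 0.5, dim=-1)  # (seq_len, seq_len)

V_hat = A @ V           # (seq_len, d_model): one attended output per token
print(A.shape, V_hat.shape)
```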
You can see that, for a given $q$, you get a vector of attention scores $\alpha$ for that $q$ with respect to all rows $k_i \in K$ (or all columns of its transpose).
Given that during autoregressive generation you only need to predict the next token, the attention scores you need form just the vector $\alpha$ rather than the full matrix $A \in \mathbb{R}^{l \times l}$, so it's redundant to cache $Q$ at all, when all you really need is the current $q$ to compute the scaled dot product.
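A minimal sketch of that idea (single head, no masking, and the projection matrices `W_q`, `W_k`, `W_v` are hypothetical placeholders): at each decode step only the newest token's query is computed, while its key and value are appended to a cache and reused on every later step.

```python
import torch
import torch.nn.functional as F

d_model = 256
W_q = torch.randn(d_model, d_model)  # hypothetical projection weights
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

K_cache, V_cache = [], []  # grows by one row per generated token

def decode_step(x_t):
    """x_t: (d_model,) embedding of the newest token."""
    q = x_t @ W_q                   # only the current query is ever needed
    K_cache.append(x_t @ W_k)       # cache this token's key...
    V_cache.append(x_t @ W_v)       # ...and value for future steps
    K = torch.stack(K_cache)        # (t, d_model)
    V = torch.stack(V_cache)        # (t, d_model)
    alpha = F.softmax(q @ K.T / d_model ** 0.5, dim=-1)  # (t,)
    return alpha @ V                # (d_model,) attended output for this step

for _ in range(5):
    out = decode_step(torch.randn(d_model))
print(out.shape)  # torch.Size([256])
```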