where $l$ is the sequence length, $t$ is the index of the current token, and $j$ indexes the tokens up to $t$.
Assuming $q, v_t, k_t \in \mathbb{R}^{d_{\text{model}}}$, where $d_{\text{model}}$ is the dimensionality of the attention space.
The dot product $qk_t^\top$ is a similarity score, since $qk_t^\top = \|q\|\,\|k_t\|\cos(\theta)$: the more similar in direction $q$ and $k_t$ are, the larger $\cos(\theta)$ will be.
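To make that concrete, here's a quick sketch (the vectors below are made up purely for illustration): a key pointing in roughly the same direction as the query yields a larger dot product and a cosine similarity near 1.

```python
import torch
import torch.nn.functional as F

q = torch.tensor([1.0, 2.0, 0.5])
k_similar = torch.tensor([0.9, 2.1, 0.4])       # roughly the same direction as q
k_dissimilar = torch.tensor([-1.0, 0.1, -2.0])  # points elsewhere

print(torch.dot(q, k_similar))      # large positive score
print(torch.dot(q, k_dissimilar))   # negative score
print(F.cosine_similarity(q, k_similar, dim=0))     # ~0.99
print(F.cosine_similarity(q, k_dissimilar, dim=0))  # ~-0.35
```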
The normalization by $\sqrt{d_{\text{model}}}$ is there to avoid saturation in the $\text{softmax}(\cdot)$. Without it, the raw scores $qk_j^\top$ can differ wildly in magnitude, so among the attention weights

$$\alpha_t = \frac{\exp(qk_t^\top)}{\sum_{j=1}^{l}\exp(qk_j^\top)}$$

one $\alpha_t$ ends up extremely large relative to the others; the corresponding $v_t$ then dominates the output, and the model attends to it far more than to the values with small weights.
While this behavior is desirable to some degree, the normalization yields a much more evenly distributed attention-score matrix, where surrounding tokens play a larger role in the next-token prediction than they would without it.
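As a rough illustration (the numbers here are arbitrary), compare the softmax of the same scores with and without dividing by $\sqrt{d_{\text{model}}}$:

```python
import torch
import torch.nn.functional as F

d_model = 256
scores = torch.randn(10) * d_model ** 0.5  # raw q·k scores grow with dimensionality

print(F.softmax(scores, dim=0))                   # nearly one-hot: the softmax saturates
print(F.softmax(scores / d_model ** 0.5, dim=0))  # much smoother distribution
```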
After computing all $l$ weights $\alpha_t$ via the softmax, we multiply each $v_t$ by its weight and sum, to get $\hat{v}_t$.
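In loop form (a sketch using random tensors), that weighted sum looks like this; the matrix form below collapses it into a single matmul.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 10, 256
q = torch.randn(d_model)
K = torch.randn(seq_len, d_model)
V = torch.randn(seq_len, d_model)

# one attention weight per token in the sequence
alpha = F.softmax(K @ q / d_model ** 0.5, dim=0)      # (seq_len,)

# weighted sum: scale each value vector by its weight, then add them up
v_hat = sum(alpha[j] * V[j] for j in range(seq_len))  # (d_model,)
```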
We can define this same operation as a matrix multiplication:
$$\text{Attention}(q, K, V) = \text{softmax}\!\left(\frac{qK^\top}{\sqrt{d_{\text{model}}}}\right)V \tag{2}$$
where
$K, V \in \mathbb{R}^{l \times d_{\text{model}}}$, with $l$ the sequence length and $d_{\text{model}}$ the dimensionality of the attention space,

$\alpha = \text{softmax}\!\left(\frac{qK^\top}{\sqrt{d_{\text{model}}}}\right) \in \mathbb{R}^{l}$,

$q \in \mathbb{R}^{d_{\text{model}}}$.
I won't waste my time trying to write mathematical notation for this, but essentially, you can define it as:
import torch
import torch.nn.functional as F

seq_len = 10
d_model = 256  # a common embedding size

q = torch.randn(size=(d_model,))
K = torch.randn(size=(seq_len, d_model))
V = torch.randn(size=(seq_len, d_model))

# scale by sqrt(d_model) as in (2), and softmax over the sequence dimension
attn_probs = F.softmax(torch.matmul(q, K.transpose(0, 1)) / d_model ** 0.5, dim=-1)
print(attn_probs.shape)  # torch.Size([10]) -- one weight per token
We can simply matmul $\alpha$ and $V$, as the operation is equivalent to the summation over all $l$ products $\alpha_t v_t$: when we multiply $\alpha$ with the $j$th column of $V$, we're equivalently computing the multiply-then-sum in (1), to get the output $\hat{v}_t$, the vector which captures how much "attention" the model should pay to the $t$th token.
If you ran:

out = torch.matmul(attn_probs, V)
print(out.shape)  # torch.Size([256]), i.e. d_model

you'd get `out` as a vector in $\mathbb{R}^{d_{\text{model}}}$.
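If you want to convince yourself that the matmul really is that weighted sum, here's a quick check (reusing `attn_probs`, `V`, and `out` from above):

```python
# loop version: scale each value vector by its attention weight, then sum
out_loop = sum(attn_probs[j] * V[j] for j in range(seq_len))

print(torch.allclose(out, out_loop, atol=1e-5))  # True
```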
Of course, to compute all attention scores, and correspondingly the full result $\hat{V} \in \mathbb{R}^{l \times d_{\text{model}}}$, we can define $q$ as a matrix as well:
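For instance, continuing with the same shapes as before (a sketch, with $Q$ simply drawn at random), stacking all $l$ queries into $Q \in \mathbb{R}^{l \times d_{\text{model}}}$ gives the full attention matrix and the full output in one shot:

```python
Q = torch.randn(seq_len, d_model)

# A[i, j] is how much the i-th query attends to the j-th key
A = F.softmax(Q @ K.transpose(0, 1) / d_model ** 0.5, dim=-1)  # (seq_len, seq_len)

V_hat = A @ V           # (seq_len, d_model): one attended output per token
print(A.shape, V_hat.shape)
```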
You can see that, for a given $q$, you get a vector of attention scores $\alpha$ for that $q$ with respect to all rows $k_i \in K$ (or all columns of its transpose).
Given that during autoregressive generation you only need to predict the next token, the attention scores you need form just the vector $\alpha$ rather than the full matrix $A \in \mathbb{R}^{l \times l}$, so it's redundant to cache $Q$ at all, when all you really need is the current $q$ to compute the scaled dot product.
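A minimal sketch of that idea (single head, no masking, and the projection matrices `W_q`, `W_k`, `W_v` are hypothetical placeholders): at each decode step only the newest token's query is computed, while its key and value are appended to a cache and reused on every later step.

```python
import torch
import torch.nn.functional as F

d_model = 256
W_q = torch.randn(d_model, d_model)  # hypothetical projection weights
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

K_cache, V_cache = [], []  # grows by one row per generated token

def decode_step(x_t):
    """x_t: (d_model,) embedding of the newest token."""
    q = x_t @ W_q                   # only the current query is ever needed
    K_cache.append(x_t @ W_k)       # cache this token's key...
    V_cache.append(x_t @ W_v)       # ...and value for future steps
    K = torch.stack(K_cache)        # (t, d_model)
    V = torch.stack(V_cache)        # (t, d_model)
    alpha = F.softmax(q @ K.T / d_model ** 0.5, dim=-1)  # (t,)
    return alpha @ V                # (d_model,) attended output for this step

for _ in range(5):
    out = decode_step(torch.randn(d_model))
print(out.shape)  # torch.Size([256])
```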