This is the paper that proposed PagedAttention and crafted the design of vLLM. The authors pointed out that LLMs are autoregressive. One token is generated from the prompt concatenated with the previously generated sequence. The KV cache (described as “incremental multi-head attention” in the GQA paper) is to share the...
[more]
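To make the decoding loop concrete, here is a minimal NumPy sketch of single-head autoregressive decoding with a KV cache (toy weights and names of my own, not anything from the vLLM paper): each step appends the new token's $K,V$ to the cache, and only the newest query attends over it.

```python
import numpy as np

# Minimal sketch of autoregressive decoding with a KV cache.
# All names and shapes are illustrative, not from the vLLM paper.

d = 16                      # hidden size of the toy single-head attention
Wq = np.random.randn(d, d) * 0.1
Wk = np.random.randn(d, d) * 0.1
Wv = np.random.randn(d, d) * 0.1

def decode_step(x, k_cache, v_cache):
    """Attend the newest token's query to all cached keys/values."""
    q = x @ Wq                          # (d,)
    k_cache.append(x @ Wk)              # cache grows by one K per token
    v_cache.append(x @ Wv)              # ... and one V per token
    K = np.stack(k_cache)               # (t, d)
    V = np.stack(v_cache)               # (t, d)
    scores = K @ q / np.sqrt(d)         # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # (d,) context for the newest token

# Prompt tokens are processed once; every generated token reuses the cache.
k_cache, v_cache = [], []
prompt = [np.random.randn(d) for _ in range(4)]
for x in prompt:
    out = decode_step(x, k_cache, v_cache)
for _ in range(3):                      # "generate" 3 more tokens (toy feedback)
    out = decode_step(out, k_cache, v_cache)
print(len(k_cache))                     # cache length = 4 prompt + 3 generated = 7
```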
Ainslie et al. (2023) GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
This is the paper that proposed Grouped-Query Attention (GQA). While MQA speeds up decoder inference by sharing the $K,V$ tensors across attention heads, it is found to degrade model quality. GQA is a generalization of MQA. Just as group norm interpolates between instance norm and layer norm,...
[more]
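A hedged NumPy sketch of the grouped-query idea (my own shapes and names, not the paper's notation): with $h$ query heads and $g$ key/value heads, $g=h$ recovers MHA, $g=1$ recovers MQA, and anything in between is GQA.

```python
import numpy as np

# Grouped-query attention: h query heads share g < h key/value heads.
h, g, t, dk = 8, 2, 5, 16          # query heads, KV groups, seq len, head dim
assert h % g == 0
Q = np.random.randn(h, t, dk)      # one query tensor per head
K = np.random.randn(g, t, dk)      # only g key heads ...
V = np.random.randn(g, t, dk)      # ... and g value heads

# Each group of h // g query heads reads the same K, V.
Kg = np.repeat(K, h // g, axis=0)  # (h, t, dk) view of the shared KV heads
Vg = np.repeat(V, h // g, axis=0)

scores = np.einsum('htd,hsd->hts', Q, Kg) / np.sqrt(dk)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum('hts,hsd->htd', weights, Vg)   # (h, t, dk)
print(out.shape)
```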
Shazeer (2019) Fast Transformer Decoding: One Write-Head is All You Need
This is the paper that proposed Multi-Query Attention (MQA). The author is from Google, and the idea is explained in detail using TensorFlow code.
Firstly, the traditional dot-product attention (single head) is like this:
[more]
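For reference, a NumPy rendition of single-head dot-product attention in the einsum style of the paper's TensorFlow listings (variable names are mine, and the usual $1/\sqrt{d_k}$ scaling is omitted to keep the sketch minimal):

```python
import numpy as np

def dot_product_attention(q, K, V):
    """q: (d_k,) query for one position; K: (m, d_k), V: (m, d_v) memory."""
    logits = np.einsum('d,md->m', q, K)        # score against m memory slots
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                   # softmax over memory positions
    return np.einsum('m,md->d', weights, V)    # weighted sum of values

q = np.random.randn(16)
K = np.random.randn(10, 16)
V = np.random.randn(10, 32)
print(dot_product_attention(q, K, V).shape)    # (32,)
```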
Normalization Zoo
Normalization in deep learning shifts and scales a tensor so that the activations stay in a numerically well-behaved range. This helps with problems such as vanishing/exploding gradients, sensitivity to weight initialization, training stability, and slow convergence.
[more]
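As a quick illustration of the "shift and scale" view, here is a NumPy sketch showing that batch, layer, instance, and group norm differ mainly in which axes the mean and variance are taken over (shapes are illustrative; the learned scale/offset parameters are omitted):

```python
import numpy as np

x = np.random.randn(8, 32, 10, 10)   # (batch N, channels C, height H, width W)

def normalize(x, axes):
    """Shift to zero mean and scale to unit variance over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

batch_norm    = normalize(x, axes=(0, 2, 3))   # per channel, over the batch
layer_norm    = normalize(x, axes=(1, 2, 3))   # per sample, over C, H, W
instance_norm = normalize(x, axes=(2, 3))      # per sample and per channel

# Group norm: split C into G groups, then normalize each group like layer norm.
G = 4
xg = x.reshape(8, G, 32 // G, 10, 10)
group_norm = normalize(xg, axes=(2, 3, 4)).reshape(x.shape)
```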
Black and White
There is no absolute black and white. Humans perceive black when no visible light arrives, and some mixtures of light wavelengths are perceived as white. To measure grayscale, we need to quantify what counts as black and what counts as white.
[more]
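One common way to quantify this, sketched below with the Rec. 709 luminance weights as an assumption, is the relative luminance of linear RGB; the post may use a different measure, so treat this only as an illustration.

```python
import numpy as np

def relative_luminance(rgb_linear):
    """rgb_linear: (..., 3) array in [0, 1], already gamma-decoded."""
    w = np.array([0.2126, 0.7152, 0.0722])   # Rec. 709 weights for R, G, B
    return rgb_linear @ w                    # 0.0 = black, 1.0 = white

print(relative_luminance(np.array([0.0, 0.0, 0.0])))  # 0.0 -> black
print(relative_luminance(np.array([1.0, 1.0, 1.0])))  # 1.0 -> white
```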