This is the paper that proposed PagedAttention and crafted the design of vLLM. The authors pointed out that LLMs are autoregressive. One token is generated from the prompt concatenated with the previously generated sequence. The KV cache (described as “incremental multi-head attention” in the GQA paper) is to share the...
[more]
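To make the decoding loop concrete, here is a minimal NumPy sketch of single-head autoregressive decoding with a KV cache (toy weights and names of my own, not anything from the vLLM paper): each step appends the new token's $K,V$ to the cache, and only the newest query attends over it.

```python
import numpy as np

# Minimal sketch of autoregressive decoding with a KV cache.
# All names and shapes are illustrative, not from the vLLM paper.

d = 16                      # hidden size of the toy single-head attention
Wq = np.random.randn(d, d) * 0.1
Wk = np.random.randn(d, d) * 0.1
Wv = np.random.randn(d, d) * 0.1

def decode_step(x, k_cache, v_cache):
    """Attend the newest token's query to all cached keys/values."""
    q = x @ Wq                          # (d,)
    k_cache.append(x @ Wk)              # cache grows by one K per token
    v_cache.append(x @ Wv)              # ... and one V per token
    K = np.stack(k_cache)               # (t, d)
    V = np.stack(v_cache)               # (t, d)
    scores = K @ q / np.sqrt(d)         # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # (d,) context for the newest token

# Prompt tokens are processed once; every generated token reuses the cache.
k_cache, v_cache = [], []
prompt = [np.random.randn(d) for _ in range(4)]
for x in prompt:
    out = decode_step(x, k_cache, v_cache)
for _ in range(3):                      # "generate" 3 more tokens (toy feedback)
    out = decode_step(out, k_cache, v_cache)
print(len(k_cache))                     # cache length = 4 prompt + 3 generated = 7
```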
Ainslie et al. (2023) GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
This is the paper that proposed Grouped-Query Attention (GQA). While MQA speeds up decoder inference by sharing the $K,V$ tensors across attention heads, it is found to degrade model quality. GQA is a generalization of MQA. Just as group norm interpolates between instance norm and layer norm,...
[more]
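A hedged NumPy sketch of the grouped-query idea (my own shapes and names, not the paper's notation): with $h$ query heads and $g$ key/value heads, $g=h$ recovers MHA, $g=1$ recovers MQA, and anything in between is GQA.

```python
import numpy as np

# Grouped-query attention: h query heads share g < h key/value heads.
h, g, t, dk = 8, 2, 5, 16          # query heads, KV groups, seq len, head dim
assert h % g == 0
Q = np.random.randn(h, t, dk)      # one query tensor per head
K = np.random.randn(g, t, dk)      # only g key heads ...
V = np.random.randn(g, t, dk)      # ... and g value heads

# Each group of h // g query heads reads the same K, V.
Kg = np.repeat(K, h // g, axis=0)  # (h, t, dk) view of the shared KV heads
Vg = np.repeat(V, h // g, axis=0)

scores = np.einsum('htd,hsd->hts', Q, Kg) / np.sqrt(dk)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum('hts,hsd->htd', weights, Vg)   # (h, t, dk)
print(out.shape)
```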
Shazeer (2019) Fast Transformer Decoding: One Write-Head is All You Need
This is the paper that proposed Multi-Query Attention (MQA). The author is from Google, and the idea is explained in detail using TensorFlow code.
Firstly, the traditional dot-product attention (single head) is like this:
[more]
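For reference, a NumPy rendition of single-head dot-product attention in the einsum style of the paper's TensorFlow listings (variable names are mine, and the usual $1/\sqrt{d_k}$ scaling is omitted to keep the sketch minimal):

```python
import numpy as np

def dot_product_attention(q, K, V):
    """q: (d_k,) query for one position; K: (m, d_k), V: (m, d_v) memory."""
    logits = np.einsum('d,md->m', q, K)        # score against m memory slots
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                   # softmax over memory positions
    return np.einsum('m,md->d', weights, V)    # weighted sum of values

q = np.random.randn(16)
K = np.random.randn(10, 16)
V = np.random.randn(10, 32)
print(dot_product_attention(q, K, V).shape)    # (32,)
```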
Normalization Zoo
Normalization in deep learning shifts and scales a tensor so that the activations stay in a numerically well-behaved range. This helps with problems such as vanishing/exploding gradients, sensitivity to weight initialization, training stability, and slow convergence.
[more]
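As a quick illustration of the "shift and scale" view, here is a NumPy sketch showing that batch, layer, instance, and group norm differ mainly in which axes the mean and variance are taken over (shapes are illustrative; the learned scale/offset parameters are omitted):

```python
import numpy as np

x = np.random.randn(8, 32, 10, 10)   # (batch N, channels C, height H, width W)

def normalize(x, axes):
    """Shift to zero mean and scale to unit variance over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

batch_norm    = normalize(x, axes=(0, 2, 3))   # per channel, over the batch
layer_norm    = normalize(x, axes=(1, 2, 3))   # per sample, over C, H, W
instance_norm = normalize(x, axes=(2, 3))      # per sample and per channel

# Group norm: split C into G groups, then normalize each group like layer norm.
G = 4
xg = x.reshape(8, G, 32 // G, 10, 10)
group_norm = normalize(xg, axes=(2, 3, 4)).reshape(x.shape)
```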
Black and White
There is no absolute black and white. Humans perceive black when no visible light arrives, and some mixtures of light wavelengths are perceived as white. To measure grayscale, we need to quantify what counts as black and what counts as white.
[more]
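One common way to quantify this, sketched below with the Rec. 709 luminance weights as an assumption, is the relative luminance of linear RGB; the post may use a different measure, so treat this only as an illustration.

```python
import numpy as np

def relative_luminance(rgb_linear):
    """rgb_linear: (..., 3) array in [0, 1], already gamma-decoded."""
    w = np.array([0.2126, 0.7152, 0.0722])   # Rec. 709 weights for R, G, B
    return rgb_linear @ w                    # 0.0 = black, 1.0 = white

print(relative_luminance(np.array([0.0, 0.0, 0.0])))  # 0.0 -> black
print(relative_luminance(np.array([1.0, 1.0, 1.0])))  # 1.0 -> white
```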