Kwon et al (2023) PagedAttention

This is the paper that proposed PagedAttention and shaped the design of vLLM. The authors point out that LLM inference is autoregressive: each new token is generated from the prompt concatenated with the tokens generated so far. The KV cache (described as "incremental multi-head attention" in the GQA paper) is to share the... [more]
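A minimal NumPy sketch of the decoding pattern described above: the KV cache grows by one (key, value) entry per processed token, so earlier tokens are never re-encoded, and each new token attends over the whole cache. The `project` function, head dimension, and random "embeddings" are toy placeholders, not the paper's model; the point is only how the cache grows, which is the memory pressure PagedAttention targets.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                              # toy head dimension (illustrative)

def project(token_vec):
    """Stand-in for the model's K/V/Q projections of one token."""
    return token_vec * 0.5, token_vec * 0.25, token_vec   # (k, v, q)

# KV cache: one (key, value) pair appended per processed token.
k_cache, v_cache = [], []

def step(token_vec):
    k, v, q = project(token_vec)
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)          # (seq_len, d): all cached keys
    V = np.stack(v_cache)          # (seq_len, d): all cached values
    scores = K @ q / np.sqrt(d)    # new token attends over every cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V             # attention output for the new token

prompt = rng.normal(size=(4, d))   # pretend prompt embeddings
for tok in prompt:                 # prefill: cache grows once per prompt token
    out = step(tok)
for _ in range(3):                 # decode: each generated token reuses the cache
    out = step(rng.normal(size=d)) # stand-in for the embedded sampled token
print(len(k_cache))                # 7 cached positions: 4 prompt + 3 generated
```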

Normalization Zoo

Normalization in deep learning shifts and scales a tensor so that activations stay in a well-behaved range. This helps address vanishing/exploding gradients, reduces sensitivity to weight initialization, and improves training stability and convergence. [more]
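A minimal sketch of one member of the zoo, layer normalization, to make the "shift and scale" concrete: each activation vector is centered to zero mean and unit variance, then re-scaled and re-shifted by learnable parameters. The function name, shapes, and parameter values are illustrative, not tied to any particular framework's API.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the last axis to zero mean / unit variance, then apply
    a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta               # restore expressiveness

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(2, 4))
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=-1), out.var(axis=-1))    # per row: ~0 mean, ~1 variance
```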