Kwon et al. (2023) PagedAttention

This is the paper that proposed PagedAttention and shaped the design of vLLM. The authors point out that LLMs are autoregressive: each new token is generated from the prompt concatenated with the previously generated tokens. The KV cache (described as “incremental multi-head attention” in the GQA paper) is to share the... [more]
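The autoregressive decoding loop above can be sketched with a toy KV cache. This is a minimal NumPy illustration, not the paper's implementation: the per-token `q`, `k`, `v` vectors are random stand-ins for what a real model would compute from hidden states with learned projections. The point is only that past keys and values are appended once and reused, so each step attends with just the new token's query.

```python
import numpy as np

def attend(q, K, V):
    # Single-query attention over all cached keys/values.
    scores = q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 4

# KV cache: grows by one row per generated token, never recomputed.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for step in range(3):
    # Hypothetical per-token projections; a real model derives these
    # from the current hidden state with learned weight matrices.
    q = rng.standard_normal(d)
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)  # only the new query attends

print(K_cache.shape)  # one cached key row per generated token
```

PagedAttention's contribution is how this cache is *stored*: in fixed-size blocks rather than one contiguous buffer, which the paper motivates at length.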

Normalization Zoo

Normalization in deep learning shifts and scales a tensor so that activations stay in a well-behaved range. This helps with problems such as vanishing/exploding gradients, sensitivity to weight initialization, training instability, and slow convergence. [more]

Black and White

There is no absolute black and white. Humans perceive black when no visible light is present, and certain compositions of light wavelengths are perceived as white. To measure grayscale, we need to quantify what counts as black and what counts as white. [more]
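One common way to quantify this is a weighted sum of RGB channels, e.g. the ITU-R BT.601 luma weights, where green contributes most because the eye is most sensitive to it. A small sketch:

```python
import numpy as np

# ITU-R BT.601 luma weights; they sum to 1, so the gray value
# stays in the same [0, 255] range as the input channels.
WEIGHTS = np.array([0.299, 0.587, 0.114])

def to_gray(rgb):
    # Map an RGB triple to a single luminance value:
    # no light at all reads as black, full equal intensity as white.
    return float(rgb @ WEIGHTS)

print(to_gray(np.array([0, 0, 0])))        # black end of the scale
print(to_gray(np.array([255, 255, 255])))  # white end of the scale
```

Everything between those two endpoints is a shade of gray, which is exactly the quantification the post sets out to discuss.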