This is the paper that proposed Grouped-Query Attention (GQA). While MQA speeds up decoder inference by sharing the $K,V$ tensors across all attention heads, it is found to degrade model quality. GQA is a generalization of MQA. Just as group norm interpolates between instance norm and layer norm,...
[more]
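The excerpt is cut short above, so here is a minimal NumPy sketch of the grouping idea; the head counts (8 query heads sharing 2 K/V heads) are illustrative assumptions, not numbers from the paper. Setting the number of K/V heads to 1 recovers MQA, and setting it equal to the number of query heads recovers vanilla multi-head attention.

```python
import numpy as np

n_q_heads, n_kv_heads, d, m = 8, 2, 64, 10   # illustrative sizes, not from the paper
group = n_q_heads // n_kv_heads              # query heads per shared K/V head

q = np.random.randn(n_q_heads, d)            # one query vector per query head
k = np.random.randn(n_kv_heads, m, d)        # K and V are stored once per group,
v = np.random.randn(n_kv_heads, m, d)        # so the decoder's KV cache shrinks

out = np.empty_like(q)
for h in range(n_q_heads):
    g = h // group                           # the group this query head reads from
    logits = k[g] @ q[h] / np.sqrt(d)        # (m,) scaled dot products
    w = np.exp(logits - logits.max())
    w /= w.sum()                             # softmax over memory positions
    out[h] = w @ v[g]                        # (d,) attention output for head h
```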
Shazeer (2019) Fast Transformer Decoding: One Write-Head is All You Need
This is the paper that proposed Multi-Query Attention (MQA). The author is from Google, and the idea is explained in detail using TensorFlow code.
First, the traditional dot-product attention (single head) looks like this:
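The code itself sits behind the [more] link; as a stand-in, here is a minimal NumPy paraphrase (not the paper's TensorFlow listing) of one query vector attending over $m$ memory positions, with the usual $1/\sqrt{d_k}$ scaling added:

```python
import numpy as np

def dot_product_attention(q, K, V):
    """One query q of shape (d_k,) attends over memory K (m, d_k), V (m, d_v)."""
    logits = K @ q / np.sqrt(q.shape[-1])    # (m,) scaled similarity scores
    w = np.exp(logits - logits.max())
    w /= w.sum()                             # softmax over the m memory positions
    return w @ V                             # (d_v,) weighted sum of the value rows
```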
[more]
Normalization Zoo
Normalization in deep learning shifts and scales a tensor so that the activations stay in a sweet spot. This helps with problems such as vanishing/exploding gradients, sensitivity to weight initialization, training stability, and convergence.
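As a concrete example of the shift-and-scale, here is a NumPy sketch of group norm over an (N, C, H, W) tensor; with one group it degenerates to layer norm over (C, H, W), and with C groups to instance norm. The learnable gamma/beta parameters are omitted for brevity.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize an (N, C, H, W) tensor within channel groups."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)   # statistics per (sample, group)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
```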
[more]
Black and White
There is no absolute black and white. Humans perceive black when there is no visible light, and some compositions of light wavelengths are perceived as white. To measure grayscale, we want to quantify what counts as black and what counts as white.
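One common way to quantify this (possibly not the one the post settles on) is the ITU-R BT.601 luma, a weighted sum of RGB where 0 reads as black and 1 as white:

```python
def luma(r, g, b):
    """ITU-R BT.601 luma: grayscale value from RGB components in [0, 1]."""
    return 0.299 * r + 0.587 * g + 0.114 * b

luma(0.0, 0.0, 0.0)  # 0.0 -> black: no light in any channel
luma(1.0, 1.0, 1.0)  # 1.0 -> white: full intensity in all channels
```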
[more]
Timezone in Python
The UNIX epoch is always in UTC. There's no such thing as a local epoch. To get the epoch on the command line, run date +%s; in Python, call time.time(). Even though time.localtime() and time.gmtime() differ, the epoch is universally consistent across timezones.
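A quick standard-library sketch to verify this: converting either the local or the UTC broken-down time back to an epoch yields the same number.

```python
import calendar
import time

now = time.time()             # seconds since the UNIX epoch, timezone-independent
utc = time.gmtime(now)        # broken-down time in UTC
local = time.localtime(now)   # broken-down time in the local timezone

# timegm() treats its argument as UTC, mktime() as local time;
# both recover the same epoch (modulo the truncated fraction of a second).
assert calendar.timegm(utc) == int(time.mktime(local)) == int(now)
```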
[more]