This is the paper that proposed Multi-Query Attention (MQA). The author is from
Google, and the idea was explained in detail with TensorFlow code.
Firstly, the traditional dot-product attention (single head) is like this:
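Roughly, and as a minimal sketch in NumPy rather than the paper's TensorFlow (so not the paper's exact code), single-head dot-product attention for one query vector looks like:

```python
import numpy as np

def dot_product_attention(q, K, V):
    """Single-head dot-product attention for one query (a sketch, not the paper's code).

    q: query vector of shape (d_k,)
    K: keys of shape (m, d_k)
    V: values of shape (m, d_v)
    Returns a weighted sum of the values, shape (d_v,).
    """
    logits = K @ q / np.sqrt(q.shape[-1])      # scaled similarity of q to each key
    weights = np.exp(logits - logits.max())    # softmax over the m keys
    weights /= weights.sum()
    return weights @ V                         # convex combination of the values
```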
[more]
Normalization Zoo
Normalization in deep learning shifts and scales a tensor so that the activations
stay in a sweet spot. This helps with problems such as vanishing/exploding
gradients and sensitivity to weight initialization, and it improves training
stability and convergence.
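As one concrete member of the zoo, here is a minimal layer-normalization sketch over the last axis (illustrative only, not code from the post):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the feature axis (a sketch, not from the post).

    x: activations of shape (..., d)
    gamma, beta: learned scale and shift of shape (d,)
    """
    mean = x.mean(axis=-1, keepdims=True)      # per-sample mean over features
    var = x.var(axis=-1, keepdims=True)        # per-sample variance over features
    x_hat = (x - mean) / np.sqrt(var + eps)    # standardize to zero mean, unit variance
    return gamma * x_hat + beta                # re-scale and re-shift
```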
[more]
Black and White
There is no absolute black or white. Humans perceive black when there is no
visible light, and certain compositions of light wavelengths are perceived as
white. To measure grayscale, we want to quantify what counts as black and what
counts as white.
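One common convention (a hedged sketch using the Rec. 601 luma weights; the post may use a different formula) is a weighted sum of the RGB channels, where 0 is black and 1 is white:

```python
def luma(r, g, b):
    """Approximate perceived brightness of an RGB pixel with channels in 0..1.

    Uses the Rec. 601 weights; 0.0 is black, 1.0 is white.
    """
    return 0.299 * r + 0.587 * g + 0.114 * b

print(luma(0.0, 0.0, 0.0))  # 0.0 -> black (no light)
print(luma(1.0, 1.0, 1.0))  # 1.0 -> white (full intensity in every channel)
```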
[more]
Timezone in Python
The UNIX epoch is always in UTC; there is no such thing as a local epoch. To get
the epoch timestamp on the command line, run date +%s, or in Python, call
time.time(). It doesn't matter that time.localtime() and time.gmtime() return
different results: the epoch timestamp is universally consistent across timezones.
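A quick illustration (a minimal sketch, not necessarily the post's example):

```python
import time

t = time.time()              # seconds since the UNIX epoch, independent of timezone
print(t)
print(time.gmtime(t))        # the same instant broken down in UTC
print(time.localtime(t))     # the same instant broken down in the local timezone
# The two struct_time values differ, but both come from the same epoch timestamp,
# and time.time() returns the same number regardless of the machine's timezone.
```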
[more]
concurrent.futures in Python
The Python standard library concurrent.futures is the easiest way to run
parallel jobs in a best-effort manner. If the heavy jobs run outside the
interpreter (e.g., NumPy routines that release the GIL), using the thread pool
from concurrent.futures can give you a noticeable performance benefit.
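A minimal sketch of the idea (not the post's code), using NumPy work that releases the GIL while the threads overlap:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def heavy(seed):
    """A GIL-releasing workload: NumPy performs the matrix multiply in C."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((1000, 1000))
    return float(np.linalg.norm(a @ a))

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(heavy, range(8)))  # jobs run concurrently in threads
print(results)
```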
[more]