Kirillov et al. (2023) Segment Anything

The goal of the paper is to produce a promptable segmentation model: one that supports flexible prompting and can output segmentation masks in real time. For any given segmentation prompt (e.g., a point in the image, which may be ambiguous), the model is expected to return a... [more]

Dosovitskiy et al. (2021) An Image is Worth 16x16 Words

This is the paper that introduced the Vision Transformer (ViT), proposing that transformers can replace CNNs for image classification. Inspired by the success of transformer models in NLP, the paper explores how to apply transformers to 2D image data. The goal is to create a... [more]
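The title's "16x16 words" refers to ViT's key preprocessing step: the image is cut into non-overlapping 16x16 patches, and each flattened patch becomes one token for the transformer. A minimal numpy sketch of that patchify step (function name and shapes here are illustrative, not from the paper):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping patches and
    flatten each one, so every patch becomes a single 'token'."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    # (ph, patch, pw, patch, C) -> (ph, pw, patch, patch, C) -> (N, patch*patch*C)
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(ph * pw, patch_size * patch_size * C)

img = np.random.rand(224, 224, 3)      # standard ViT input resolution
tokens = patchify(img)
print(tokens.shape)                     # (196, 768): 14x14 patches of 16*16*3 values
```

In the full model these flattened patches are then linearly projected to the embedding dimension, a learnable class token is prepended, and position embeddings are added before the standard transformer encoder.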

Kwon et al. (2023) PagedAttention

This is the paper that proposed PagedAttention and drove the design of vLLM. The authors point out that LLMs are autoregressive: each token is generated from the prompt concatenated with the previously generated sequence. The KV cache (described as "incremental multi-head attention" in the GQA paper) is to share the... [more]
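The autoregressive pattern the authors describe is what makes the KV cache grow one entry per generated token: each step computes keys/values only for the newest token and reuses all earlier ones. A single-head numpy sketch of that mechanism (toy shapes and random weights for illustration; this shows the cache, not the paged block management that is the paper's contribution):

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector over cached K/V."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 8
rng = np.random.default_rng(0)
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(5):
    x = rng.standard_normal(d)                 # embedding of the newest token
    # Append only the new token's key/value; earlier rows are reused as-is.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attend(x, K_cache, V_cache)          # attends over the whole cache

print(K_cache.shape)                            # (5, 8): one row per generated token
```

Because this cache grows unpredictably per request, naive contiguous allocation fragments GPU memory; PagedAttention's answer is to store the cache in fixed-size blocks addressed through an indirection table, analogous to virtual-memory paging.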