PagedAttention

PagedAttention is a technique for improving the memory utilization of the KV cache. The GPU memory consumed by the KV cache grows significantly with larger Large Language Models (LLMs), longer context lengths, and longer output sequences. PagedAttention partitions the KV cache into blocks that are accessed through a lookup table. As a result, the KV cache does not need to be stored in contiguous memory, and blocks can be allocated and deallocated as needed. This improves GPU memory utilization during memory-bound workloads, which makes it possible to accommodate larger inference batches.
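The block-and-lookup-table idea can be illustrated with a small toy model. This is only a sketch under stated assumptions, not how any particular inference server implements it: the names `BlockAllocator`, `Sequence`, `block_table`, and the block size are hypothetical, and plain integers stand in for physical GPU memory blocks.

```python
class BlockAllocator:
    """Hands out fixed-size physical blocks from a free list (hypothetical helper)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size          # tokens stored per block
        self.free = list(range(num_blocks))   # ids of unused physical blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV cache blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


class Sequence:
    """Tracks one request's KV cache through a per-sequence lookup table."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is requested only when the current one is full,
        # so memory grows with the sequence instead of being reserved up front.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_idx: int) -> tuple[int, int]:
        """Return (physical block id, offset inside the block) for a token."""
        if not 0 <= token_idx < self.num_tokens:
            raise IndexError(token_idx)
        size = self.allocator.block_size
        return self.block_table[token_idx // size], token_idx % size

    def free(self) -> None:
        # Blocks return to the pool as soon as the sequence finishes.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=4, block_size=4)
request = Sequence(allocator)
for _ in range(5):
    request.append_token()
print(len(request.block_table))  # 5 tokens with 4 tokens per block need 2 blocks
```

Because the table maps logical positions to arbitrary physical blocks, the blocks of one sequence do not have to be adjacent in memory, and a finished sequence can hand its blocks to another request immediately.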

The lookup table also allows KV cache blocks to be shared across multiple output generations: the blocks holding a prompt's KV cache can be used to generate several outputs for that prompt at the same time, a technique called parallel sampling.
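One way to picture this sharing is with reference-counted blocks. The sketch below is an assumption made for illustration, not a description of a specific library: `SharedBlockPool` and `fork_block_table` are hypothetical names, and it assumes each sample writes its newly generated tokens into a fresh block of its own, so a partially filled last prompt block (which a real system would have to handle separately) is left out.

```python
from collections import Counter


class SharedBlockPool:
    """Free list plus reference counts, so several samples can share a block."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refs = Counter()  # physical block id -> number of tables using it

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV cache blocks")
        block_id = self.free.pop()
        self.refs[block_id] = 1
        return block_id

    def share(self, block_id: int) -> None:
        self.refs[block_id] += 1

    def release(self, block_id: int) -> None:
        # A block goes back to the free list only when its last user lets go.
        self.refs[block_id] -= 1
        if self.refs[block_id] == 0:
            del self.refs[block_id]
            self.free.append(block_id)


def fork_block_table(pool: SharedBlockPool, prompt_table: list[int], n: int) -> list[list[int]]:
    """Create n lookup tables that all point at the same prompt blocks."""
    tables = []
    for _ in range(n):
        for block_id in prompt_table:
            pool.share(block_id)
        tables.append(list(prompt_table))
    return tables


pool = SharedBlockPool(num_blocks=8)
prompt_table = [pool.allocate(), pool.allocate()]   # prompt occupies 2 blocks
samples = fork_block_table(pool, prompt_table, n=3)
for block_id in prompt_table:                        # drop the original reference
    pool.release(block_id)
for table in samples:                                # each sample gets 1 private block
    table.append(pool.allocate())
print(8 - len(pool.free))  # 5 blocks in use, versus 9 if each sample copied the prompt
```

In this toy setup the three samples together occupy five blocks (two shared prompt blocks plus one private block each) rather than the nine that three independent copies would need.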

tags	ai/largelanguagemodel
source	https://huggingface.co/docs/text-generation-inference/en/conceptual/paged_attention
