KV Cache

KV Cache is an optimization technique that speeds up inference in a Large Language Model (LLM) by avoiding redundant re-computation of the key and value vectors of earlier tokens while a decoder generates its output auto-regressively. The keys and values are stored in memory, ready to be reused in the next token-generation cycle without being re-computed. This greatly increases inference speed and lowers latency, and it reduces the compute required to generate a response from the model, at the cost of extra memory to hold the cache.
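The idea can be illustrated with a minimal sketch, assuming a single attention head, random weights, and no positional encoding. All names here (`step_cached`, `step_uncached`, the `cache` dict) are hypothetical and chosen for illustration; this is not how any particular library implements it. The cached path computes the key and value of only the newest token at each step, while the baseline recomputes them for every token seen so far.

```python
import math
import random

def matvec(x, W):
    """Row vector x (length n) times matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all stored keys/values."""
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def step_cached(x_new, W_q, W_k, W_v, cache):
    """Process only the newest token: compute its K/V once, append, then attend."""
    cache["keys"].append(matvec(x_new, W_k))
    cache["values"].append(matvec(x_new, W_v))
    return attend(matvec(x_new, W_q), cache["keys"], cache["values"])

def step_uncached(tokens, W_q, W_k, W_v):
    """Baseline: recompute K/V for every token seen so far at each step."""
    keys = [matvec(x, W_k) for x in tokens]
    values = [matvec(x, W_v) for x in tokens]
    return attend(matvec(tokens[-1], W_q), keys, values)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d = 4  # toy embedding size (assumption for the example)
W_q, W_k, W_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
tokens = rand_matrix(5, d)  # 5 random vectors standing in for token embeddings

cache = {"keys": [], "values": []}
cached_out = [step_cached(x, W_q, W_k, W_v, cache) for x in tokens]
uncached_out = [step_uncached(tokens[: t + 1], W_q, W_k, W_v) for t in range(len(tokens))]

# Largest element-wise gap between the two paths across all steps
max_diff = max(abs(a - b) for c, u in zip(cached_out, uncached_out) for a, b in zip(c, u))
print(f"cached keys: {len(cache['keys'])}, max difference: {max_diff:.2e}")
```

Both paths should produce the same attention outputs. The difference is the work done: over n steps the baseline performs on the order of n² key/value projections, while the cached version performs n and pays for it with memory that grows by one key and one value per generated token.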

For a more detailed walkthrough, read "Understanding and Coding the KV Cache in LLMs from Scratch".

KV Cache is also a very important component of vLLM, and it helps make vLLM one of the fastest inference projects as of writing (2025).