Quantization

Quantization is a technique that trades the precision of a Large Language Model's (LLM) weights for a smaller model. By default each weight is stored as a 32-bit floating-point number (FP32), which takes 4 bytes, so a 7B-parameter model needs around 7 billion x 4 bytes = 28 GB of RAM for the weights alone, which is a lot for a model of that size. To reduce the size we can lower the precision, for example to 16 bits, which cuts the memory footprint roughly in half, and lower bit widths such as 8 or 4 bits shrink it further.
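The arithmetic above can be sketched in a few lines of Python. This is only a rough estimate: it counts the weights alone and ignores activations, the K/V cache, and any per-block metadata a real quantized format stores. The function name `weight_memory_gb` is made up for this note.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in decimal gigabytes."""
    bytes_per_weight = bits_per_weight / 8
    return num_params * bytes_per_weight / 1e9


# Compare a 7B-parameter model at several precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(7e9, bits):.1f} GB")
```

Going from 32 to 16 bits halves the estimate, and each further halving of the bit width halves it again.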

On Ollama there are models tagged with K-quants. In this type of quantization, the weights are grouped into blocks and the weights in each block share quantization parameters, which reduces the overall model size while limiting how much precision is lost. There are three types of K-quants:

  • K_S (small) - the least amount of data is retained, giving the smallest model
  • K_M (medium) - a moderate amount of data is retained
  • K_L (large) - the largest amount of data is retained, giving the largest model

Because the quantized model is smaller, it generally loads faster at startup and the generation speed, measured in tokens per second (TPS), generally goes up.
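To get an intuition for why grouping weights helps, here is a simplified block-wise quantization sketch. It is not the actual K-quant file format (which, as far as I understand, uses nested blocks and also quantizes the scales); it only illustrates the idea of giving each block of weights its own scale. All function names and the block size of 32 are assumptions made for this note.

```python
def quantize_block(block, bits=4):
    """Quantize one block of floats to signed integer codes with a shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    absmax = max(abs(w) for w in block)
    scale = absmax / qmax if absmax > 0 else 1.0    # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return scale, codes


def quantize_blockwise(weights, block_size=32, bits=4):
    """Split the weights into fixed-size blocks and quantize each one separately."""
    return [
        quantize_block(weights[i:i + block_size], bits)
        for i in range(0, len(weights), block_size)
    ]


def dequantize_blockwise(blocks):
    """Rebuild approximate float weights from (scale, codes) pairs."""
    restored = []
    for scale, codes in blocks:
        restored.extend(scale * c for c in codes)
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy data: 32 small-magnitude weights followed by 32 large-magnitude weights.
weights = [0.01 * i for i in range(-16, 16)] + [1.0 * i for i in range(-16, 16)]

per_block = dequantize_blockwise(quantize_blockwise(weights, block_size=32))
one_scale = dequantize_blockwise(quantize_blockwise(weights, block_size=len(weights)))

print("mean error, one scale per block:", mean_abs_error(weights, per_block))
print("mean error, one global scale:   ", mean_abs_error(weights, one_scale))
```

With a single global scale the small weights all collapse to zero, while per-block scales keep them distinguishable, so the average error is lower at the same bit width.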

Ollama also supports context (K/V cache) quantization: enable Flash Attention with OLLAMA_FLASH_ATTENTION=1 and set OLLAMA_KV_CACHE_TYPE to f16 (the default, unquantized), q8_0, or q4_0.
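A minimal sketch of setting this up from Python, assuming the environment variable names from the Ollama FAQ (OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE) and that the `ollama` binary is on the PATH. The helper names `ollama_env` and `start_ollama_server` are hypothetical, and the variables have to be set on the server process, not on the client.

```python
import os
import subprocess

# Cache types listed in the Ollama FAQ; f16 is the unquantized default.
VALID_KV_CACHE_TYPES = ("f16", "q8_0", "q4_0")


def ollama_env(kv_cache_type="q8_0", base_env=None):
    """Return a copy of the environment with K/V cache quantization enabled."""
    if kv_cache_type not in VALID_KV_CACHE_TYPES:
        raise ValueError(f"unsupported K/V cache type: {kv_cache_type!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_FLASH_ATTENTION"] = "1"        # needed for a quantized K/V cache
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    return env


def start_ollama_server(kv_cache_type="q8_0"):
    """Launch `ollama serve` with the settings above (assumes ollama is installed)."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_env(kv_cache_type))
```

Calling `start_ollama_server("q8_0")` would start the server with an 8-bit K/V cache; `ollama_env` can be inspected on its own without launching anything.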

Quantization methods fall into two broad categories:

  • Post-Training Quantization (PTQ): this refers to techniques that quantize an LLM after it has already been trained. PTQ is easier to implement than QAT, as it requires less training data and is faster. However, it can also reduce model accuracy because precision is lost in the values of the weights.
  • Quantization-Aware Training (QAT): this refers to methods of fine-tuning on data with quantization in mind. In contrast to PTQ techniques, QAT integrates the weight conversion process (calibration, range estimation, clipping, rounding, etc.) into the training stage. This often results in superior model performance, but is more computationally demanding.
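The conversion steps named above (range estimation, clipping, rounding) can be sketched with simple symmetric "absmax" INT8 quantization. This is one common scheme picked for illustration, not the method of any particular library, and every function name here is made up. Applied once to finished weights it behaves like PTQ; QAT would instead run a quantize-then-dequantize step such as `fake_quantize` inside the training forward pass so the model learns to tolerate the rounding.

```python
def estimate_scale(values, bits=8):
    """Range estimation: map the largest magnitude onto the largest integer code."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8
    absmax = max(abs(v) for v in values)
    return absmax / qmax if absmax > 0 else 1.0


def quantize(values, scale, bits=8):
    """Rounding and clipping: convert floats to integer codes."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def fake_quantize(values, bits=8):
    """Quantize then dequantize, exposing the rounding error in float space."""
    scale = estimate_scale(values, bits)
    return dequantize(quantize(values, scale, bits), scale)


# PTQ-style use: quantize already-trained weights once and inspect the error.
trained_weights = [0.5, -1.27, 0.003, 1.0]
scale = estimate_scale(trained_weights)
codes = quantize(trained_weights, scale)
restored = dequantize(codes, scale)

print("codes:   ", codes)
print("restored:", restored)
print("max error:", max(abs(a - b) for a, b in zip(trained_weights, restored)))
```

The rounding error per weight is bounded by about half the scale, which is why very small weights such as 0.003 are the ones most affected.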

Post-Training Quantization (PTQ)

5 Essential LLM Quantization Techniques Explained > Technique 1 Post-Training Quantization (PTQ)

5 Essential LLM Quantization Techniques Explained > Static PTQ
5 Essential LLM Quantization Techniques Explained > Dynamic PTQ

Quantization-Aware Training (QAT)

5 Essential LLM Quantization Techniques Explained > Technique 2 Quantization-Aware Training (QAT)

GPTQ (Generalized Post-Training Quantization)

A Guide to Quantization in LLMs > GPTQ

AWQ (Activation-aware Weight Quantization)

A Guide to Quantization in LLMs > AWQ

AutoGPTQ

SpQR (Sparse-Quantized Representation)

5 Essential LLM Quantization Techniques Explained > Technique 5 SpQR (Sparse-Quantized Representation)

BitsAndBytes

INT4

INT8