Quantization

Quantization is a technique that trades the precision of a Large Language Model's (LLM) weights for a smaller model. In full precision, each weight is typically stored as a 32-bit floating-point number, which takes 4 bytes, so a 7B parameter model needs around 7 billion x 4 bytes = 28 GB of RAM for the weights alone. That is a lot for a model of that size. To reduce it, we can store the weights at a lower precision: 16 bits cuts the size in half (about 14 GB), and 8-bit or 4-bit formats shrink it further.
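
The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and the precision labels are made up for illustration, and it counts weight storage only, ignoring activations, the KV cache, and any per-block metadata that real quantized formats add.

```python
# Bytes needed to store one weight at each precision (weights only).
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: int, precision: str) -> float:
    """Approximate weight storage in GB, taking 1 GB = 10**9 bytes."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


for precision in BYTES_PER_WEIGHT:
    size = weight_memory_gb(7_000_000_000, precision)
    print(f"{precision}: {size:.1f} GB")
```

For a 7B model this gives 28 GB at FP32 and 14 GB at FP16, matching the numbers in the text.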

On Ollama, many models are tagged with K-Quants. In this type of quantization, the weights are grouped into blocks, and the weights in each block are stored as low-bit values that share scaling information, which reduces the overall model size while limiting the loss in precision. There are three variants of K-Quants:

K_S - small: the least precision is retained, giving the smallest file
K_M - medium: a moderate amount of precision is retained
K_L - large: the most precision is retained, giving the largest file

Because a quantized model is smaller, it typically starts up faster and generates more Tokens Per Second (TPS), since less data has to be read from memory for each token.
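
To make the idea of block-wise quantization concrete, here is a minimal sketch in plain Python. It is a simplified illustration and not the actual K-Quant file layout: the block size of 32, the symmetric 4-bit range, and all function names are assumptions chosen for the example. Each block gets its own scale, so a block of small weights is not forced onto the coarse grid needed by a block of large weights.

```python
import random


def quantize_block(block, bits=4):
    """Quantize one block of floats to signed integers sharing one scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(w) for w in block)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    q = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return scale, q


def quantize_blockwise(weights, block_size=32, bits=4):
    """Split the weights into blocks and quantize each block on its own."""
    return [
        quantize_block(weights[i:i + block_size], bits)
        for i in range(0, len(weights), block_size)
    ]


def dequantize_blockwise(blocks):
    """Rebuild approximate float weights from (scale, integers) pairs."""
    return [scale * v for scale, q in blocks for v in q]


random.seed(0)
# One block of small weights followed by one block of large weights.
weights = [random.uniform(-0.05, 0.05) for _ in range(32)]
weights += [random.uniform(-2.0, 2.0) for _ in range(32)]

blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)
for i, (scale, _) in enumerate(blocks):
    print(f"block {i}: scale = {scale:.4f}")
```

The rounding error of any weight is at most half of its block's scale, which is why the first block is reconstructed far more accurately than it would be with a single scale for the whole tensor.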

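Ollama also supports Context Quantization, which quantizes the KV cache. To use it, enable flash attention with OLLAMA_FLASH_ATTENTION=1 and set OLLAMA_KV_CACHE_TYPE to a quantized type such as q8_0 or q4_0. The default, f16, leaves the cache unquantized.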
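
A rough estimate shows how much memory the context can take at different cache precisions. The sketch below is an approximation under stated assumptions: the model shape is hypothetical, the function name is made up, and the bytes-per-element figures assume the GGML block layouts, where q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes.

```python
# Bytes per cached element. The quantized types pack 32 values plus one
# 2-byte scale per block (assumed GGML layouts: 34 and 18 bytes per block).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_type):
    """Estimate KV cache size: one key and one value vector per token per layer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)

for cache_type in BYTES_PER_ELEMENT:
    gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
```

With this shape the f16 cache takes exactly 1 GiB, and the quantized types cut that to roughly a half and a quarter.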

Post-Training Quantization (PTQ): this refers to techniques that quantize an LLM after it has already been trained. PTQ is easier to implement than QAT, as it requires less training data and is faster. However, it can also result in reduced model accuracy from lost precision in the value of the weights.
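
As a minimal sketch of what PTQ involves, the example below calibrates an 8-bit affine quantizer from a handful of sample values and then applies it to already trained numbers. The function names and the sample values are illustrative assumptions, and real PTQ methods use more careful range estimation than a plain min/max.

```python
def calibrate(samples, bits=8):
    """Derive (scale, zero_point) from the min/max seen in calibration data."""
    lo = min(min(samples), 0.0)   # widen the range so 0.0 is representable
    hi = max(max(samples), 0.0)
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, bits=8):
    """Round x onto the integer grid and clip it to the representable range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float value."""
    return scale * (q - zero_point)


# Illustrative calibration samples; no training is involved at any point.
calibration = [-1.0, -0.2, 0.0, 0.4, 3.0]
scale, zero_point = calibrate(calibration)

for x in (-1.0, 0.0, 0.4, 3.0, 10.0):
    q = quantize(x, scale, zero_point)
    print(x, "->", q, "->", round(dequantize(q, scale, zero_point), 4))
```

Values inside the calibrated range come back with a small rounding error, while the out-of-range value 10.0 is clipped to the largest representable value. This is the precision loss that PTQ accepts in exchange for skipping retraining.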
Quantization-Aware Training (QAT): this refers to methods of fine-tuning on data with quantization in mind. In contrast to PTQ techniques, QAT integrates the weight conversion process (calibration, range estimation, clipping, rounding, etc.) into the training stage. This often results in superior model performance, but is more computationally demanding.
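
The mechanics of QAT can be sketched on a toy problem with a single weight. The example assumes a common approach, "fake quantization" with a straight-through estimator: the forward pass uses the rounded weight, while the gradient is applied to a full-precision copy as if the rounding were not there. The grid step, learning rate, data, and function names are all illustrative assumptions.

```python
def fake_quantize(w, step=0.25):
    """Snap w to the nearest grid point, as the deployed weight would be."""
    return round(w / step) * step


def train_qat(data, step=0.25, lr=0.05, epochs=200):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # full-precision "latent" weight kept during training
    for _ in range(epochs):
        for x, y in data:
            w_q = fake_quantize(w, step)   # forward pass uses the rounded weight
            grad = 2 * (w_q * x - y) * x   # gradient of squared error w.r.t. w_q
            w -= lr * grad                 # straight-through: treat d(w_q)/dw as 1
    return w, fake_quantize(w, step)


# Toy data generated from a target weight of 0.6, which is not on the grid.
data = [(x, 0.6 * x) for x in (1.0, 2.0, -1.5)]
latent_w, deployed_w = train_qat(data)
print("latent weight:", round(latent_w, 3), "deployed weight:", deployed_w)
```

Because the loss is always computed with the rounded weight, the latent weight settles near a boundary between the two grid points closest to the target, and the deployed weight is one of those two points. The extra cost is visible even here: the quantization step runs on every training update.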