Large Language Models (LLMs) are complex and require significant computational resources to run. They are designed to understand and generate human-like text, which involves processing vast amounts of data. This blog explores how quantization can save compute time and cost while preserving similar output quality.

As LLMs are scaled up to improve their performance, they require more parameters to understand the intricacies of human language. For instance:

  • Mistral 7B uses about 7 billion parameters,
  • Meta’s Llama 3.1 scales up to 405 billion parameters, and
  • OpenAI’s GPT-4 is reported (though not officially confirmed) to use about 1.76 trillion parameters!

These parameters are stored in memory, increasing the model's memory footprint. This can be a major challenge, as handling such a large amount of data can limit the model's speed and efficiency due to slower processing times and increased energy consumption.

To address this issue, researchers have developed a technique known as quantization. Through quantization, one can reduce the memory requirement of LLMs without significantly impacting their performance. This beginner’s guide explains the concept of quantization, why it is important, and how it can be applied.

What is Quantisation?

Quantization is a process that reduces the numerical precision of a model's parameters by utilizing lower-precision representations to store its weights. This approach enables the reduction of memory requirements for each parameter, subsequently decreasing the overall memory footprint of the model.

To illustrate this concept, consider a model that initially uses 32-bit floating-point numbers to represent its weights. After quantization, these weights can be represented using 16-bit floating-point numbers or even 8-bit integers. This makes it possible to deploy the model on smaller GPU hardware while largely preserving accuracy and reducing costs and resource utilization.
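As a rough illustration of the saving, the sketch below estimates the memory needed for the weights alone at several precisions. The helper name `weight_memory_gb` and the 7-billion-parameter figure are assumptions chosen for the example, and the estimate ignores activations, caches, and any per-tensor quantization metadata.

```python
# Approximate weight-only memory footprint at different precisions.
# Bytes per parameter follow directly from the bit width (int4 packs two values per byte).
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical 7-billion-parameter model.
for dtype in BYTES_PER_PARAM:
    print(f"7B parameters in {dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

For such a model, the weights shrink from about 28 GB in float32 to about 14 GB in float16 and about 7 GB in int8.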

If you're interested, you can read about other techniques besides quantization for optimizing LLMs here.

The Math Behind Quantisation

The most common conversions in quantization are float32 to float16 and float32 to int8.

Half-precision Floating Point Quantization (float32 to float16):

This process essentially involves reducing the precision of the floating point numbers representing the model's weights from 32 bits to 16 bits. This works by altering the numbers' bit-level representations in memory. Now, in terms of representation,

  • float32 is represented using 1 sign bit, 8 bits for the exponent, and 23 bits for the fraction.
  • float16 is represented using 1 sign bit, 5 bits for the exponent, and 10 bits for the fraction.

Since both formats are floating-point, the conversion is straightforward: the 23-bit fraction is rounded to 10 bits, and the exponent is re-encoded in 5 bits. Because the exponent range shrinks, float32 values larger in magnitude than about 65,504 overflow to infinity in float16, and very small values lose precision or underflow to zero. This process halves the memory usage per parameter.
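The bit layouts above can be inspected directly with NumPy. This is a minimal sketch: the helper `float_bits` and the sample values are made up for illustration, and NumPy's `astype` performs the rounding for us.

```python
import numpy as np


def float_bits(value, dtype):
    """Return the raw bit pattern of `value` stored as `dtype`, as a binary string."""
    arr = np.array([value], dtype=dtype)
    width = arr.itemsize * 8
    uint_type = {16: np.uint16, 32: np.uint32}[width]
    return format(int(arr.view(uint_type)[0]), f"0{width}b")


# Hypothetical weights, including one too large and one too small for float16.
weights32 = np.array([0.1, -1.5, 3.14159265, 70000.0, 1e-8], dtype=np.float32)
with np.errstate(over="ignore"):  # silence the overflow warning for 70000.0
    weights16 = weights32.astype(np.float16)

for w32, w16 in zip(weights32, weights16):
    b32, b16 = float_bits(w32, np.float32), float_bits(w16, np.float16)
    # Print sign | exponent | fraction for each format.
    print(f"{w32!r:>12}: {b32[0]}|{b32[1:9]}|{b32[9:]}  ->  {b16[0]}|{b16[1:6]}|{b16[6:]}  ({w16!r})")

print("bytes:", weights32.nbytes, "->", weights16.nbytes)
```

Values such as -1.5 survive exactly, 0.1 and pi are rounded to the nearest float16, 70000.0 overflows to infinity, and 1e-8 underflows to zero.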

Integer Quantisation (float32 to int8):

Integer quantization converts the 32-bit floating-point numbers representing the model's weights to 8-bit integers. It is more complex than half-precision floating-point quantization because it maps a wide range of floating-point values onto only 256 discrete integer levels.

To perform this conversion, the following steps are followed:

  1. The range of the floating point values is determined.

  2. This range is mapped to the range of int8, which is -128 to 127. The mapping is performed using the formula:

    Q = round(Clip(F * S + Z, min, max)), where:

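    • F is the float32 value,
    • S is a positive float32 scaling factor, typically S = (max - min) / (F_max - F_min), where F_min and F_max are the ends of the range found in step 1,
    • Z is the zero point (an int8 value that corresponds to zero in the float32 space), typically Z = round(min - F_min * S).

    De-quantization approximately recovers the original value: F ≈ (Q - Z) / S.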

    NOTE: Clip() is a function that restricts the value within a specific range. Here, "min" is -128, and "max" is 127 (the minimum and maximum values of int8).

This process reduces the memory requirement per parameter by a factor of four.
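The two steps above can be sketched in NumPy as follows. The function names are hypothetical, the scale and zero point are derived from the observed minimum and maximum (one common choice, assumed here), and the sketch assumes the tensor is not constant and that its range includes zero so that Z fits in int8.

```python
import numpy as np

QMIN, QMAX = -128, 127  # range of int8


def quantize_int8(f):
    """Map a float32 array onto int8 using Q = round(clip(F * S + Z, min, max))."""
    f_min, f_max = float(f.min()), float(f.max())   # step 1: find the float range
    S = (QMAX - QMIN) / (f_max - f_min)             # scale: int8 span / float span
    Z = int(round(QMIN - f_min * S))                # zero point: where 0.0 lands
    Q = np.round(np.clip(f * S + Z, QMIN, QMAX)).astype(np.int8)  # step 2
    return Q, S, Z


def dequantize_int8(Q, S, Z):
    """Approximate inverse of quantize_int8: F ~ (Q - Z) / S."""
    return (Q.astype(np.float32) - Z) / S


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=1000).astype(np.float32)  # made-up weights

Q, S, Z = quantize_int8(weights)
restored = dequantize_int8(Q, S, Z)
max_error = float(np.abs(weights - restored).max())

print("scale S:", S, "zero point Z:", Z)
print("bytes:", weights.nbytes, "->", Q.nbytes)
print("max round-trip error:", max_error, "(one quantization step is", 1.0 / S, ")")
```

The round-trip error stays within about one quantization step (1 / S), which is the price paid for the fourfold reduction in storage.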

LLM Quantisation Techniques

Quantization techniques are divided into two main categories: post-training quantization (PTQ) and quantization-aware training (QAT).

Post-training Quantisation (PTQ)

Post-training quantization (PTQ) applies quantization to a model after it has been trained. Its key benefit is that it does not require retraining the model, which makes it cheap in computational resources. Well-known PTQ techniques include GPTQ and AWQ. QLoRA, also covered below, quantizes an already-trained model as its first step and then fine-tunes a small set of adapter weights on top of it.


QLoRA stands for Quantised Low-Rank Adaptation. It is a modified version of LoRA (Low-Rank Adaptation), a technique that fine-tunes a large language model by training only a small number of added parameters. Mathematically, LoRA freezes the original weight matrices and trains a pair of low-rank matrices whose product is added to them, rather than retraining the whole model. QLoRA extends this by also quantizing the frozen original weights.

QLoRA primarily involves the following steps:

  1. First, it quantizes the pretrained LLM's weights to 4 bits (the QLoRA paper uses a 4-bit NormalFloat format, NF4, rather than plain int4) and freezes them.
  2. Then, it adds low-rank adapter weights to the quantized LLM. Each adapter is a low-rank decomposition (two thin matrices), so it needs far fewer trainable parameters than the weight matrix it adapts.
  3. During fine-tuning, only these adapters are updated (in higher precision), which recovers task performance while keeping memory usage low.

In this manner, QLoRA combines quantization of an already-trained model with lightweight fine-tuning.
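To make the parameter saving concrete, here is a minimal NumPy sketch of a single layer with a frozen 4-bit base weight and a trainable low-rank adapter. The layer size, the rank, and the simple symmetric 4-bit rounding are all assumptions for illustration; they stand in for, and are not, the quantization format used by actual QLoRA implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 8  # hypothetical layer size and adapter rank

# Frozen base weight, quantized to 16 levels (4 bits) and de-quantized for use.
W = rng.normal(0.0, 0.02, size=(d_out, d_in)).astype(np.float32)
scale = np.abs(W).max() / 7.0
W_q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)  # 4-bit values held in int8 here
W_deq = W_q.astype(np.float32) * scale

# Trainable low-rank adapter: the weight update is B @ A, with rank `rank`.
A = rng.normal(0.0, 0.01, size=(rank, d_in)).astype(np.float32)
B = np.zeros((d_out, rank), dtype=np.float32)  # zero init: the adapter starts as a no-op


def forward(x):
    """Layer output: frozen quantized path plus the low-rank adapter path."""
    return x @ W_deq.T + (x @ A.T) @ B.T


full_params = W.size
adapter_params = A.size + B.size
print(f"trainable parameters: {adapter_params} of {full_params} "
      f"({100.0 * adapter_params / full_params:.2f}%)")
```

With these sizes, only 8,192 adapter parameters are trained instead of the 262,144 in the full matrix, and the base weights never receive gradient updates.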


GPTQ, or Generative Pre-trained Transformer Quantisation, is another technique that falls under PTQ. It is a one-shot weight quantization method that works layer by layer and builds on Optimal Brain Quantization (OBQ): for each layer, it chooses quantized weights that minimize the error between the layer's original output and its quantized output on a small calibration dataset.

Explained simply, GPTQ typically quantizes the weights to 4 bits (int4). For each layer, it estimates a Hessian from the calibration inputs and takes a Cholesky decomposition of the Hessian inverse, which provides the coefficients used to compute the weight adjustments. Then, handling the matrix columns in batches, it quantizes each column in the batch, calculates the quantization error, and adjusts the not-yet-quantized columns in the batch to compensate. Once a batch is finished, the remaining columns are updated based on the accumulated errors. During inference, the weights are de-quantized to float16 on the fly, which preserves output quality while keeping memory usage low.
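The column-by-column idea can be sketched in NumPy. This is a simplified illustration under stated assumptions, not the reference implementation: it processes one column at a time instead of batches, uses a per-row 4-bit grid, and the names `make_grid`, `quant_dequant`, `gptq_like`, and the `damp` parameter are hypothetical. It also assumes no weight row is entirely zero.

```python
import numpy as np


def make_grid(W, bits=4):
    """Per-row asymmetric quantization grid (scale, zero point, max level)."""
    qmax = 2 ** bits - 1
    w_min = np.minimum(W.min(axis=1), 0.0)
    w_max = np.maximum(W.max(axis=1), 0.0)
    scale = (w_max - w_min) / qmax
    zero = np.round(-w_min / scale)
    return scale, zero, qmax


def quant_dequant(w, scale, zero, qmax):
    """Round to the nearest grid level and map back to floating point."""
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale


def gptq_like(W, X, bits=4, damp=0.01):
    """Quantize W (rows x cols) column by column, compensating errors using X (samples x cols)."""
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    scale, zero, qmax = make_grid(W, bits)
    H = 2.0 * X.T @ X                                # Hessian of the layer's squared output error
    H += damp * np.mean(np.diag(H)) * np.eye(cols)   # damping for numerical stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T       # upper factor: inv(H) = U.T @ U
    Q = np.zeros_like(W)
    for i in range(cols):
        w = W[:, i].copy()
        q = quant_dequant(w, scale, zero, qmax)
        Q[:, i] = q
        err = (w - q) / U[i, i]                      # scaled quantization error of this column
        W[:, i:] -= np.outer(err, U[i, i:])          # push the error onto later columns
    return Q


rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32)) @ rng.normal(size=(32, 32))  # made-up correlated calibration inputs
W = rng.normal(size=(16, 32))                                # made-up layer weights

Q_gptq = gptq_like(W, X)
Q_rtn = quant_dequant(W, *[g[:, None] if isinstance(g, np.ndarray) else g for g in make_grid(W)])
print("output error, round-to-nearest:", np.linalg.norm(X @ W.T - X @ Q_rtn.T))
print("output error, compensated:     ", np.linalg.norm(X @ W.T - X @ Q_gptq.T))
```

On correlated inputs the compensated version usually produces a lower output error than plain round-to-nearest, although this is not guaranteed for every matrix.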

There are many open-source libraries, such as AutoGPTQ, that implement this quantization.


AWQ, or Activation-Aware Weight Quantisation, is similar to GPTQ, but it looks at the activations (the intermediate values a layer receives as input when the model runs) to decide which weights matter most, and it protects that small fraction of "important" weights during quantization. In this way, the model's performance is retained while all weights are still stored in low precision. The primary steps are:

  1. Collecting activation statistics on a small calibration dataset (to find the important weights).
  2. Identifying the weight channels that are multiplied by the largest activations, since the model's output is most sensitive to quantization error in those weights.
  3. Scaling these important channels up before quantization (and scaling the corresponding activations down by the same factor), which reduces their relative rounding error. The AWQ paper notes that keeping the important weights in higher precision also works but is inefficient on hardware, so it relies on per-channel scaling and quantizes every weight to the same bit width.
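The idea of protecting important weights through activation-based scaling can be sketched as follows. This is an illustrative simplification with hypothetical names (`quantize_rows`, `awq_like`): per-input-channel scales are set to the mean activation magnitude raised to a power alpha, and alpha is picked by a small grid search on calibration data. Real implementations differ in grouping, clipping, and how the scales are folded into neighbouring layers.

```python
import numpy as np


def quantize_rows(W, bits=4):
    """Symmetric round-to-nearest quantization per output row, returned de-quantized."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale


def awq_like(W, X, bits=4, alphas=np.linspace(0.0, 1.0, 11)):
    """Search per-channel scales s = mean(|X|) ** alpha that minimize the output error."""
    reference = X @ W.T
    act_mag = np.abs(X).mean(axis=0)          # activation statistic per input channel
    best = None
    for alpha in alphas:
        s = act_mag ** alpha                  # alpha = 0 means no scaling at all
        W_q = quantize_rows(W * s, bits)      # scale weights up, then quantize
        err = np.linalg.norm(reference - (X / s) @ W_q.T)  # activations scaled down
        if best is None or err < best[0]:
            best = (err, alpha, s, W_q)
    return best


rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))
X[:, :4] *= 20.0                              # made-up data: four channels with large activations
W = rng.normal(size=(32, 64))

awq_err, best_alpha, s, W_q = awq_like(W, X)
rtn_err = np.linalg.norm(X @ W.T - X @ quantize_rows(W).T)
print("plain round-to-nearest error:", rtn_err)
print("activation-aware error:      ", awq_err, "at alpha =", best_alpha)
```

Because alpha = 0 (no scaling) is part of the search, the selected configuration is never worse than plain round-to-nearest on the calibration data. Without quantization the scaling cancels exactly, since (X / s) @ (W * s).T equals X @ W.T.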

Quantisation-Aware Training (QAT)

Quantization-aware training (QAT) is a technique in which quantization is incorporated into the training phase. The model experiences the loss of precision while it is being trained and learns to compensate for it. This operates as follows:

  1. First, during training, simulated ("fake") quantization operations are inserted into the model, imitating the rounding and clipping that real quantization would apply.
  2. Then, during the neural network's forward propagation, the model exhibits low-precision behaviour because of the simulated quantization operations.
  3. Then, during backpropagation, the model learns how to compensate for the low-precision behaviour. Because rounding has zero gradient almost everywhere, gradients are typically passed through the rounding step unchanged (the straight-through estimator).

So, when the model is actually quantized after training is complete, it loses little accuracy, because its weights have already adapted to low-precision behaviour.
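The three steps can be illustrated on a toy linear model in NumPy. Everything here is an assumption made for the example: the fixed scale of 64, the made-up regression data, and the hand-written gradient. Deep learning frameworks provide their own fake-quantization modules, which this sketch does not use.

```python
import numpy as np

SCALE = 64.0  # hypothetical fixed scale: an int8 grid with step 1/64


def fake_quant(w):
    """Simulate int8 quantization: round and clip, but keep the result in floating point."""
    return np.clip(np.round(w * SCALE), -128, 127) / SCALE


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))                  # made-up inputs
true_w = np.array([0.5, -1.0, 0.25, 0.75])     # made-up target weights
y = x @ true_w

w = np.zeros(4)   # full-precision "shadow" weights that training updates
lr = 0.1
losses = []
for step in range(200):
    w_q = fake_quant(w)                        # forward pass sees low-precision weights
    residual = x @ w_q - y
    losses.append(float(np.mean(residual ** 2)))
    grad_w_q = 2.0 * x.T @ residual / len(x)   # gradient with respect to the quantized weights
    w -= lr * grad_w_q                         # straight-through: treat d(w_q)/d(w) as 1

print("loss at start:", losses[0])
print("loss at end:  ", losses[-1])
print("quantized weights:", fake_quant(w))
```

The loss is always measured with the quantized weights, so the full-precision weights settle at values that still work well after rounding.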


In conclusion, quantization offers a promising solution to the challenges posed by the ever-increasing size of Large Language Models. As we continue to scale up the size of our models and strive for more complex and accurate language understanding, techniques like quantization will become increasingly important. Simplismart uses quantization to achieve our state-of-the-art inference engine speeds.

The world of large language models is evolving rapidly, and we should expect even more intriguing solutions to emerge soon. Stay tuned!