The Quantization Promise
"Just quantize your model to INT8 and deploy it on edge hardware." I hear this weekly. And it's true, mostly. But "mostly" can mean a 2% accuracy drop that costs a manufacturing client $50K/month in missed defects.
Let's talk about when quantization works, when it breaks, and how to know the difference.
What Quantization Does
Neural network weights are stored as 32-bit floating point numbers (FP32). Quantization reduces precision:
FP32: 32 bits per weight, baseline accuracy
FP16: 16 bits per weight, ~0% accuracy loss, 2× smaller
INT8: 8 bits per weight, 0-2% accuracy loss, 4× smaller
INT4: 4 bits per weight, 1-5% accuracy loss, 8× smaller

Smaller models = faster inference + less memory + lower power. The question is always: how much accuracy do you lose?
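Under the hood, the mapping is simple: pick a scale so the largest weight lands on the integer limit, then round everything else onto that grid. A minimal sketch of symmetric per-tensor INT8 quantization (illustrative, not any framework's exact scheme):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ~ q * scale."""
    scale = np.abs(weights).max() / 127.0   # map the largest |w| to the INT8 limit
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.42, -1.3, 0.07, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error per weight is bounded by scale / 2
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-7)  # True
```

That scale/2 error bound is the whole story: the bigger the weight range, the coarser the grid, and the more accuracy you risk.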
Post-Training Quantization (PTQ)
The simplest approach. Take a trained FP32 model, convert weights to INT8:
import torch
from torch.quantization import quantize_dynamic

# Load your FP32 model
model = torch.load('model_fp32.pth')

# Dynamic quantization (CPU inference)
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},  # which layer types to quantize
    dtype=torch.qint8
)

# Save: 4x smaller
torch.save(quantized_model.state_dict(), 'model_int8.pth')

For ONNX models (recommended for edge):
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    'model.onnx',
    'model_int8.onnx',
    weight_type=QuantType.QInt8
)

When PTQ Works Well
- Classification models (ResNet, EfficientNet)
- Object detection (YOLO, SSD)
- Simple NLP (sentiment analysis, classification)
- Models with >10M parameters
When PTQ Breaks
- Small models (<1M parameters): not enough redundancy to absorb rounding error
- Regression tasks: subtle numerical precision matters
- Models with outlier weights: a few extreme values distort the quantization range
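The outlier failure mode is easy to reproduce: one extreme weight inflates the quantization scale, so all the ordinary weights get snapped onto a much coarser grid. A toy demonstration with synthetic weights (the numbers are illustrative):

```python
import numpy as np

def int8_roundtrip(w):
    # Symmetric per-tensor INT8 quantize-dequantize
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=1000).astype(np.float32)  # typical weight distribution
w_outlier = w.copy()
w_outlier[0] = 10.0                                    # one extreme value

err_clean   = np.abs(w - int8_roundtrip(w)).mean()
err_outlier = np.abs(w[1:] - int8_roundtrip(w_outlier)[1:]).mean()
print(err_outlier / err_clean)  # dramatically worse for the ordinary weights
```

Per-channel quantization and outlier clipping exist precisely to contain this effect.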
Quantization-Aware Training (QAT)
When PTQ accuracy drops too much, train with quantization in the loop:
import torch.quantization as quant

# Prepare model for QAT
model.train()
model.qconfig = quant.get_default_qat_qconfig('fbgemm')
quant.prepare_qat(model, inplace=True)

# Fine-tune for a few epochs with quantization simulation
for epoch in range(5):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

# Convert to an actual quantized model
model.eval()
quantized_model = quant.convert(model)

QAT recovers 50-80% of the accuracy lost by PTQ. It's more work but essential for precision-critical edge deployments.
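What "quantization simulation" means here: during training, the forward pass runs weights through a quantize-dequantize round trip so the loss actually sees the rounding error, while the backward pass treats the op as identity (a straight-through estimator). A framework-free sketch of that fake-quant op, not PyTorch's internal implementation:

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize round trip; gradients pass straight through."""
    scale = w.detach().abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127)
    w_q = q * scale
    # Straight-through estimator: forward uses w_q, backward behaves like identity
    return w + (w_q - w).detach()

w = torch.randn(4, requires_grad=True)
loss = fake_quant_int8(w).sum()
loss.backward()
print(w.grad)  # all ones: identity gradient, despite the rounding in the forward pass
```

Because the model trains against its own rounding error, the weights settle into values that survive conversion to real INT8.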
INT4: The Frontier
INT4 quantization is aggressive. Each weight gets only 16 possible values. For LLMs, it's surprisingly effective:
Llama 3.2 3B:
FP16: 5.9 perplexity, 6.0 GB
INT8: 6.0 perplexity, 3.0 GB (+0.1 perplexity)
INT4: 6.4 perplexity, 1.8 GB (+0.5 perplexity)

For vision models, INT4 is riskier:
YOLOv8-medium (COCO mAP):
FP16: 50.2 mAP
INT8: 49.8 mAP (-0.4)
INT4: 47.1 mAP (-3.1), noticeable in production

My Quantization Decision Framework
Start with INT8 PTQ
  ↓
Measure accuracy on YOUR validation set (not public benchmarks)
  ↓
Accuracy drop < 1%? → Ship it
  ↓
Accuracy drop 1-3%? → Try QAT, then re-measure
  ↓
Accuracy drop > 3%? → Stay at FP16, get bigger hardware
  ↓
Need INT4? → Only for LLMs. For vision, use INT8 + pruning instead

The Validation Trap
Public benchmark accuracy ≠ your production accuracy. I've seen models that lose 0.5% on ImageNet lose 4% on the client's specific product images. Always validate on production-representative data.
Build a validation pipeline:
# Run inference on 1000 production images with both models
python validate.py --model model_fp16.onnx --data prod_val/ > fp16_results.json
python validate.py --model model_int8.onnx --data prod_val/ > int8_results.json
python compare.py fp16_results.json int8_results.json

If the accuracy delta is acceptable, ship the quantized model. If not, you have the data to explain why you need better hardware.
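A compare.py for this pipeline can be tiny. The sketch below assumes validate.py writes a JSON list of per-image records with a boolean "correct" field; that format is hypothetical, so adapt it to whatever your validate.py actually emits:

```python
import json
import sys

def accuracy(path):
    # Hypothetical format: [{"image": "...", "correct": true}, ...]
    with open(path) as f:
        results = json.load(f)
    return sum(r["correct"] for r in results) / len(results)

# Guard on argv length so the module can be imported without CLI args
if __name__ == "__main__" and len(sys.argv) == 3:
    fp16, int8 = accuracy(sys.argv[1]), accuracy(sys.argv[2])
    delta = fp16 - int8
    print(f"FP16: {fp16:.4f}  INT8: {int8:.4f}  delta: {delta:.4f}")
    # Fail the pipeline if quantization gives up more than 1% accuracy
    sys.exit(0 if delta < 0.01 else 1)
```

Wiring the exit code into CI means a regression in the quantized model blocks the deploy instead of surfacing in production.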
Quantization is a tool, not a magic wand. Use it wisely.
