
Model Quantization: INT4 and INT8 for Edge AI

Quantizing models for edge deployment isn't free. Here's when INT4 and INT8 quantization works, when it doesn't, and how to measure the accuracy trade-off.

Luca Berton
· 1 min read

The Quantization Promise

“Just quantize your model to INT8 and deploy it on edge hardware.” I hear this weekly. And it’s true, mostly. But “mostly” can mean a 2% accuracy drop that costs a manufacturing client $50K/month in missed defects.

Let’s talk about when quantization works, when it breaks, and how to know the difference.

What Quantization Does

Neural network weights are stored as 32-bit floating point numbers (FP32). Quantization reduces precision:

FP32  → 32 bits per weight → Baseline accuracy
FP16  → 16 bits per weight → ~0% accuracy loss, 2× smaller
INT8  → 8 bits per weight  → 0-2% accuracy loss, 4× smaller
INT4  → 4 bits per weight  → 1-5% accuracy loss, 8× smaller

Smaller models = faster inference + less memory + lower power. The question is always: how much accuracy do you lose?
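Under the hood, INT8 quantization maps each FP32 weight onto 256 integer levels through a scale factor. A minimal sketch of the symmetric per-tensor variant (real toolkits layer zero-points and per-channel scales on top of this, but the trade-off is the same):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: map FP32 weights onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0  # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values; rounding error is at most scale/2."""
    return q.astype(np.float32) * scale

w = np.random.randn(512).astype(np.float32)
q, scale = quantize_int8(w)
max_err = np.abs(w - dequantize(q, scale)).max()
```

The scale is the whole game: everything the tensor can express is a multiple of it, which is why the accuracy question has no universal answer.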

Post-Training Quantization (PTQ)

The simplest approach. Take a trained FP32 model, convert weights to INT8:

import torch
from torch.quantization import quantize_dynamic

# Load your FP32 model
model = torch.load('model_fp32.pth')

# Dynamic quantization (CPU inference)
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},  # Which layers to quantize
    dtype=torch.qint8
)

# Save the quantized weights (about 4x smaller on disk)
torch.save(quantized_model.state_dict(), 'model_int8.pth')

For ONNX models (recommended for edge):

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    'model.onnx',
    'model_int8.onnx',
    weight_type=QuantType.QInt8
)

When PTQ Works Well

  • Classification models (ResNet, EfficientNet)
  • Object detection (YOLO, SSD)
  • Simple NLP (sentiment analysis, classification)
  • Models with >10M parameters

When PTQ Breaks

  • Small models (<1M parameters): not enough redundancy to absorb the precision loss
  • Regression tasks: subtle numerical precision matters
  • Models with outlier weights: a few extreme values distort the quantization range
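That last failure mode is easy to reproduce: with a per-tensor scale, a single extreme weight stretches the quantization range and every other weight loses resolution. A toy demonstration (sizes and values are illustrative):

```python
import numpy as np

def int8_roundtrip_error(w):
    """Mean absolute error after a symmetric per-tensor INT8 round trip."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return np.abs(w - q * scale).mean()

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, 10_000).astype(np.float32)

err_clean = int8_roundtrip_error(w)
# One extreme weight stretches the quantization range ~25x,
# coarsening every other weight in the tensor.
err_outlier = int8_roundtrip_error(np.append(w, np.float32(5.0)))
```

Per-channel or per-group scales are the usual mitigation, because an outlier then only degrades its own slice of the tensor.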

Quantization-Aware Training (QAT)

When PTQ accuracy drops too much, train with quantization in the loop:

import torch.quantization as quant

# Prepare the model for QAT (inserts fake-quantization observers)
model.train()
model.qconfig = quant.get_default_qat_qconfig('fbgemm')
quant.prepare_qat(model, inplace=True)

# Fine-tune for a few epochs with quantization simulated in the forward pass
for epoch in range(5):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

# Convert to an actual INT8 model
model.eval()
quantized_model = quant.convert(model)

QAT recovers 50-80% of the accuracy lost by PTQ. It’s more work but essential for precision-critical edge deployments.

INT4: The Frontier

INT4 quantization is aggressive. Each weight gets only 16 possible values. For LLMs, it’s surprisingly effective:

Llama 3.2 3B:
  FP16:  5.9 perplexity,  6.0 GB
  INT8:  6.0 perplexity,  3.0 GB  (+0.1 perplexity)
  INT4:  6.4 perplexity,  1.8 GB  (+0.5 perplexity)
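Part of why INT4 holds up for LLMs is that practical schemes (GPTQ, AWQ, and friends) don't use one scale for the whole tensor: they keep a separate scale per small group of weights, which contains outlier damage. A toy sketch of symmetric group-wise INT4 (group size and layout are illustrative, not any specific library's format):

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric INT4 (16 levels, [-8, 7]) with one scale per group."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4_groupwise(q, scales, shape):
    """Approximate reconstruction; per-weight error is at most scale/2."""
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(1024).astype(np.float32)
q, scales = quantize_int4_groupwise(w)
err = np.abs(w - dequantize_int4_groupwise(q, scales, w.shape)).max()
```

In real formats two INT4 values pack into one byte, and the per-group scales add a small overhead, which is partly why the table above shows 1.8 GB rather than a clean quarter of the FP16 size.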

For vision models, INT4 is riskier:

YOLOv8-medium (COCO mAP):
  FP16:  50.2 mAP
  INT8:  49.8 mAP  (-0.4)
  INT4:  47.1 mAP  (-3.1)  ← noticeable in production

My Quantization Decision Framework

Start with INT8 PTQ
  ↓
Measure accuracy on YOUR validation set (not public benchmarks)
  ↓
Accuracy drop < 1%? → Ship it
  ↓
Accuracy drop 1-3%? → Try QAT, then re-measure
  ↓
Accuracy drop > 3%? → Stay at FP16, get bigger hardware
  ↓
Need INT4? → Only for LLMs. For vision, use INT8 + pruning instead
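For teams that want this codified in CI, the framework above reduces to a few lines (the thresholds are my rules of thumb, not universal constants):

```python
def quantization_decision(accuracy_drop_pct: float) -> str:
    """Map a measured accuracy drop (percentage points, from YOUR
    validation set) to the next action in the framework above."""
    if accuracy_drop_pct < 1.0:
        return "ship INT8"
    if accuracy_drop_pct <= 3.0:
        return "try QAT, then re-measure"
    return "stay at FP16, get bigger hardware"
```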

The Validation Trap

Public benchmark accuracy ≠ your production accuracy. I’ve seen models that lose 0.5% on ImageNet lose 4% on the client’s specific product images. Always validate on production-representative data.

Build a validation pipeline:

# Run inference on 1000 production images with both models
python validate.py --model model_fp16.onnx --data prod_val/ > fp16_results.json
python validate.py --model model_int8.onnx --data prod_val/ > int8_results.json
python compare.py fp16_results.json int8_results.json

If the accuracy delta is acceptable, ship the quantized model. If not, you have the data to explain why you need better hardware.
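The compare.py step is left to the reader above; a minimal sketch of what it might do, assuming each results file stores run totals as {"correct": ..., "total": ...} (the JSON shape and the 1% threshold are assumptions for illustration, not my actual script):

```python
import json

def load_accuracy(path):
    """Assumed JSON shape: {"correct": int, "total": int}."""
    with open(path) as f:
        results = json.load(f)
    return results["correct"] / results["total"]

def compare(baseline_path, quantized_path, max_drop_pct=1.0):
    """Return the accuracy drop in percentage points and print a verdict."""
    base = load_accuracy(baseline_path)
    quant = load_accuracy(quantized_path)
    drop_pct = (base - quant) * 100.0
    verdict = "SHIP" if drop_pct < max_drop_pct else "INVESTIGATE"
    print(f"baseline {base:.4f}  quantized {quant:.4f}  "
          f"drop {drop_pct:+.2f}pp  -> {verdict}")
    return drop_pct
```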

Quantization is a tool, not a magic wand. Use it wisely.
