
Model Quantization for Edge AI: INT4, INT8, and When Accuracy Actually Drops

Luca Berton 1 min read
#edge-ai #quantization #model-optimization #int8 #int4 #deployment

The Quantization Promise

“Just quantize your model to INT8 and deploy it on edge hardware.” I hear this weekly. And it’s true — mostly. But “mostly” can mean a 2% accuracy drop that costs a manufacturing client $50K/month in missed defects.

Let’s talk about when quantization works, when it breaks, and how to know the difference.

What Quantization Does

Neural network weights are stored as 32-bit floating point numbers (FP32). Quantization reduces precision:

FP32  → 32 bits per weight → Baseline accuracy
FP16  → 16 bits per weight → ~0% accuracy loss, 2× smaller
INT8  → 8 bits per weight  → 0-2% accuracy loss, 4× smaller
INT4  → 4 bits per weight  → 1-5% accuracy loss, 8× smaller

Smaller models = faster inference + less memory + lower power. The question is always: how much accuracy do you lose?
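The mapping itself is simple. A minimal NumPy sketch of symmetric per-tensor INT8 quantization (function names are mine, not from any library): pick a scale so the largest weight lands on 127, round everything onto the integer grid, and multiply back to approximate the original values.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: map FP32 values onto the integers -127..127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original FP32 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# q uses 1 byte per weight vs 4 for FP32; worst-case error is half a quantization step
```

The accuracy question is exactly this rounding error, propagated through every layer of the network.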

Post-Training Quantization (PTQ)

The simplest approach. Take a trained FP32 model, convert weights to INT8:

import torch
from torch.quantization import quantize_dynamic

# Load your trained FP32 model (a full nn.Module object) and switch to eval mode
model = torch.load('model_fp32.pth')
model.eval()

# Dynamic quantization: weights stored as INT8, activations quantized on the fly (CPU inference)
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},  # which layer types to quantize
    dtype=torch.qint8
)

# Save — roughly 4× smaller for Linear-heavy models
torch.save(quantized_model.state_dict(), 'model_int8.pth')

For ONNX models (recommended for edge):

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    'model.onnx',
    'model_int8.onnx',
    weight_type=QuantType.QInt8
)

When PTQ Works Well

  • Classification models (ResNet, EfficientNet)
  • Object detection (YOLO, SSD)
  • Simple NLP (sentiment analysis, classification)
  • Models with >10M parameters

When PTQ Breaks

  • Small models (<1M parameters) — not enough redundancy
  • Regression tasks — subtle numerical precision matters
  • Models with outlier weights — a few extreme values distort the quantization range
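That last failure mode is easy to demonstrate. A NumPy sketch (variable names and numbers are mine, chosen for illustration): one extreme weight stretches the quantization range, so every other weight gets coarser resolution.

```python
import numpy as np

def int8_roundtrip(w):
    """Quantize to symmetric INT8 and back; returns the reconstructed weights."""
    scale = np.abs(w).max() / 127.0
    return np.clip(np.round(w / scale), -127, 127) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, 10_000)  # a typical, tightly clustered weight tensor
clean_err = np.abs(w - int8_roundtrip(w)).mean()

w_outlier = w.copy()
w_outlier[0] = 8.0  # a single extreme weight stretches the scale ~40×
outlier_err = np.abs(w_outlier - int8_roundtrip(w_outlier)).mean()
# mean reconstruction error for the other 9,999 weights jumps by over an order of magnitude
```

This is why per-channel or group-wise scales (one scale per slice of the tensor instead of one global scale) are the standard mitigation.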

Quantization-Aware Training (QAT)

When PTQ accuracy drops too much, train with quantization in the loop:

import torch.quantization as quant

# Prepare the FP32 model for QAT (inserts fake-quantization observers)
model.train()
model.qconfig = quant.get_default_qat_qconfig('fbgemm')  # x86 backend
quant.prepare_qat(model, inplace=True)

# Fine-tune for a few epochs with quantization simulated in the forward pass
for epoch in range(5):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

# Convert to an actual INT8 quantized model
model.eval()
quantized_model = quant.convert(model)

QAT recovers 50-80% of the accuracy lost by PTQ. It’s more work but essential for precision-critical edge deployments.

INT4: The Frontier

INT4 quantization is aggressive. Each weight gets only 16 possible values. For LLMs, it’s surprisingly effective:

Llama 3.2 3B:
  FP16:  5.9 perplexity,  6.0 GB
  INT8:  6.0 perplexity,  3.0 GB  (+0.1 perplexity)
  INT4:  6.4 perplexity,  1.8 GB  (+0.5 perplexity)
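Part of why INT4 holds up for LLMs is that practical schemes don't use one scale for the whole tensor: they use group-wise scales, one FP scale per small group of weights. A rough NumPy sketch of the idea (names are mine; real kernels also pack two 4-bit values per byte, which this skips, and symmetric rounding as shown uses 15 of the 16 levels):

```python
import numpy as np

def quantize_int4_grouped(w, group_size=128):
    """Symmetric group-wise INT4: one scale per group, values in -7..7."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q, scales):
    return (q * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, scales = quantize_int4_grouped(w)
w_hat = dequantize_int4(q, scales)
# each group's error is bounded by half a step of that group's local scale,
# so an outlier in one group cannot degrade the rest of the tensor
```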

For vision models, INT4 is riskier:

YOLOv8-medium (COCO mAP):
  FP16:  50.2 mAP
  INT8:  49.8 mAP  (-0.4)
  INT4:  47.1 mAP  (-3.1)  ← noticeable in production

My Quantization Decision Framework

1. Start with INT8 PTQ.
2. Measure accuracy on YOUR validation set (not public benchmarks).
3. Accuracy drop < 1%? → Ship it.
4. Accuracy drop 1-3%? → Try QAT, then re-measure.
5. Accuracy drop > 3%? → Stay at FP16, get bigger hardware.
6. Need INT4? → Only for LLMs. For vision, use INT8 + pruning instead.
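The framework is simple enough to encode. A toy helper (entirely my own naming, not from any library) that turns a measured INT8 drop into the next action:

```python
def next_step(int8_drop_pct, need_smaller=False, model_type="vision"):
    """Map the measured INT8 PTQ accuracy drop (on your own validation set) to an action."""
    if int8_drop_pct > 3.0:
        return "stay at FP16; get bigger hardware"
    if int8_drop_pct >= 1.0:
        base = "try QAT, then re-measure"
    else:
        base = "ship INT8 PTQ"
    if need_smaller:
        extra = "INT4 is viable" if model_type == "llm" else "prefer INT8 + pruning over INT4"
        return base + "; " + extra
    return base
```

For example, `next_step(0.4)` returns "ship INT8 PTQ", while `next_step(2.0)` sends you to QAT first.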

The Validation Trap

Public benchmark accuracy ≠ your production accuracy. I’ve seen models that lose 0.5% on ImageNet lose 4% on the client’s specific product images. Always validate on production-representative data.

Build a validation pipeline:

# Run inference on 1000 production images with both models
python validate.py --model model_fp16.onnx --data prod_val/ > fp16_results.json
python validate.py --model model_int8.onnx --data prod_val/ > int8_results.json
python compare.py fp16_results.json int8_results.json
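The compare step isn't shown above; here is a minimal sketch of what a compare.py could do, under my own assumption (not stated in the original) that each results file is a JSON list of records with "pred" and "label" fields:

```python
import json
import sys

def accuracy(results):
    """Fraction of records where the prediction matches the ground-truth label."""
    return sum(r["pred"] == r["label"] for r in results) / len(results)

def compare(fp16_path, int8_path, max_drop_pp=1.0):
    """Print both accuracies and the delta; return the drop in percentage points."""
    with open(fp16_path) as f:
        fp16 = json.load(f)
    with open(int8_path) as f:
        int8 = json.load(f)
    drop_pp = (accuracy(fp16) - accuracy(int8)) * 100
    verdict = "SHIP" if drop_pp < max_drop_pp else "INVESTIGATE"
    print(f"FP16 {accuracy(fp16):.4f}  INT8 {accuracy(int8):.4f}  "
          f"drop {drop_pp:+.2f}pp -> {verdict}")
    return drop_pp

if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])
```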

If the accuracy delta is acceptable, ship the quantized model. If not, you have the data to explain why you need better hardware.

Quantization is a tool, not a magic wand. Use it wisely.


Luca Berton

AI & Cloud Advisor with 18+ years experience. Author of 8 technical books, creator of Ansible Pilot, and instructor at CopyPasteLearn Academy. Speaker at KubeCon EU & Red Hat Summit 2026.
