The Quantization Promise
“Just quantize your model to INT8 and deploy it on edge hardware.” I hear this weekly. And it’s true — mostly. But “mostly” can mean a 2% accuracy drop that costs a manufacturing client $50K/month in missed defects.
Let’s talk about when quantization works, when it breaks, and how to know the difference.
What Quantization Does
Neural network weights are typically stored as 32-bit floating-point numbers (FP32). Quantization reduces that precision:
FP32 → 32 bits per weight → Baseline accuracy
FP16 → 16 bits per weight → ~0% accuracy loss, 2× smaller
INT8 → 8 bits per weight → 0-2% accuracy loss, 4× smaller
INT4 → 4 bits per weight → 1-5% accuracy loss, 8× smaller
Smaller models = faster inference + less memory + lower power. The question is always: how much accuracy do you lose?
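Under the hood, INT8 quantization maps each float onto a grid of 256 integer levels via a per-tensor scale. A minimal sketch of symmetric quantization, with made-up weights:

```python
import numpy as np

# Symmetric INT8 quantization: map floats in [-max|w|, +max|w|]
# onto integers in [-127, 127], keeping one FP32 scale per tensor.
weights = np.array([0.81, -0.42, 0.07, -1.30, 0.55], dtype=np.float32)

scale = np.abs(weights).max() / 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to see the rounding error the model must absorb
dequantized = q.astype(np.float32) * scale
max_error = np.abs(weights - dequantized).max()
print(q)          # int8 codes: 1 byte each instead of 4
print(max_error)  # worst-case per-weight error, bounded by scale/2
```

The whole accuracy question comes down to whether the model can tolerate that rounding error at every weight.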
Post-Training Quantization (PTQ)
The simplest approach. Take a trained FP32 model, convert weights to INT8:
import torch
from torch.ao.quantization import quantize_dynamic  # torch.quantization is deprecated

# Load your trained FP32 model
model = torch.load('model_fp32.pth')
model.eval()

# Dynamic quantization: weights stored as INT8, activations
# quantized on the fly at inference time (CPU only)
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},  # which layer types to quantize
    dtype=torch.qint8
)

# Save — roughly 4x smaller on disk
torch.save(quantized_model.state_dict(), 'model_int8.pth')
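A quick way to sanity-check the size win before touching a real model — a sketch with a toy Linear-only network (layer sizes are illustrative):

```python
import os
import tempfile
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Toy FP32 model: Linear layers dominate, so dynamic INT8
# quantization should shrink it close to 4x on disk.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with tempfile.TemporaryDirectory() as d:
    fp32_path = os.path.join(d, 'fp32.pth')
    int8_path = os.path.join(d, 'int8.pth')
    torch.save(model.state_dict(), fp32_path)
    torch.save(quantized.state_dict(), int8_path)
    ratio = os.path.getsize(fp32_path) / os.path.getsize(int8_path)

print(f'compression: {ratio:.1f}x')  # close to 4x for Linear-heavy models
```

The ratio lands a bit under 4x because biases and per-tensor scales stay in FP32.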
For ONNX models (recommended for edge):
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    'model.onnx',
    'model_int8.onnx',
    weight_type=QuantType.QInt8
)
When PTQ Works Well
- Classification models (ResNet, EfficientNet)
- Object detection (YOLO, SSD)
- Simple NLP (sentiment analysis, classification)
- Models with >10M parameters
When PTQ Breaks
- Small models (<1M parameters) — not enough redundancy
- Regression tasks — subtle numerical precision matters
- Models with outlier weights — a few extreme values distort the quantization range
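The outlier problem is easy to demonstrate: one extreme value inflates the quantization scale for the entire tensor, coarsening the grid for every normal weight. A sketch with synthetic weights:

```python
import numpy as np

# One outlier inflates the INT8 scale and wipes out resolution
# for all the "normal" weights that share the tensor with it.
normal = np.random.default_rng(0).normal(0, 0.05, 1000).astype(np.float32)
with_outlier = np.append(normal, 4.0).astype(np.float32)

def int8_roundtrip_error(w):
    # Symmetric per-tensor INT8 round trip, mean absolute error
    scale = np.abs(w).max() / 127
    q = np.clip(np.round(w / scale), -127, 127)
    return np.abs(w - q * scale).mean()

print(int8_roundtrip_error(normal))        # fine-grained steps
print(int8_roundtrip_error(with_outlier))  # roughly 20x coarser
```

This is why per-channel quantization and outlier-aware schemes exist: they keep one bad value from taxing the whole tensor.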
Quantization-Aware Training (QAT)
When PTQ accuracy drops too much, train with quantization in the loop:
import torch.ao.quantization as quant

# Prepare the model for QAT: insert fake-quantize observers
model.train()
model.qconfig = quant.get_default_qat_qconfig('fbgemm')  # x86 backend
quant.prepare_qat(model, inplace=True)

# Fine-tune for a few epochs with quantization simulated in the forward pass
for epoch in range(5):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

# Convert to an actual INT8 model
model.eval()
quantized_model = quant.convert(model)
QAT recovers 50-80% of the accuracy lost by PTQ. It’s more work but essential for precision-critical edge deployments.
INT4: The Frontier
INT4 quantization is aggressive. Each weight gets only 16 possible values. For LLMs, it’s surprisingly effective:
Llama 3.2 3B:
FP16: 5.9 perplexity, 6.0 GB
INT8: 6.0 perplexity, 3.0 GB (+0.1 perplexity)
INT4: 6.4 perplexity, 1.8 GB (+0.5 perplexity)
For vision models, INT4 is riskier:
YOLOv8-medium (COCO mAP):
FP16: 50.2 mAP
INT8: 49.8 mAP (-0.4)
INT4: 47.1 mAP (-3.1) ← noticeable in production
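One reason INT4 earns its keep despite the accuracy risk: with only 16 levels, two weights pack into a single byte. A sketch of the nibble packing (the codes themselves are made up):

```python
import numpy as np

# Six already-quantized 4-bit codes (unsigned 0..15 for illustration);
# packing stores two codes per byte, halving memory versus INT8.
codes = np.array([3, 12, 0, 15, 7, 9], dtype=np.uint8)

packed = (codes[0::2] << 4) | codes[1::2]   # high nibble | low nibble

# Unpacking reverses the shift and mask
unpacked = np.empty_like(codes)
unpacked[0::2] = packed >> 4
unpacked[1::2] = packed & 0x0F

print(packed.nbytes, codes.nbytes)  # 3 bytes vs 6
```

Real INT4 kernels (e.g. GPTQ/AWQ-style) add per-group scales on top of this, but the storage trick is the same.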
My Quantization Decision Framework
Start with INT8 PTQ
↓
Measure accuracy on YOUR validation set (not public benchmarks)
↓
Accuracy drop < 1%? → Ship it
↓
Accuracy drop 1-3%? → Try QAT, then re-measure
↓
Accuracy drop > 3%? → Stay at FP16, get bigger hardware
↓
Need INT4? → Only for LLMs. For vision, use INT8 + pruning instead
The Validation Trap
Public benchmark accuracy ≠ your production accuracy. I’ve seen models that lose 0.5% on ImageNet lose 4% on the client’s specific product images. Always validate on production-representative data.
Build a validation pipeline:
# Run inference on 1000 production images with both models
python validate.py --model model_fp16.onnx --data prod_val/ > fp16_results.json
python validate.py --model model_int8.onnx --data prod_val/ > int8_results.json
python compare.py fp16_results.json int8_results.json
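compare.py above stands in for whatever comparison script you write. A minimal sketch that also encodes the decision thresholds from the framework, assuming each results file is a list of {"correct": bool} records (that schema is an assumption, not a standard):

```python
def accuracy(results):
    # results: list of per-image dicts with a boolean 'correct' field
    return sum(r['correct'] for r in results) / len(results)

def decide(fp16_acc, int8_acc):
    # Thresholds from the decision framework above
    delta = fp16_acc - int8_acc
    if delta < 0.01:
        return 'ship INT8'
    if delta <= 0.03:
        return 'try QAT, then re-measure'
    return 'stay FP16'

# Stand-in data; in practice, json.load the two results files
fp16 = [{'correct': True}] * 95 + [{'correct': False}] * 5
int8 = [{'correct': True}] * 93 + [{'correct': False}] * 7
print(decide(accuracy(fp16), accuracy(int8)))  # -> 'try QAT, then re-measure'
```

Keeping the thresholds in code means the ship/no-ship call is reproducible, not a judgment made fresh each release.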
If the accuracy delta is acceptable, ship the quantized model. If not, you have the data to explain why you need better hardware.
Quantization is a tool, not a magic wand. Use it wisely.