The Quantization Promise
"Just quantize your model to INT8 and deploy it on edge hardware." I hear this weekly. And it's true, mostly. But "mostly" can mean a 2% accuracy drop that costs a manufacturing client $50K/month in missed defects.
Let's talk about when quantization works, when it breaks, and how to know the difference.
What Quantization Does
Neural network weights are stored as 32-bit floating point numbers (FP32). Quantization reduces precision:
FP32: 32 bits per weight, baseline accuracy
FP16: 16 bits per weight, ~0% accuracy loss, 2× smaller
INT8: 8 bits per weight, 0-2% accuracy loss, 4× smaller
INT4: 4 bits per weight, 1-5% accuracy loss, 8× smaller

Smaller models = faster inference + less memory + lower power. The question is always: how much accuracy do you lose?
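Under the hood, the mapping is simple: pick a scale so the largest weight lands on the integer limit, then round everything else onto that grid. A minimal sketch of symmetric per-tensor INT8 quantization (illustrative, not any framework's exact scheme):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ~ q * scale."""
    scale = np.abs(weights).max() / 127.0   # map the largest |w| to the INT8 limit
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.42, -1.3, 0.07, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error per weight is bounded by scale / 2
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-7)  # True
```

That scale/2 error bound is the whole story: the bigger the weight range, the coarser the grid, and the more accuracy you risk.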
Post-Training Quantization (PTQ)
The simplest approach. Take a trained FP32 model, convert weights to INT8:
import torch
from torch.quantization import quantize_dynamic

# Load your FP32 model
model = torch.load('model_fp32.pth')

# Dynamic quantization (CPU inference)
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},  # which layer types to quantize
    dtype=torch.qint8
)

# Save: 4x smaller
torch.save(quantized_model.state_dict(), 'model_int8.pth')

For ONNX models (recommended for edge):
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    'model.onnx',
    'model_int8.onnx',
    weight_type=QuantType.QInt8
)

When PTQ Works Well
- Classification models (ResNet, EfficientNet)
- Object detection (YOLO, SSD)
- Simple NLP (sentiment analysis, classification)
- Models with >10M parameters
When PTQ Breaks
- Small models (<1M parameters): not enough redundancy to absorb rounding error
- Regression tasks: subtle numerical precision matters
- Models with outlier weights: a few extreme values distort the quantization range
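The outlier failure mode is easy to reproduce: one extreme weight inflates the quantization scale, so all the ordinary weights get snapped onto a much coarser grid. A toy demonstration with synthetic weights (the numbers are illustrative):

```python
import numpy as np

def int8_roundtrip(w):
    # Symmetric per-tensor INT8 quantize-dequantize
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=1000).astype(np.float32)  # typical weight distribution
w_outlier = w.copy()
w_outlier[0] = 10.0                                    # one extreme value

err_clean   = np.abs(w - int8_roundtrip(w)).mean()
err_outlier = np.abs(w[1:] - int8_roundtrip(w_outlier)[1:]).mean()
print(err_outlier / err_clean)  # dramatically worse for the ordinary weights
```

Per-channel quantization and outlier clipping exist precisely to contain this effect.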
Quantization-Aware Training (QAT)
When PTQ accuracy drops too much, train with quantization in the loop:
import torch.quantization as quant

# Prepare model for QAT
model.train()
model.qconfig = quant.get_default_qat_qconfig('fbgemm')
quant.prepare_qat(model, inplace=True)

# Fine-tune for a few epochs with quantization simulation
for epoch in range(5):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

# Convert to an actual quantized model
model.eval()
quantized_model = quant.convert(model)

QAT recovers 50-80% of the accuracy lost by PTQ. It's more work but essential for precision-critical edge deployments.
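What "quantization simulation" means here: during training, the forward pass runs weights through a quantize-dequantize round trip so the loss actually sees the rounding error, while the backward pass treats the op as identity (a straight-through estimator). A framework-free sketch of that fake-quant op, not PyTorch's internal implementation:

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize round trip; gradients pass straight through."""
    scale = w.detach().abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127)
    w_q = q * scale
    # Straight-through estimator: forward uses w_q, backward behaves like identity
    return w + (w_q - w).detach()

w = torch.randn(4, requires_grad=True)
loss = fake_quant_int8(w).sum()
loss.backward()
print(w.grad)  # all ones: identity gradient, despite the rounding in the forward pass
```

Because the model trains against its own rounding error, the weights settle into values that survive conversion to real INT8.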
INT4: The Frontier
INT4 quantization is aggressive. Each weight gets only 16 possible values. For LLMs, it's surprisingly effective:
Llama 3.2 3B:
FP16: 5.9 perplexity, 6.0 GB
INT8: 6.0 perplexity, 3.0 GB (+0.1 perplexity)
INT4: 6.4 perplexity, 1.8 GB (+0.5 perplexity)

For vision models, INT4 is riskier:
YOLOv8-medium (COCO mAP):
FP16: 50.2 mAP
INT8: 49.8 mAP (-0.4)
INT4: 47.1 mAP (-3.1), noticeable in production

My Quantization Decision Framework
Start with INT8 PTQ
  ↓
Measure accuracy on YOUR validation set (not public benchmarks)
  ↓
Accuracy drop < 1%? → Ship it
  ↓
Accuracy drop 1-3%? → Try QAT, then re-measure
  ↓
Accuracy drop > 3%? → Stay at FP16, get bigger hardware
  ↓
Need INT4? → Only for LLMs. For vision, use INT8 + pruning instead

The Validation Trap
Public benchmark accuracy ≠ your production accuracy. I've seen models that lose 0.5% on ImageNet lose 4% on the client's specific product images. Always validate on production-representative data.
Build a validation pipeline:
# Run inference on 1000 production images with both models
python validate.py --model model_fp16.onnx --data prod_val/ > fp16_results.json
python validate.py --model model_int8.onnx --data prod_val/ > int8_results.json
python compare.py fp16_results.json int8_results.json

If the accuracy delta is acceptable, ship the quantized model. If not, you have the data to explain why you need better hardware.
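A compare.py for this pipeline can be tiny. The sketch below assumes validate.py writes a JSON list of per-image records with a boolean "correct" field; that format is hypothetical, so adapt it to whatever your validate.py actually emits:

```python
import json
import sys

def accuracy(path):
    # Hypothetical format: [{"image": "...", "correct": true}, ...]
    with open(path) as f:
        results = json.load(f)
    return sum(r["correct"] for r in results) / len(results)

# Guard on argv length so the module can be imported without CLI args
if __name__ == "__main__" and len(sys.argv) == 3:
    fp16, int8 = accuracy(sys.argv[1]), accuracy(sys.argv[2])
    delta = fp16 - int8
    print(f"FP16: {fp16:.4f}  INT8: {int8:.4f}  delta: {delta:.4f}")
    # Fail the pipeline if quantization gives up more than 1% accuracy
    sys.exit(0 if delta < 0.01 else 1)
```

Wiring the exit code into CI means a regression in the quantized model blocks the deploy instead of surfacing in production.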
Quantization is a tool, not a magic wand. Use it wisely.
