The $45,000/Month Problem
A client called me after their second month in production. Their AI-powered customer support tool was working great — customers loved it. But the OpenAI bill was $45,000/month and growing 15% week over week.
Here’s how we cut it to $9,000 without degrading quality.
Technique 1: Model Routing (Saved 40%)
Not every query needs GPT-4o. A simple classifier routes queries to the cheapest capable model:
```python
class ModelRouter:
    MODELS = {
        'simple': {'name': 'gpt-4o-mini', 'cost_1k': 0.00075},
        'medium': {'name': 'gpt-4o', 'cost_1k': 0.0125},
        'complex': {'name': 'claude-sonnet-4', 'cost_1k': 0.018},
    }

    async def route(self, query: str, context: dict) -> str:
        # Simple heuristics first (free)
        if len(query.split()) < 20 and not context.get('requires_reasoning'):
            return 'simple'
        if any(kw in query.lower() for kw in ['compare', 'analyze', 'explain why']):
            return 'complex'
        # For ambiguous cases, use mini to classify
        classification = await self.classify_complexity(query)
        return classification
```

Result: 65% of queries routed to mini, 30% to GPT-4o, and 5% to Claude for complex reasoning. Average cost per query dropped from $0.08 to $0.03.
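To see how the heuristic tiers fire in practice, here is a self-contained sketch. The `classify_complexity` fallback is stubbed to return `'medium'`, since the real version (a cheap classification call to gpt-4o-mini) is outside this snippet:

```python
import asyncio

class ModelRouter:
    async def classify_complexity(self, query: str) -> str:
        # Stub: in production this would ask gpt-4o-mini to label
        # the query and parse the label it returns.
        return 'medium'

    async def route(self, query: str, context: dict) -> str:
        # Short queries with no reasoning flag go to the cheapest model
        if len(query.split()) < 20 and not context.get('requires_reasoning'):
            return 'simple'
        # Keyword triggers send the query to the strongest model
        if any(kw in query.lower() for kw in ['compare', 'analyze', 'explain why']):
            return 'complex'
        # Everything else falls through to the mini classifier
        return await self.classify_complexity(query)

router = ModelRouter()
print(asyncio.run(router.route("How do I reset my password?", {})))  # simple
print(asyncio.run(router.route("Compare plan A and plan B",
                               {"requires_reasoning": True})))       # complex
```

Note that the checks run cheapest-first: a short query exits before the keyword scan, so flagging `requires_reasoning` in context matters for short analytical queries.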
Technique 2: Semantic Caching (Saved 25%)
Many customer support queries are semantically identical. “How do I reset my password?” and “I forgot my password, how to reset?” should return the same cached response:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = similarity_threshold
        self.cache = {}  # In production, use Redis

    async def get_or_generate(self, query, generate_fn):
        query_embedding = self.encoder.encode(query)
        # Check cache for similar queries (linear scan; use a vector
        # index once the cache grows large)
        for cached_embedding, cached_response in self.cache.values():
            if cosine_similarity(query_embedding, cached_embedding) > self.threshold:
                return cached_response  # Cache hit — $0 cost
        # Cache miss — generate and store
        response = await generate_fn(query)
        self.cache[query] = (query_embedding, response)
        return response
```

Cache hit rate after 2 weeks: 38%. That's 38% of queries that cost zero tokens.
Technique 3: Prompt Compression (Saved 10%)
System prompts and context were bloated. We compressed without losing quality:
```python
# Before: 1,200 tokens
SYSTEM_PROMPT_BEFORE = """
You are a helpful customer support agent for TechCorp.
You should always be polite, professional, and empathetic.
When helping customers, make sure to:
1. Greet them warmly
2. Understand their issue fully before responding
3. Provide step-by-step instructions when applicable
... (800 more tokens of instructions)
"""

# After: 280 tokens
SYSTEM_PROMPT_AFTER = """
TechCorp support agent. Be helpful, concise.
Rules: verify account before changes, escalate billing disputes,
link to docs when available. Format: numbered steps for how-tos.
Tone: professional, warm. Never share internal processes.
"""
```

We also trimmed conversation history to the last 5 turns instead of sending the full transcript. Most support queries don't need 20 turns of context.
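The history trimming is a small helper applied to the message list before each call. A minimal sketch, assuming the standard OpenAI-style chat message format and counting one user/assistant exchange as a turn:

```python
def trim_history(messages, max_turns=5):
    """Keep the system prompt plus the last `max_turns` user/assistant
    exchanges; anything older is dropped before the API call."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    # One turn = one user message plus one assistant reply (2 messages)
    return system + dialogue[-max_turns * 2:]

history = [{"role": "system", "content": "TechCorp support agent."}]
for i in range(20):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history)
print(len(trimmed))  # 11: system prompt + last 5 exchanges
```

Because input tokens are billed on every request, cutting a 20-turn transcript to 5 turns saves on every single call, not just the first.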
Technique 4: Batch Processing (Saved 5%)
Non-real-time tasks (email classification, ticket routing) moved to OpenAI’s Batch API at 50% discount:
```python
# Instead of real-time classification, batch process every 5 minutes
import json
from openai import OpenAI

client = OpenAI()

batch_requests = []
for ticket in unprocessed_tickets:  # tickets queued since the last run
    batch_requests.append({
        "custom_id": ticket.id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify this support ticket."},
                {"role": "user", "content": ticket.text}
            ]
        }
    })

# The Batch API takes a JSONL file, not an inline list of requests
with open("batch.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")

# Submit batch — 50% cheaper, results within 24h
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
```

The Results
Before optimization:
- Model: GPT-4o for everything
- Monthly queries: 450,000
- Monthly cost: $45,000
- Avg cost/query: $0.10

After optimization:
- Model routing: -40% ($27,000)
- Semantic caching: -25% ($20,250)
- Prompt compression: -10% ($18,225)
- Batch processing: -5% ($9,000)

Final monthly cost: $9,000
Avg cost/query: $0.02

FinOps for AI
AI cost optimization follows the same principles as cloud FinOps, which I cover for Kubernetes workloads at Kubernetes Recipes. The same mindset applies:
- Visibility — know what you’re spending, per feature
- Optimization — right-size models like you right-size instances
- Governance — set budgets and alerts before costs spiral
For automating cost monitoring and alerts across your AI infrastructure, Ansible playbooks can deploy Prometheus alerting rules that trigger when daily AI spend exceeds thresholds. I detail this pattern at Ansible Pilot.
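As a starting point for the visibility piece, even a minimal in-process tracker can attribute spend per feature and flag budget overruns before the invoice does. The feature names, prices, and threshold below are illustrative, not figures from the client engagement:

```python
from collections import defaultdict

# Illustrative per-1k-token prices; check your provider's current pricing
PRICE_PER_1K = {"gpt-4o-mini": 0.00075, "gpt-4o": 0.0125}

class SpendTracker:
    def __init__(self, daily_budget_usd):
        self.daily_budget = daily_budget_usd
        self.spend_by_feature = defaultdict(float)

    def record(self, feature, model, tokens):
        # Attribute the cost of each call to the feature that made it
        cost = tokens / 1000 * PRICE_PER_1K[model]
        self.spend_by_feature[feature] += cost
        return cost

    def over_budget(self):
        # Hook this check into your alerting (e.g. a Prometheus gauge)
        return sum(self.spend_by_feature.values()) > self.daily_budget

tracker = SpendTracker(daily_budget_usd=100.0)
tracker.record("support-chat", "gpt-4o", 2_000_000)
tracker.record("ticket-routing", "gpt-4o-mini", 500_000)
print(tracker.spend_by_feature)
print(tracker.over_budget())  # False
```

Per-feature attribution is the part most teams skip, and it is exactly what tells you which feature to route to a cheaper model first.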
Start Here
If your AI bill is growing faster than your revenue:
- Add model routing (1 day of work, biggest impact)
- Add semantic caching (2-3 days, compound returns)
- Compress prompts (1 day, quick win)
- Move batch workloads to Batch API (half a day)
Total effort: one sprint. ROI: usually 50-80% cost reduction. I’ve done this five times now — the results are consistently dramatic.
