The $45,000/Month Problem
A client called me after their second month in production. Their AI-powered customer support tool was working great — customers loved it. But the OpenAI bill was $45,000/month and growing 15% week over week.
Here’s how we cut it to $9,000 without degrading quality.
Technique 1: Model Routing (Saved 40%)
Not every query needs GPT-4o. A simple classifier routes queries to the cheapest capable model:
```python
class ModelRouter:
    MODELS = {
        'simple': {'name': 'gpt-4o-mini', 'cost_1k': 0.00075},
        'medium': {'name': 'gpt-4o', 'cost_1k': 0.0125},
        'complex': {'name': 'claude-sonnet-4', 'cost_1k': 0.018},
    }

    async def route(self, query: str, context: dict) -> str:
        # Simple heuristics first (free)
        if len(query.split()) < 20 and not context.get('requires_reasoning'):
            return 'simple'
        if any(kw in query.lower() for kw in ['compare', 'analyze', 'explain why']):
            return 'complex'
        # For ambiguous cases, use mini to classify
        classification = await self.classify_complexity(query)
        return classification
```

Result: 65% of queries routed to mini, 30% to GPT-4o, and 5% to Claude for complex reasoning. Average cost per query dropped from $0.08 to $0.03.
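To see how the heuristic tiers fire in practice, here is a self-contained sketch. The `classify_complexity` fallback is stubbed to return `'medium'`, since the real version (a cheap classification call to gpt-4o-mini) is outside this snippet:

```python
import asyncio

class ModelRouter:
    async def classify_complexity(self, query: str) -> str:
        # Stub: in production this would ask gpt-4o-mini to label
        # the query and parse the label it returns.
        return 'medium'

    async def route(self, query: str, context: dict) -> str:
        # Short queries with no reasoning flag go to the cheapest model
        if len(query.split()) < 20 and not context.get('requires_reasoning'):
            return 'simple'
        # Keyword triggers send the query to the strongest model
        if any(kw in query.lower() for kw in ['compare', 'analyze', 'explain why']):
            return 'complex'
        # Everything else falls through to the mini classifier
        return await self.classify_complexity(query)

router = ModelRouter()
print(asyncio.run(router.route("How do I reset my password?", {})))  # simple
print(asyncio.run(router.route("Compare plan A and plan B",
                               {"requires_reasoning": True})))       # complex
```

Note that the checks run cheapest-first: a short query exits before the keyword scan, so flagging `requires_reasoning` in context matters for short analytical queries.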
Technique 2: Semantic Caching (Saved 25%)
Many customer support queries are semantically identical. “How do I reset my password?” and “I forgot my password, how to reset?” should return the same cached response:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = similarity_threshold
        self.cache = {}  # In production, use Redis

    async def get_or_generate(self, query, generate_fn):
        query_embedding = self.encoder.encode(query)
        # Check cache for similar queries (linear scan; use a vector
        # index once the cache grows large)
        for cached_embedding, cached_response in self.cache.values():
            if cosine_similarity(query_embedding, cached_embedding) > self.threshold:
                return cached_response  # Cache hit — $0 cost
        # Cache miss — generate and store
        response = await generate_fn(query)
        self.cache[query] = (query_embedding, response)
        return response
```

Cache hit rate after 2 weeks: 38%. That's 38% of queries that cost zero tokens.
Technique 3: Prompt Compression (Saved 10%)
System prompts and context were bloated. We compressed without losing quality:
```python
# Before: 1,200 tokens
SYSTEM_PROMPT_BEFORE = """
You are a helpful customer support agent for TechCorp.
You should always be polite, professional, and empathetic.
When helping customers, make sure to:
1. Greet them warmly
2. Understand their issue fully before responding
3. Provide step-by-step instructions when applicable
... (800 more tokens of instructions)
"""

# After: 280 tokens
SYSTEM_PROMPT_AFTER = """
TechCorp support agent. Be helpful, concise.
Rules: verify account before changes, escalate billing disputes,
link to docs when available. Format: numbered steps for how-tos.
Tone: professional, warm. Never share internal processes.
"""
```

We also trimmed conversation history to the last 5 turns instead of sending the full transcript. Most support queries don't need 20 turns of context.
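The history trimming is a small helper applied to the message list before each call. A minimal sketch, assuming the standard OpenAI-style chat message format and counting one user/assistant exchange as a turn:

```python
def trim_history(messages, max_turns=5):
    """Keep the system prompt plus the last `max_turns` user/assistant
    exchanges; anything older is dropped before the API call."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    # One turn = one user message plus one assistant reply (2 messages)
    return system + dialogue[-max_turns * 2:]

history = [{"role": "system", "content": "TechCorp support agent."}]
for i in range(20):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history)
print(len(trimmed))  # 11: system prompt + last 5 exchanges
```

Because input tokens are billed on every request, cutting a 20-turn transcript to 5 turns saves on every single call, not just the first.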
Technique 4: Batch Processing (Saved 5%)
Non-real-time tasks (email classification, ticket routing) moved to OpenAI’s Batch API at 50% discount:
```python
# Instead of real-time classification, batch process every 5 minutes
import json
from openai import OpenAI

client = OpenAI()

batch_requests = []
for ticket in unprocessed_tickets:  # tickets queued since the last run
    batch_requests.append({
        "custom_id": ticket.id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify this support ticket."},
                {"role": "user", "content": ticket.text}
            ]
        }
    })

# The Batch API takes a JSONL file, not an inline list of requests
with open("batch.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")

# Submit batch — 50% cheaper, results within 24h
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
```

The Results
Before optimization:
- Model: GPT-4o for everything
- Monthly queries: 450,000
- Monthly cost: $45,000
- Avg cost/query: $0.10

After optimization:
- Model routing: -40% ($27,000)
- Semantic caching: -25% ($20,250)
- Prompt compression: -10% ($18,225)
- Batch processing: -5% ($9,000)

Final monthly cost: $9,000
Avg cost/query: $0.02

FinOps for AI
AI cost optimization follows the same principles as cloud FinOps, which I cover for Kubernetes workloads at Kubernetes Recipes. The same mindset applies:
- Visibility — know what you’re spending, per feature
- Optimization — right-size models like you right-size instances
- Governance — set budgets and alerts before costs spiral
For automating cost monitoring and alerts across your AI infrastructure, Ansible playbooks can deploy Prometheus alerting rules that trigger when daily AI spend exceeds thresholds. I detail this pattern at Ansible Pilot.
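As a starting point for the visibility piece, even a minimal in-process tracker can attribute spend per feature and flag budget overruns before the invoice does. The feature names, prices, and threshold below are illustrative, not figures from the client engagement:

```python
from collections import defaultdict

# Illustrative per-1k-token prices; check your provider's current pricing
PRICE_PER_1K = {"gpt-4o-mini": 0.00075, "gpt-4o": 0.0125}

class SpendTracker:
    def __init__(self, daily_budget_usd):
        self.daily_budget = daily_budget_usd
        self.spend_by_feature = defaultdict(float)

    def record(self, feature, model, tokens):
        # Attribute the cost of each call to the feature that made it
        cost = tokens / 1000 * PRICE_PER_1K[model]
        self.spend_by_feature[feature] += cost
        return cost

    def over_budget(self):
        # Hook this check into your alerting (e.g. a Prometheus gauge)
        return sum(self.spend_by_feature.values()) > self.daily_budget

tracker = SpendTracker(daily_budget_usd=100.0)
tracker.record("support-chat", "gpt-4o", 2_000_000)
tracker.record("ticket-routing", "gpt-4o-mini", 500_000)
print(tracker.spend_by_feature)
print(tracker.over_budget())  # False
```

Per-feature attribution is the part most teams skip, and it is exactly what tells you which feature to route to a cheaper model first.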
Start Here
If your AI bill is growing faster than your revenue:
- Add model routing (1 day of work, biggest impact)
- Add semantic caching (2-3 days, compound returns)
- Compress prompts (1 day, quick win)
- Move batch workloads to Batch API (half a day)
Total effort: one sprint. ROI: usually 50-80% cost reduction. I’ve done this five times now — the results are consistently dramatic.
