
Fine-Tuning vs RAG vs Prompt Engineering: The Decision Framework

Luca Berton 2 min read
#fine-tuning#rag#prompt-engineering#llm#ai-strategy#decision-framework

The Most Common AI Architecture Question

“Should we fine-tune, build RAG, or just prompt better?” I get asked this in every AI consulting engagement. The answer depends on your data, your budget, and your accuracy requirements.

The Quick Decision Tree

Is your knowledge in public documentation?
  YES → Use Context7 or similar. Done.
  NO ↓

Does the knowledge change frequently (weekly+)?
  YES → RAG
  NO ↓

Do you need the model to behave differently (tone, format, domain expertise)?
  YES → Fine-tuning
  NO ↓

Is accuracy critical (>95% required)?
  YES → RAG + fine-tuning (hybrid)
  NO → Prompt engineering
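The tree above can be sketched as a small Python function (the function and flag names are illustrative, not from any library):

```python
def choose_approach(public_docs: bool, changes_weekly: bool,
                    behavior_change: bool, accuracy_critical: bool) -> str:
    """Walk the decision tree top to bottom, returning the first match."""
    if public_docs:
        return "context7"           # knowledge already in public docs
    if changes_weekly:
        return "rag"                # dynamic knowledge -> retrieval
    if behavior_change:
        return "fine-tuning"        # tone / format / domain behavior
    if accuracy_critical:
        return "rag + fine-tuning"  # hybrid for >95% accuracy
    return "prompt engineering"
```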

Detailed Comparison

Prompt Engineering

What it is: Crafting system prompts, few-shot examples, and structured instructions.

Best for:

  • General-purpose tasks
  • Prototyping and MVPs
  • Tasks where the base model already knows the domain

Cost: $0 upfront, pay-per-token at inference

Example:

SYSTEM_PROMPT = """You are a Kubernetes troubleshooting assistant.
When diagnosing issues:
1. Check the pod status first
2. Review events and logs
3. Suggest the most likely root cause
4. Provide the exact kubectl command to fix it

Format: use markdown with code blocks for commands."""

Limitation: Context window is finite. You can’t stuff an entire knowledge base into a prompt.
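A quick sanity check for that limitation: before building anything fancier, estimate whether your knowledge actually fits in the context window. This sketch uses the common rule of thumb of roughly 4 characters per token for English text; the window and reserve sizes are illustrative defaults:

```python
def fits_in_context(text: str, context_window: int = 128_000,
                    reserved_for_output: int = 4_096) -> bool:
    """Rough feasibility check: ~4 characters per token for English text."""
    estimated_tokens = len(text) // 4
    return estimated_tokens <= context_window - reserved_for_output
```

If this returns False for your knowledge base, prompt stuffing is off the table and the RAG branch of the decision tree applies.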

RAG (Retrieval-Augmented Generation)

What it is: Retrieve relevant documents at query time and include them in the prompt.

Best for:

  • Large, frequently updated knowledge bases
  • When you need citations/sources
  • Compliance requirements (traceable answers)

Cost: Vector DB hosting ($50-500/month), embedding costs, engineering time

Architecture:

# Sketch: `embed`, `vector_db`, and `llm` stand in for your embedding
# service, vector store client, and LLM client respectively.
async def rag_query(question: str):
    # 1. Embed the question
    query_embedding = await embed(question)

    # 2. Retrieve the most relevant chunks
    chunks = await vector_db.search(query_embedding, top_k=5)

    # 3. Build the context block
    context = "\n\n".join(c.text for c in chunks)

    # 4. Generate an answer grounded in the retrieved context
    response = await llm.generate(
        system="Answer based on the provided context. Cite sources.",
        user=f"Context:\n{context}\n\nQuestion: {question}",
    )
    return response, chunks  # Return sources for transparency

I manage the infrastructure for these pipelines using Ansible — automating the vector DB deployment, embedding service, and retrieval API. See Ansible Pilot for the infrastructure-as-code patterns.

Fine-Tuning

What it is: Training the model on your specific data to change its behavior permanently.

Best for:

  • Domain-specific language (legal, medical, financial)
  • Consistent output format
  • Reducing token usage (fine-tuned models need shorter prompts)
  • When RAG retrieval quality is insufficient

Cost: $50-5,000 per training run, requires labeled data

Example (OpenAI fine-tuning format):

{"messages": [{"role": "system", "content": "You are a K8s expert."}, {"role": "user", "content": "Pod stuck in CrashLoopBackOff"}, {"role": "assistant", "content": "Check the container logs with `kubectl logs <pod> --previous`. Common causes: 1) Application crash on startup..."}]}
{"messages": [{"role": "system", "content": "You are a K8s expert."}, {"role": "user", "content": "ImagePullBackOff error"}, {"role": "assistant", "content": "The cluster can't pull the container image. Verify: 1) Image name and tag are correct..."}]}
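Before paying for a training run, it is worth linting the JSONL file locally. A minimal validator using only the standard library (the schema checks mirror the chat format shown above; the function name is mine, not an OpenAI API):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_training_lines(lines):
    """Lint JSONL fine-tuning data: each non-empty line must be a JSON
    object with a non-empty 'messages' list of {role, content} pairs."""
    errors = []
    for lineno, line in enumerate(lines, 1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {lineno}: not valid JSON")
            continue
        messages = record.get("messages")
        if not isinstance(messages, list) or not messages:
            errors.append(f"line {lineno}: missing 'messages' list")
            continue
        for msg in messages:
            if not isinstance(msg, dict) or msg.get("role") not in VALID_ROLES \
                    or not msg.get("content"):
                errors.append(f"line {lineno}: bad message entry")
    return errors
```

Run it over a file with `validate_training_lines(open("train.jsonl"))`; an empty list means the file passed.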

The Hybrid Approach (What I Actually Recommend)

In practice, the best production systems combine all three:

Layer 1: Fine-tuned base model
  → Knows your domain, speaks your language

Layer 2: RAG for dynamic knowledge
  → Retrieves current documentation, tickets, runbooks

Layer 3: Prompt engineering for task-specific behavior
  → Structures output format, enforces constraints

For Kubernetes-related AI assistants, I’ve found that a fine-tuned model + RAG over the latest K8s docs (via Kubernetes Recipes) + structured prompts delivers the best results.
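The three layers meet at the prompt-assembly step: the fine-tuned model (Layer 1) receives retrieved context (Layer 2) wrapped in structured instructions (Layer 3). A sketch, with illustrative function and parameter names:

```python
def build_hybrid_prompt(task_instructions: str, retrieved_chunks: list[str],
                        question: str) -> dict[str, str]:
    """Assemble the system/user pair sent to the fine-tuned model."""
    # Layer 2: number the retrieved chunks so answers can cite them
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(retrieved_chunks, 1))
    # Layer 3: structured instructions and constraints
    system = (
        f"{task_instructions}\n"
        "Answer only from the numbered context. Cite sources like [1]."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return {"system": system, "user": user}
```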

Cost Comparison (Annual, 10K Queries/Day)

                    Prompt Only    RAG           Fine-Tune     Hybrid
Upfront             $0             $5,000        $2,000        $7,000
Monthly infra       $0             $300          $0            $300
Token cost/month    $1,200         $800          $600          $500
Annual total        $14,400        $18,200       $9,200        $16,600
Accuracy            75-85%         85-92%        88-93%        93-97%
Maintenance         Low            Medium        Low           Medium

The hybrid approach costs more than fine-tuning alone but delivers significantly better accuracy. For enterprise clients where accuracy matters, it’s the right trade-off.
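Each annual total follows the same formula: the one-time upfront cost plus twelve months of infrastructure and token spend. Spelled out:

```python
def annual_cost(upfront: int, monthly_infra: int, monthly_tokens: int) -> int:
    """Annual total = one-time upfront + 12 months of infra and token spend."""
    return upfront + 12 * (monthly_infra + monthly_tokens)
```

For example, `annual_cost(0, 0, 1_200)` gives the prompt-only total of $14,400, and `annual_cost(2_000, 0, 600)` gives the fine-tune total of $9,200.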

When to Start Simple

My advice: always start with prompt engineering. Build the MVP, measure accuracy, identify failure cases. Only then decide:

  • Failures due to missing knowledge → Add RAG
  • Failures due to wrong behavior/format → Add fine-tuning
  • Failures due to outdated library docs → Add Context7

Don’t over-engineer from day one. Let the failure modes guide your architecture.


Luca Berton

AI & Cloud Advisor with 18+ years experience. Author of 8 technical books, creator of Ansible Pilot, and instructor at CopyPasteLearn Academy. Speaker at KubeCon EU & Red Hat Summit 2026.
