RAG Pipeline on Kubernetes: Production Architecture Guide

Deploy a production Retrieval-Augmented Generation (RAG) pipeline on Kubernetes. This guide covers vector databases, embedding services, chunking strategies, and orchestration with Argo Workflows.

Luca Berton

Production RAG Architecture

┌──────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                    │
│                                                          │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────┐    │
│  │ Ingress  │───▶│ API      │───▶│ Orchestrator     │    │
│  │ (Kong)   │    │ Gateway  │    │ (Argo Workflows) │    │
│  └──────────┘    └──────────┘    └─────────┬────────┘    │
│                                            │             │
│         ┌─────────────────┬────────────────┤             │
│         ▼                 ▼                ▼             │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐      │
│  │  Embedding   │  │  Vector DB   │  │    LLM     │      │
│  │  Service     │  │  (Qdrant)    │  │   (vLLM)   │      │
│  │  (TEI)       │  │              │  │            │      │
│  └──────────────┘  └──────────────┘  └────────────┘      │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐      │
│  │  Document    │  │  Reranker    │  │   Redis    │      │
│  │  Processor   │  │  (cross-enc) │  │  (cache)   │      │
│  └──────────────┘  └──────────────┘  └────────────┘      │
└──────────────────────────────────────────────────────────┘

Component Breakdown

1. Document Ingestion Pipeline

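apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: document-ingestion
spec:
  entrypoint: ingest
  # Shared scratch volume so each step can read the previous step's output under /data.
  # The 10Gi size is a placeholder; size it for your corpus.
  volumeClaimTemplates:
    - metadata:
        name: workdir
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
  templates:
    # Steps run sequentially: extract -> chunk -> embed -> store
    - name: ingest
      steps:
        - - name: extract
            template: extract-text
        - - name: chunk
            template: chunk-documents
        - - name: embed
            template: generate-embeddings
        - - name: store
            template: store-vectors

    - name: extract-text
      container:
        image: unstructured-io/unstructured:latest
        args: ["--input-dir", "/data/raw", "--output-dir", "/data/extracted"]
        volumeMounts:
          - name: workdir
            mountPath: /data

    - name: chunk-documents
      container:
        image: myregistry/chunker:latest
        env:
          - name: CHUNK_SIZE
            value: "512"
          - name: CHUNK_OVERLAP
            value: "50"
          - name: STRATEGY
            value: "semantic"  # alternatives: recursive, fixed-size (see Chunking Strategies)
        volumeMounts:
          - name: workdir
            mountPath: /data

    # Note: TEI is a long-running server. A batch step would normally call the
    # embedding-service Deployment from section 2 rather than start its own instance.
    - name: generate-embeddings
      container:
        image: ghcr.io/huggingface/text-embeddings-inference:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
          - name: workdir
            mountPath: /data

    - name: store-vectors
      container:
        image: myregistry/vector-store:latest
        env:
          - name: QDRANT_URL
            value: "http://qdrant.vector-db:6333"
        volumeMounts:
          - name: workdir
            mountPath: /data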
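
The chunker image in the workflow is a placeholder, so its internals are not shown. As a minimal sketch of what a fixed-size strategy with CHUNK_SIZE=512 and CHUNK_OVERLAP=50 could look like, the function below slides a window over the text. It assumes whitespace-separated words stand in for tokens (a real chunker would count tokens with the embedding model's tokenizer), and the name chunk_fixed is hypothetical.

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` words.

    Each window shares `overlap` words with the previous one.
    Words are an assumed stand-in for model tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of embedding some text twice.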

2. Embedding Service (TEI)

Hugging Face Text Embeddings Inference (TEI) serves embedding models with token-based dynamic batching, which makes it well suited to high-throughput workloads:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-service
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: embedding-service
  template:
    metadata:
      labels:
        app: embedding-service
    spec:
      containers:
        - name: tei
          image: ghcr.io/huggingface/text-embeddings-inference:1.5
          args:
            - "--model-id"
            - "BAAI/bge-large-en-v1.5"
            - "--max-batch-tokens"
            - "16384"
            - "--max-concurrent-requests"
            - "512"
          ports:
            - containerPort: 80
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              memory: "4Gi"
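
The --max-batch-tokens flag bounds how much the server packs into one batch. On the client side, bulk ingestion still has to split a large corpus into requests of manageable size. The sketch below shows one way to do that. The per-request limit of 32 inputs is an assumption (check the limit configured on your TEI deployment), and send is a hypothetical callable that performs the HTTP POST to /embed and returns one vector per input.

```python
from typing import Callable, Iterator

Vector = list[float]


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    texts: list[str],
    send: Callable[[list[str]], list[Vector]],
    batch_size: int = 32,  # assumed per-request input limit
) -> list[Vector]:
    """Embed `texts` in order, issuing one request per batch.

    `send` is a hypothetical transport function: it takes a list of strings
    and returns one vector per string.
    """
    vectors: list[Vector] = []
    for batch in batched(texts, batch_size):
        result = send(batch)
        if len(result) != len(batch):
            raise RuntimeError("service returned a different number of vectors than inputs")
        vectors.extend(result)
    return vectors
```

Passing the transport in as a function keeps the batching logic testable without a running GPU service.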

3. Retrieval + Reranking Service

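# retrieval_service.py
import httpx
from fastapi import FastAPI
from qdrant_client import QdrantClient
from sentence_transformers import CrossEncoder

app = FastAPI()
qdrant = QdrantClient(host="qdrant.vector-db.svc", port=6333)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

# Assumption: a Service named embedding-service exposes the TEI Deployment on port 80
TEI_EMBED_URL = "http://embedding-service/embed"


async def embed(text: str) -> list[float]:
    # TEI's /embed endpoint returns one vector per input
    async with httpx.AsyncClient(timeout=10.0) as client:
        resp = await client.post(TEI_EMBED_URL, json={"inputs": text})
        resp.raise_for_status()
        return resp.json()[0]


@app.post("/retrieve")
async def retrieve(query: str, top_k: int = 10, rerank_top_k: int = 5):
    # Step 1: Embed query
    query_embedding = await embed(query)

    # Step 2: Vector search (fast, approximate)
    results = qdrant.search(
        collection_name="documents",
        query_vector=query_embedding,
        limit=top_k,
        score_threshold=0.7,
    )
    if not results:
        return []  # nothing passed the score threshold

    # Step 3: Rerank (precise, cross-encoder)
    passages = [r.payload["text"] for r in results]
    scores = reranker.predict([(query, p) for p in passages])

    # Step 4: Return top reranked results
    # Scores are NumPy floats; cast them so the response is JSON-serializable
    ranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
    return [{"text": r.payload["text"], "score": float(s)} for r, s in ranked[:rerank_top_k]]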

4. Generation with Retrieved Context

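from openai import AsyncOpenAI

# vLLM exposes an OpenAI-compatible API. The Service URL is an assumption;
# vLLM ignores the API key unless one is configured on the server.
vllm_client = AsyncOpenAI(base_url="http://vllm.ai-inference.svc:8000/v1", api_key="EMPTY")


@app.post("/generate")
async def generate(query: str):
    # Retrieve relevant context
    context_docs = await retrieve(query, top_k=10, rerank_top_k=5)
    context = "\n\n".join([d["text"] for d in context_docs])

    # Generate with LLM
    response = await vllm_client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": query}
        ],
        temperature=0.1,  # low temperature keeps answers close to the retrieved context
        max_tokens=1024,
    )
    return {"answer": response.choices[0].message.content, "sources": context_docs}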
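
Joining every retrieved passage works until the combined context outgrows the model's window or the answer needs to point back at its sources. The sketch below numbers the passages and stops once a budget is reached. It assumes a character budget as a rough stand-in for a token budget, and build_context and max_chars are hypothetical names, not part of the service above.

```python
def build_context(docs: list[dict], max_chars: int = 8000) -> tuple[str, list[dict]]:
    """Number passages as [1], [2], ... and keep the top-ranked ones that fit.

    `docs` is expected in ranked order, each with a "text" key, like the
    output of /retrieve. Returns the context string and the passages used.
    """
    parts: list[str] = []
    used: list[dict] = []
    total = 0
    for doc in docs:
        entry = f"[{len(used) + 1}] {doc['text']}"
        cost = len(entry) + (2 if parts else 0)  # 2 accounts for the "\n\n" separator
        if total + cost > max_chars:
            break  # stop at the first passage that does not fit, preserving rank order
        parts.append(entry)
        used.append(doc)
        total += cost
    return "\n\n".join(parts), used
```

Returning the used passages alongside the string lets the endpoint report exactly the sources the model saw, rather than everything that was retrieved.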

Chunking Strategies

| Strategy | Best For | Chunk Size |
|---|---|---|
| Fixed-size | Simple documents | 512 tokens |
| Recursive | Structured text (markdown, code) | 512-1024 tokens |
| Semantic | Complex documents | Variable |
| Sentence-window | Q&A systems | 3-5 sentences |
| Parent-child | Hierarchical docs | Parent: 2048, Child: 256 |
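
As an illustration of the sentence-window row, the sketch below indexes each sentence on its own and stores the surrounding sentences as the text handed to the LLM. It is a simplified sketch: the regex splitter is deliberately naive (it will mis-split abbreviations such as "e.g."), and the function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_windows(text: str, radius: int = 1) -> list[dict]:
    """For each sentence, keep the sentence plus `radius` neighbours on each side.

    radius=1 yields windows of up to 3 sentences, radius=2 up to 5.
    """
    sentences = split_sentences(text)
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - radius)
        hi = min(len(sentences), i + radius + 1)
        records.append({"sentence": sentence, "window": " ".join(sentences[lo:hi])})
    return records
```

The single sentence is what gets embedded, which keeps the vector focused, while the window supplies enough surrounding text for the answer to make sense.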

Semantic Chunking Example

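# SemanticChunker lives in langchain_experimental, not langchain.text_splitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
# Split where the distance between consecutive sentences exceeds the 95th percentile
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)

# document_text is the raw text of one document (a str)
chunks = splitter.split_text(document_text)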
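
To make the percentile setting concrete: the splitter compares consecutive sentences and starts a new chunk wherever the embedding distance is unusually large relative to the rest of the document. The following is a simplified, dependency-free sketch of that idea, not LangChain's implementation. The helper names are hypothetical and the distances are assumed to be precomputed.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Linearly interpolated percentile, with q in [0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def split_at_breakpoints(sentences: list[str], distances: list[float], q: float = 95) -> list[str]:
    """Start a new chunk after sentence i when distances[i] exceeds the q-th percentile.

    distances[i] is the embedding distance between sentences[i] and sentences[i + 1].
    """
    if len(distances) != max(len(sentences) - 1, 0):
        raise ValueError("need exactly one distance per adjacent sentence pair")
    if not distances:
        return [" ".join(sentences)] if sentences else []

    threshold = percentile(distances, q)
    chunks: list[str] = []
    current = [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

A higher percentile produces fewer, larger chunks because only the sharpest topic shifts count as breakpoints.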

Caching Layer

Redis can cache embeddings, retrieval results, and LLM responses. The example below deploys Redis and caches retrieval results:

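apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          # Keep maxmemory below the container limit so Redis evicts keys before it is OOM-killed
          args: ["--maxmemory", "4gb", "--maxmemory-policy", "allkeys-lru"]
          resources:
            limits:
              memory: "5Gi"

The retrieval call can then be wrapped with a cache lookup:

import hashlib
import json

import redis.asyncio as redis

# Assumes a Service named redis-cache in the ai-inference namespace
cache = redis.Redis(host="redis-cache.ai-inference.svc")

async def cached_retrieve(query: str):
    key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
    cached = await cache.get(key)
    if cached is not None:
        return json.loads(cached)
    result = await retrieve(query)  # retrieve() is async, so it must be awaited
    await cache.setex(key, 3600, json.dumps(result))  # 1 hour TTL
    return result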
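
Hashing only the query text means that two requests with different top_k values, or requests made before and after the index is re-embedded, share one cache entry. The sketch below folds those settings into the key. The cache_key function and the index_version parameter are hypothetical names, and lowercasing the query is a design choice that may not suit case-sensitive corpora.

```python
import hashlib
import json


def cache_key(
    query: str,
    *,
    top_k: int = 10,
    rerank_top_k: int = 5,
    index_version: str = "v1",
) -> str:
    """Build a deterministic Redis key from the normalized query and retrieval settings."""
    payload = {
        "q": " ".join(query.lower().split()),  # collapse case and whitespace differences
        "top_k": top_k,
        "rerank_top_k": rerank_top_k,
        "index": index_version,  # bump when documents are re-embedded or the model changes
    }
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()
    return f"rag:{index_version}:{digest}"
```

Putting the index version in the key prefix also makes it possible to drop a whole generation of stale entries by scanning for that prefix.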

Scaling Considerations

| Component | Scaling Strategy | Bottleneck |
|---|---|---|
| Embedding service | Horizontal (KEDA on queue depth) | GPU compute |
| Vector DB | Sharding + replicas | Memory + disk I/O |
| Reranker | Horizontal (CPU or GPU) | Cross-encoder inference |
| LLM | Horizontal (KEDA on TTFT) | GPU memory |
| Cache | Redis Cluster | Memory |
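
For the queue-driven rows, KEDA feeds its metrics to the Horizontal Pod Autoscaler, which in the average-value case divides the observed metric by the per-replica target, rounds up, and clamps to the configured bounds. The sketch below reproduces that arithmetic for capacity planning. The function name and default bounds are illustrative assumptions, not measurements or recommended settings.

```python
import math


def desired_replicas(
    metric_total: float,
    target_per_replica: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Replica count for an average-value target: ceil(total / target), clamped to bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(metric_total / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))
```

For example, a queue depth of 950 with a target of 100 items per replica asks for 10 replicas, while an empty queue falls back to the configured minimum.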

Production Checklist

  • Document versioning (re-embed on update, not append)
  • Metadata filtering (date, source, category)
  • Hybrid search (vector + BM25 keyword)
  • Citation extraction (return source documents)
  • Guardrails (content filtering, hallucination detection)
  • A/B testing (chunking strategies, embedding models)
  • Monitoring (retrieval precision, generation quality)
  • Cost tracking (embedding + search + generation per query)
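
For the hybrid-search item, one common way to merge a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF), which needs only rank positions rather than comparable scores. A minimal sketch follows. The document IDs are made up, and k=60 is the conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a vector search and a BM25 keyword search
vector_hits = ["doc-a", "doc-b", "doc-c"]
bm25_hits = ["doc-b", "doc-d", "doc-a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that appear in both lists rise to the top, and the fused list can then be passed to the cross-encoder reranker in place of the pure vector results.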
