RAG Pipeline on Kubernetes: Production Architecture Guide

Production RAG Architecture

┌─────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                    │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────┐  │
│  │  Ingress │───▶│ API      │───▶│ Orchestrator     │  │
│  │  (Kong)  │    │ Gateway  │    │ (Argo Workflows) │  │
│  └──────────┘    └──────────┘    └────────┬─────────┘  │
│                                           │             │
│         ┌─────────────────────────────────┼──────┐      │
│         │                                 │      │      │
│         ▼                                 ▼      ▼      │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐   │
│  │  Embedding   │  │   Vector DB  │  │    LLM     │   │
│  │  Service     │  │   (Qdrant)   │  │  (vLLM)    │   │
│  │  (TEI)       │  │              │  │            │   │
│  └──────────────┘  └──────────────┘  └────────────┘   │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐   │
│  │  Document    │  │  Reranker    │  │   Redis    │   │
│  │  Processor   │  │  (cross-enc) │  │  (cache)   │   │
│  └──────────────┘  └──────────────┘  └────────────┘   │
└─────────────────────────────────────────────────────────┘

Component Breakdown

1. Document Ingestion Pipeline

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: document-ingestion
spec:
  entrypoint: ingest
  templates:
    - name: ingest
      steps:
        - - name: extract
            template: extract-text
        - - name: chunk
            template: chunk-documents
        - - name: embed
            template: generate-embeddings
        - - name: store
            template: store-vectors

    - name: extract-text
      container:
        image: unstructured-io/unstructured:latest
        args: ["--input-dir", "/data/raw", "--output-dir", "/data/extracted"]

    - name: chunk-documents
      container:
        image: myregistry/chunker:latest
        env:
          - name: CHUNK_SIZE
            value: "512"
          - name: CHUNK_OVERLAP
            value: "50"
          - name: STRATEGY
            value: "semantic"  # semantic > fixed-size > recursive

    - name: generate-embeddings
      container:
        image: ghcr.io/huggingface/text-embeddings-inference:latest
        resources:
          limits:
            nvidia.com/gpu: "1"

    - name: store-vectors
      container:
        image: myregistry/vector-store:latest
        env:
          - name: QDRANT_URL
            value: "http://qdrant.vector-db:6333"
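
The chunk-documents step is configured only through environment variables, so its logic is not shown here. As a rough illustration of what CHUNK_SIZE and CHUNK_OVERLAP usually mean for a fixed-size strategy, here is a minimal sketch. It assumes whitespace-separated words stand in for tokens, and `chunk_fixed` is a hypothetical helper, not the code inside myregistry/chunker.

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words; neighbours share `overlap` words.

    Assumption: words approximate tokens. A real chunker would count tokens
    with the embedding model's tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the values from the workflow (512 and 50), each chunk repeats the last 50 words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.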

2. Embedding Service (TEI)

Hugging Face Text Embeddings Inference (TEI) serves embedding models over HTTP and is optimized for throughput:

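apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-service
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels.
  # The label value is an illustrative choice.
  selector:
    matchLabels:
      app: embedding-service
  template:
    metadata:
      labels:
        app: embedding-service
    spec:
      containers:
        - name: tei
          image: ghcr.io/huggingface/text-embeddings-inference:1.5
          args:
            - "--model-id"
            - "BAAI/bge-large-en-v1.5"
            - "--max-batch-tokens"
            - "16384"
            - "--max-concurrent-requests"
            - "512"
          ports:
            - containerPort: 80
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              memory: "4Gi"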
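
TEI batches requests on the server, but a client that embeds many chunks still has to split them into requests. The sketch below builds JSON payloads for TEI's /embed endpoint. It assumes a per-request limit of 32 inputs, which is TEI's documented default for the client batch size at the time of writing; check the limit configured on your deployment. `batch_inputs` is a hypothetical helper.

```python
from typing import Iterator


def batch_inputs(texts: list[str], max_batch_size: int = 32) -> Iterator[dict]:
    """Yield /embed request bodies holding at most max_batch_size texts each.

    Assumption: 32 matches the server's client batch limit; adjust if the
    deployment is configured differently.
    """
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    for start in range(0, len(texts), max_batch_size):
        yield {"inputs": texts[start:start + max_batch_size]}
```

Each yielded dictionary can be posted as the JSON body of one request; the response contains one vector per input, in the same order.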

3. Retrieval + Reranking Service

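# retrieval_service.py
import asyncio

import httpx
from fastapi import FastAPI
from qdrant_client import QdrantClient
from sentence_transformers import CrossEncoder

app = FastAPI()
qdrant = QdrantClient(host="qdrant.vector-db.svc", port=6333)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

# Assumption: a Service named "embedding-service" exposes the TEI pods on port 80.
TEI_URL = "http://embedding-service/embed"

async def embed(text: str) -> list[float]:
    # TEI's /embed endpoint returns one vector per input.
    async with httpx.AsyncClient(timeout=10.0) as client:
        resp = await client.post(TEI_URL, json={"inputs": text})
        resp.raise_for_status()
        return resp.json()[0]

@app.post("/retrieve")
async def retrieve(query: str, top_k: int = 10, rerank_top_k: int = 5):
    # Step 1: Embed query
    query_embedding = await embed(query)

    # Step 2: Vector search (fast, approximate); the client is blocking,
    # so run it in a worker thread to keep the event loop free
    results = await asyncio.to_thread(
        qdrant.search,
        collection_name="documents",
        query_vector=query_embedding,
        limit=top_k,
        score_threshold=0.7,
    )
    if not results:
        return []  # nothing passed the score threshold, so there is nothing to rerank

    # Step 3: Rerank (precise, cross-encoder)
    passages = [r.payload["text"] for r in results]
    scores = await asyncio.to_thread(reranker.predict, [(query, p) for p in passages])

    # Step 4: Return top reranked results (cast NumPy floats so they serialize to JSON)
    ranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
    return [{"text": r.payload["text"], "score": float(s)} for r, s in ranked[:rerank_top_k]]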
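
The two-stage pattern above can be exercised without Qdrant or a cross-encoder. The following sketch is a toy stand-in under stated assumptions: stage 1 is brute-force cosine similarity over in-memory vectors, and `rerank_fn` is any hypothetical callable that scores a document (in the real service, the cross-encoder).

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: list[float],
    docs: list[dict],
    rerank_fn: Callable[[dict], float],
    top_k: int = 10,
    rerank_top_k: int = 5,
    score_threshold: float = 0.7,
) -> list[dict]:
    # Stage 1: cheap similarity search keeps at most top_k candidates
    # whose similarity clears the threshold.
    scored = [(cosine(query_vec, d["vector"]), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    candidates = [d for s, d in scored if s >= score_threshold][:top_k]

    # Stage 2: the expensive scorer only sees the stage-1 candidates.
    reranked = sorted(candidates, key=rerank_fn, reverse=True)
    return reranked[:rerank_top_k]
```

One consequence worth noting: the reranker can only reorder what stage 1 returns. A document that the vector search drops, for example because of the 0.7 threshold, cannot be recovered later, which is one reason to keep top_k larger than rerank_top_k.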

4. Generation with Retrieved Context

from openai import AsyncOpenAI

# vLLM serves an OpenAI-compatible API. Assumption: it is reachable at this
# in-cluster URL; the key is a placeholder unless the server enforces one.
vllm_client = AsyncOpenAI(base_url="http://vllm.ai-inference.svc:8000/v1", api_key="EMPTY")

@app.post("/generate")
async def generate(query: str):
    # Retrieve relevant context
    context_docs = await retrieve(query, top_k=10, rerank_top_k=5)
    context = "\n\n".join([d["text"] for d in context_docs])

    # Generate with LLM
    response = await vllm_client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": query}
        ],
        temperature=0.1,
        max_tokens=1024,
    )
    return {"answer": response.choices[0].message.content, "sources": context_docs}
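
The handler above joins every retrieved passage into the system prompt without a size limit or source markers. One possible refinement, sketched here under assumptions, numbers the passages so the model can cite them and stops adding passages once a budget is reached. The character budget is a crude proxy for tokens, and `build_context` and `build_messages` are hypothetical helpers.

```python
def build_context(docs: list[dict], max_chars: int = 6000) -> tuple[str, list[dict]]:
    """Number passages as [1], [2], ... and stop before exceeding max_chars."""
    parts: list[str] = []
    used: list[dict] = []
    total = 0
    for i, doc in enumerate(docs, start=1):
        block = f"[{i}] {doc['text']}"
        if total + len(block) > max_chars:
            break  # keep the highest-ranked passages, drop the rest
        parts.append(block)
        used.append(doc)
        total += len(block) + 2  # +2 for the blank-line separator
    return "\n\n".join(parts), used


def build_messages(query: str, docs: list[dict], max_chars: int = 6000):
    """Return chat messages plus the passages that actually made it into the prompt."""
    context, used = build_context(docs, max_chars)
    system = (
        "Answer using only the numbered context passages below. "
        "Cite passages as [n]. If the context does not contain the answer, say so.\n\n"
        + context
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]
    return messages, used
```

Returning `used` alongside the messages keeps the "sources" field honest: it lists only the passages the model actually saw.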

Chunking Strategies

Strategy	Best For	Chunk Size
Fixed-size	Simple documents	512 tokens
Recursive	Structured text (markdown, code)	512-1024 tokens
Semantic	Complex documents	Variable
Sentence-window	Q&A systems	3-5 sentences
Parent-child	Hierarchical docs	Parent: 2048 tokens, Child: 256 tokens
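
The parent-child row is the least self-explanatory entry in the table, so here is a minimal sketch of the idea: small child chunks are what get embedded and searched, and each child keeps a pointer to the larger parent chunk that is handed to the LLM. The sketch assumes words approximate tokens; `parent_child_chunks` is a hypothetical helper.

```python
def parent_child_chunks(
    text: str, parent_size: int = 2048, child_size: int = 256
) -> tuple[list[str], list[dict]]:
    """Return (parents, children); each child records the index of its parent.

    Assumption: whitespace-separated words stand in for tokens.
    """
    if parent_size <= 0 or child_size <= 0:
        raise ValueError("sizes must be positive")

    words = text.split()
    parents: list[str] = []
    children: list[dict] = []
    for p_start in range(0, len(words), parent_size):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = len(parents)
        parents.append(" ".join(parent_words))
        # Cut the parent into child chunks that never cross the parent boundary.
        for c_start in range(0, len(parent_words), child_size):
            child_words = parent_words[c_start:c_start + child_size]
            children.append({"parent_id": parent_id, "text": " ".join(child_words)})
    return parents, children
```

At query time the search runs over `children`, and the matching `parent_id` values select the parent passages to place in the prompt, which trades a little index size for more surrounding context per hit.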

Semantic Chunking Example

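# SemanticChunker ships in the langchain-experimental package, not in core langchain.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)

# document_text is the extracted text of a single document (a str).
chunks = splitter.split_text(document_text)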
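
To make the percentile setting less opaque, the sketch below shows the general idea behind percentile breakpoints: measure the cosine distance between neighbouring sentence embeddings and start a new chunk wherever the distance exceeds the chosen percentile of all distances. This is a simplified illustration that takes precomputed vectors as input, not LangChain's actual implementation; `semantic_split` and its helpers are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between ranks, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_split(sentences: list[str], vectors: list[list[float]], q: float = 95) -> list[str]:
    """Group sentences into chunks, breaking where neighbours are unusually far apart."""
    if not sentences:
        return []
    if len(sentences) == 1:
        return [sentences[0]]

    # distances[i] is the distance between sentence i and sentence i + 1.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, q)

    chunks: list[str] = []
    current = [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

A higher percentile means fewer breakpoints and therefore larger chunks, so 95 splits only at the sharpest topic shifts in a document.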

Caching Layer

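Redis can cache embeddings, retrieval results, and LLM responses. The Deployment below caps Redis at 4 GB with LRU eviction, and the helper after it caches retrieval results for one hour: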

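apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  # apps/v1 requires a selector that matches the pod template labels.
  # The label value is an illustrative choice.
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args: ["--maxmemory", "4gb", "--maxmemory-policy", "allkeys-lru"]
          resources:
            limits:
              memory: "5Gi"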

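import hashlib
import json

import redis.asyncio as redis

cache = redis.Redis(host="redis-cache.ai-inference.svc")

async def cached_retrieve(query: str):
    # MD5 only produces a compact cache key here; it is not used for security.
    key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
    cached = await cache.get(key)
    if cached is not None:
        return json.loads(cached)
    result = await retrieve(query)  # retrieve() is a coroutine, so it must be awaited
    await cache.setex(key, 3600, json.dumps(result))  # 1 hour TTL
    return result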
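
The key above hashes the raw query string only, so "What is RAG?" and "what is  rag?" miss each other, while calls with different top_k values collide. A possible alternative, sketched under the assumption that case and spacing differences do not change the intended query, normalizes the text and folds the retrieval parameters into the key. `cache_key` is a hypothetical helper.

```python
import hashlib
import json


def cache_key(
    query: str,
    *,
    top_k: int = 10,
    rerank_top_k: int = 5,
    collection: str = "documents",
    prefix: str = "rag",
) -> str:
    """Build a cache key from the normalized query and the retrieval parameters."""
    # Assumption: lowercasing and collapsing whitespace is safe for this corpus.
    normalized = " ".join(query.lower().split())
    payload = json.dumps(
        {"q": normalized, "top_k": top_k, "rerank_top_k": rerank_top_k, "collection": collection},
        sort_keys=True,  # stable field order gives a stable hash
    )
    return f"{prefix}:{hashlib.sha256(payload.encode('utf-8')).hexdigest()}"
```

If queries can contain case-sensitive identifiers such as product codes, drop the lowercasing step and keep only the whitespace normalization.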

Scaling Considerations

Component	Scaling Strategy	Bottleneck
Embedding service	Horizontal (KEDA on queue depth)	GPU compute
Vector DB	Sharding + replicas	Memory + disk I/O
Reranker	Horizontal (CPU or GPU)	Cross-encoder inference
LLM	Horizontal (KEDA on TTFT)	GPU memory
Cache	Redis Cluster	Memory
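
For the vector DB row, a back-of-envelope memory estimate helps decide when sharding becomes necessary. The sketch below uses the rule of thumb from Qdrant's capacity-planning guidance (vectors × dimensions × 4 bytes × 1.5 overhead) and the 1024-dimensional output of BAAI/bge-large-en-v1.5. Treat the result as a starting point: payloads, payload indexes, and quantization are not modelled, and `estimate_vector_memory_gib` is a hypothetical helper.

```python
def estimate_vector_memory_gib(
    num_vectors: int,
    dim: int = 1024,           # output size of BAAI/bge-large-en-v1.5
    bytes_per_value: int = 4,  # float32
    overhead: float = 1.5,     # rule-of-thumb factor for index and metadata
) -> float:
    """Rough RAM needed to hold the vectors in memory, in GiB."""
    if num_vectors < 0 or dim <= 0:
        raise ValueError("num_vectors must be >= 0 and dim must be > 0")
    return num_vectors * dim * bytes_per_value * overhead / 2**30
```

Under these assumptions, one million chunks need roughly 5.7 GiB, so a collection in the tens of millions of chunks is where sharding or quantization starts to matter on typical nodes.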

Production Checklist

Document versioning (re-embed on update, not append)
Metadata filtering (date, source, category)
Hybrid search (vector + BM25 keyword)
Citation extraction (return source documents)
Guardrails (content filtering, hallucination detection)
A/B testing (chunking strategies, embedding models)
Monitoring (retrieval precision, generation quality)
Cost tracking (embedding + search + generation per query)
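
For the hybrid search item, one common way to merge a vector ranking with a BM25 ranking is Reciprocal Rank Fusion (RRF), which needs only the rank positions and not comparable scores. The sketch below is a generic illustration, not a description of how any particular vector database implements it; the constant k = 60 is the value commonly used in the literature, and `reciprocal_rank_fusion` is a hypothetical helper.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Documents that appear near the top of both lists rise to the front, while a document found by only one retriever still survives into the fused list for the reranker to judge.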
