AI

Retrieval-Augmented Generation at Scale: Architecture Patterns for Enterprise RAG

Luca Berton • 2 min read
#ai #rag #architecture #enterprise #llm

πŸ” RAG Beyond the Demo

Every AI demo uses RAG. Few do it well in production. The gap between a prototype that queries a vector database and an enterprise system that serves thousands of users with accurate, sourced answers is enormous.

Here’s what I’ve learned deploying RAG systems at scale.

Architecture Overview

Documents → Ingestion Pipeline → Vector DB + Metadata Store
                                        ↓
User Query → Query Planner → Hybrid Search → Reranker → LLM → Response
                                                              ↓
                                                        Citation Check

Chunking Strategies That Actually Work

The most impactful decision in RAG isn’t your model or vector database — it’s how you chunk documents.

Semantic Chunking

Instead of fixed-size chunks, split on semantic boundaries:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Bad: Fixed-size chunks break context
bad_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

# Better: Respect document structure
good_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)

Hierarchical Chunking

Store chunks at multiple granularities:

class HierarchicalChunker:
    def chunk(self, document):
        # Level 1: Full sections (for context)
        sections = self.split_by_headings(document)
        
        # Level 2: Paragraphs (for retrieval)
        paragraphs = []
        for section in sections:
            for para in self.split_paragraphs(section):
                para.metadata["parent_section"] = section.id
                paragraphs.append(para)
        
        return sections, paragraphs

When a paragraph matches, retrieve its parent section for full context. This dramatically improves answer quality.
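The parent-section lookup can be sketched like this (a minimal, assumed implementation; `Para` and `expand_to_parents` are illustrative names, not part of the chunker above):

```python
from dataclasses import dataclass, field

@dataclass
class Para:
    """Stand-in for a retrieved paragraph chunk with metadata."""
    id: str
    metadata: dict = field(default_factory=dict)

def expand_to_parents(matched_paragraphs, sections_by_id):
    """Replace each matched paragraph with its parent section,
    deduplicated while preserving retrieval rank order."""
    seen, expanded = set(), []
    for para in matched_paragraphs:
        parent_id = para.metadata["parent_section"]
        if parent_id not in seen:
            seen.add(parent_id)
            expanded.append(sections_by_id[parent_id])
    return expanded
```

Deduplication matters here: several paragraphs from the same section often match the same query, and you only want to send that section to the LLM once.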

Metadata Enrichment

Every chunk needs rich metadata:

chunk_metadata = {
    "source": "architecture-guide-v3.pdf",
    "page": 42,
    "section": "Security Requirements",
    "author": "Platform Team",
    "last_updated": "2026-01-15",
    "document_type": "technical_spec",
    "access_level": "internal",
    "chunk_index": 7,
    "total_chunks": 23,
}
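One concrete payoff of rich metadata is pre-filtering retrieval, for example by access level. A minimal sketch, using plain dicts as stand-in stored chunks (`filter_chunks` is an illustrative helper, not a specific vector-DB API):

```python
def filter_chunks(chunks: list[dict], **required) -> list[dict]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in required.items())
    ]

chunks = [
    {"id": 1, "metadata": {"access_level": "internal", "document_type": "technical_spec"}},
    {"id": 2, "metadata": {"access_level": "public", "document_type": "blog_post"}},
]

# Only internal technical specs survive the filter
internal_specs = filter_chunks(chunks, access_level="internal", document_type="technical_spec")
```

Production vector databases such as Milvus, Qdrant, and Weaviate can push this kind of filter into the search itself, which is far more efficient than post-filtering results in application code.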

Vector Database Selection

| Database | Best For | Kubernetes-Native | Hybrid Search |
|----------|----------|-------------------|---------------|
| Milvus | Large scale (100M+ vectors) | ✅ (Helm chart) | ✅ |
| Qdrant | Mid-scale, filtering | ✅ (Operator) | ✅ |
| Weaviate | Multi-modal, GraphQL | ✅ (Helm) | ✅ |
| pgvector | Small scale, existing Postgres | Via operator | Limited |
| ChromaDB | Prototypes only | ❌ | ❌ |

For enterprise Kubernetes deployments, I recommend Milvus for scale or Qdrant for simplicity.

Hybrid Search: The Secret Weapon

Pure vector search misses exact matches. Pure keyword search misses semantics. Combine both:

async def hybrid_search(query: str, top_k: int = 20) -> list[Chunk]:
    # Semantic search
    embedding = await embed(query)
    vector_results = await vector_db.search(embedding, top_k=top_k)
    
    # Keyword search (BM25)
    keyword_results = await search_engine.search(query, top_k=top_k)
    
    # Reciprocal Rank Fusion
    combined = reciprocal_rank_fusion(
        [vector_results, keyword_results],
        k=60,
    )
    
    return combined[:top_k]

def reciprocal_rank_fusion(result_lists, k=60):
    # Fuse multiple ranked lists: each doc scores 1/(k + rank + 1) per list.
    # Keep the doc objects so callers get chunks back, not bare IDs.
    scores = {}
    docs_by_id = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            docs_by_id[doc.id] = doc
            scores[doc.id] = scores.get(doc.id, 0.0) + 1 / (k + rank + 1)
    
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs_by_id[doc_id] for doc_id in ranked_ids]

Reranking

Initial retrieval casts a wide net. A reranker picks the best results:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, chunks: list[Chunk], top_k: int = 5) -> list[Chunk]:
    pairs = [(query, chunk.text) for chunk in chunks]
    scores = reranker.predict(pairs)
    
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, score in ranked[:top_k]]

Scaling on Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: rag-service
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
        env:
        - name: VECTOR_DB_HOST
          value: "milvus.vector-db.svc.cluster.local"
        - name: RERANKER_MODEL
          value: "cross-encoder/ms-marco-MiniLM-L-12-v2"
        - name: CACHE_TTL_SECONDS
          value: "3600"
---
# HPA for automatic scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Evaluation: How to Know Your RAG Works

You can’t improve what you don’t measure:

class RAGEvaluator:
    async def evaluate(self, test_set: list[QAPair]):
        results = []
        for qa in test_set:
            response = await rag_pipeline.query(qa.question)
            
            results.append({
                "faithfulness": self.check_faithfulness(response, qa.context),
                "relevance": self.check_relevance(response, qa.question),
                "correctness": self.check_correctness(response, qa.answer),
                "has_citations": bool(response.citations),
                "latency_ms": response.latency_ms,
            })
        
        return aggregate_metrics(results)

Track these metrics weekly. RAG quality degrades as your document corpus changes.
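The `check_faithfulness` call above is deliberately abstract. As a rough illustration only, here is a crude lexical-overlap proxy (an assumption for this sketch; production systems usually score faithfulness with an LLM-as-judge instead):

```python
def check_faithfulness(response_text: str, context_text: str) -> float:
    """Crude lexical proxy for faithfulness: the fraction of response tokens
    that also appear in the retrieved context. 1.0 means every response token
    is grounded in the context; lower values suggest possible hallucination."""
    response_tokens = set(response_text.lower().split())
    context_tokens = set(context_text.lower().split())
    if not response_tokens:
        return 0.0
    return len(response_tokens & context_tokens) / len(response_tokens)
```

A token-overlap score is cheap enough to run on every request, which makes it useful as a first-line alert even if an LLM judge produces the authoritative weekly numbers.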

Key Takeaways

  1. Chunking is everything — invest time in semantic, hierarchical chunking with rich metadata
  2. Hybrid search beats vector-only — combine BM25 and embeddings with reciprocal rank fusion
  3. Always rerank — a lightweight cross-encoder dramatically improves precision
  4. Evaluate continuously — build a test set and track faithfulness, relevance, and correctness
  5. Cache aggressively — similar queries get similar results; cache at the embedding and response level
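The caching takeaway can be sketched as a normalized-key embedding cache (a minimal in-memory version; `embed_fn` stands in for whatever embedding call your stack uses, and in production you would back this with Redis plus a TTL, as the `CACHE_TTL_SECONDS` setting above suggests):

```python
import hashlib

def cache_key(query: str) -> str:
    """Normalize whitespace and case so near-identical queries share a key."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class EmbeddingCache:
    """In-memory embedding cache keyed on the normalized query."""
    def __init__(self):
        self._store: dict[str, list[float]] = {}

    def get_or_compute(self, query: str, embed_fn):
        key = cache_key(query)
        if key not in self._store:
            self._store[key] = embed_fn(query)
        return self._store[key]
```

The same keying trick works one level up for full responses, though response caches need shorter TTLs since the underlying corpus changes.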

Building an enterprise RAG system? I help organizations design production-grade retrieval architectures. Get in touch.


Luca Berton

AI & Cloud Advisor with 18+ years experience. Author of 8 technical books, creator of Ansible Pilot, and instructor at CopyPasteLearn Academy. Speaker at KubeCon EU & Red Hat Summit 2026.
