# RAG Beyond the Demo

Every AI demo uses RAG. Few do it well in production. The gap between a prototype that queries a vector database and an enterprise system that serves thousands of users with accurate, sourced answers is enormous.

Here's what I've learned deploying RAG systems at scale.
## Architecture Overview

```text
Documents → Ingestion Pipeline → Vector DB + Metadata Store
                                              ↓
User Query → Query Planner → Hybrid Search → Reranker → LLM → Response
                                                               ↓
                                                        Citation Check
```
## Chunking Strategies That Actually Work

The most impactful decision in RAG isn't your model or vector database: it's how you chunk documents.

### Semantic Chunking
Instead of fixed-size chunks, split on semantic boundaries:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Bad: fixed-size chunks break context mid-sentence
bad_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

# Better: respect document structure first, then fall back to smaller units
good_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)
```
### Hierarchical Chunking
Store chunks at multiple granularities:
```python
class HierarchicalChunker:
    def chunk(self, document):
        # Level 1: full sections (for context)
        sections = self.split_by_headings(document)

        # Level 2: paragraphs (for retrieval), each tagged with its parent
        paragraphs = []
        for section in sections:
            for para in self.split_paragraphs(section):
                para.metadata["parent_section"] = section.id
                paragraphs.append(para)

        return sections, paragraphs
```
When a paragraph matches, retrieve its parent section for full context. This dramatically improves answer quality.
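The parent lookup can be sketched like this. The `Chunk` dataclass and the in-memory `section_store` dict are illustrative stand-ins, not from any particular library:

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def expand_to_sections(matches: list[Chunk],
                       section_store: dict[str, Chunk]) -> list[Chunk]:
    """Swap each matched paragraph for its parent section, de-duplicated."""
    seen: set[str] = set()
    expanded: list[Chunk] = []
    for para in matches:
        parent_id = para.metadata.get("parent_section")
        section = section_store.get(parent_id)
        if section and section.id not in seen:
            seen.add(section.id)
            expanded.append(section)
    return expanded
```

De-duplication matters here: several paragraphs from the same section often match the same query, and you only want to send the section to the LLM once.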
### Rich Metadata

Every chunk needs rich metadata:
```python
chunk_metadata = {
    "source": "architecture-guide-v3.pdf",
    "page": 42,
    "section": "Security Requirements",
    "author": "Platform Team",
    "last_updated": "2026-01-15",
    "document_type": "technical_spec",
    "access_level": "internal",
    "chunk_index": 7,
    "total_chunks": 23,
}
```
## Vector Database Selection

| Database | Best For | Kubernetes-Native | Hybrid Search |
|---|---|---|---|
| Milvus | Large scale (100M+ vectors) | ✅ (Helm chart) | ✅ |
| Qdrant | Mid-scale, filtering | ✅ (Operator) | ✅ |
| Weaviate | Multi-modal, GraphQL | ✅ (Helm) | ✅ |
| pgvector | Small scale, existing Postgres | Via operator | Limited |
| ChromaDB | Prototypes only | ❌ | ❌ |
For enterprise Kubernetes deployments, I recommend Milvus for scale or Qdrant for simplicity.
## Hybrid Search: The Secret Weapon
Pure vector search misses exact matches. Pure keyword search misses semantics. Combine both:
```python
async def hybrid_search(query: str, top_k: int = 20) -> list[Chunk]:
    # Semantic search
    embedding = await embed(query)
    vector_results = await vector_db.search(embedding, top_k=top_k)

    # Keyword search (BM25)
    keyword_results = await search_engine.search(query, top_k=top_k)

    # Reciprocal Rank Fusion
    combined = reciprocal_rank_fusion(
        [vector_results, keyword_results],
        k=60,
    )
    return combined[:top_k]


def reciprocal_rank_fusion(result_lists, k=60):
    scores = {}
    docs = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            docs[doc.id] = doc
            scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)
    # Return the documents themselves, highest fused score first
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]
```
## Reranking
Initial retrieval casts a wide net. A reranker picks the best results:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, chunks: list[Chunk], top_k: int = 5) -> list[Chunk]:
    pairs = [(query, chunk.text) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, score in ranked[:top_k]]
```
## Scaling on Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
        - name: rag-service
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "4Gi"
          env:
            - name: VECTOR_DB_HOST
              value: "milvus.vector-db.svc.cluster.local"
            - name: RERANKER_MODEL
              value: "cross-encoder/ms-marco-MiniLM-L-12-v2"
            - name: CACHE_TTL_SECONDS
              value: "3600"
---
# HPA for automatic scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
## Evaluation: How to Know Your RAG Works

You can't improve what you don't measure:
```python
class RAGEvaluator:
    async def evaluate(self, test_set: list[QAPair]):
        results = []
        for qa in test_set:
            response = await rag_pipeline.query(qa.question)
            results.append({
                "faithfulness": self.check_faithfulness(response, qa.context),
                "relevance": self.check_relevance(response, qa.question),
                "correctness": self.check_correctness(response, qa.answer),
                "has_citations": bool(response.citations),
                "latency_ms": response.latency_ms,
            })
        return aggregate_metrics(results)
```
Track these metrics weekly. RAG quality degrades as your document corpus changes.
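The `aggregate_metrics` helper above is left undefined; a minimal version, assuming each result row is a dict of numeric scores (booleans counting as 0/1), just reports the per-metric mean:

```python
def aggregate_metrics(results: list[dict]) -> dict[str, float]:
    """Average each metric across all evaluated question/answer pairs."""
    totals: dict[str, float] = {}
    for row in results:
        for metric, value in row.items():
            totals[metric] = totals.get(metric, 0.0) + float(value)
    n = len(results) or 1  # avoid division by zero on an empty test set
    return {metric: total / n for metric, total in totals.items()}
```

Means are a starting point; in practice you also want percentiles for latency, since a p99 regression hides easily inside an average.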
## Key Takeaways

- Chunking is everything: invest time in semantic, hierarchical chunking with rich metadata
- Hybrid search beats vector-only: combine BM25 and embeddings with reciprocal rank fusion
- Always rerank: a lightweight cross-encoder dramatically improves precision
- Evaluate continuously: build a test set and track faithfulness, relevance, and correctness
- Cache aggressively: similar queries get similar results; cache at the embedding and response level
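A response cache keyed on the normalized query is the simplest version of that last point; a minimal TTL sketch, assuming exact-match lookups (caching for merely *similar* queries means matching on embedding distance instead of a hash key):

```python
import hashlib
import time


class TTLCache:
    """Exact-match response cache with a time-to-live per entry."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def _key(self, query: str) -> str:
        # Normalize so trivially different phrasings hit the same entry
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]
        return None

    def set(self, query: str, value) -> None:
        self._store[self._key(query)] = (time.monotonic(), value)
```

Keep the TTL short for corpora that change daily; a stale cached answer with confident citations is worse than a cache miss.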
Building an enterprise RAG system? I help organizations design production-grade retrieval architectures. Get in touch.