# RAG Beyond the Demo

Every AI demo uses RAG. Few do it well in production. The gap between a prototype that queries a vector database and an enterprise system that serves thousands of users with accurate, sourced answers is enormous.

Here's what I've learned deploying RAG systems at scale.
## Architecture Overview

```text
Documents → Ingestion Pipeline → Vector DB + Metadata Store
                                              ↓
User Query → Query Planner → Hybrid Search → Reranker → LLM → Response
                                                               ↓
                                                        Citation Check
```
## Chunking Strategies That Actually Work

The most impactful decision in RAG isn't your model or vector database: it's how you chunk documents.

### Semantic Chunking
Instead of fixed-size chunks, split on semantic boundaries:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Bad: fixed-size chunks break context mid-sentence
bad_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

# Better: respect document structure first, then fall back to smaller units
good_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)
```
### Hierarchical Chunking
Store chunks at multiple granularities:
```python
class HierarchicalChunker:
    def chunk(self, document):
        # Level 1: full sections (for context)
        sections = self.split_by_headings(document)

        # Level 2: paragraphs (for retrieval), each tagged with its parent
        paragraphs = []
        for section in sections:
            for para in self.split_paragraphs(section):
                para.metadata["parent_section"] = section.id
                paragraphs.append(para)

        return sections, paragraphs
```
When a paragraph matches, retrieve its parent section for full context. This dramatically improves answer quality.
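The parent lookup can be sketched like this. The `Chunk` dataclass and the in-memory `section_store` dict are illustrative stand-ins, not from any particular library:

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def expand_to_sections(matches: list[Chunk],
                       section_store: dict[str, Chunk]) -> list[Chunk]:
    """Swap each matched paragraph for its parent section, de-duplicated."""
    seen: set[str] = set()
    expanded: list[Chunk] = []
    for para in matches:
        parent_id = para.metadata.get("parent_section")
        section = section_store.get(parent_id)
        if section and section.id not in seen:
            seen.add(section.id)
            expanded.append(section)
    return expanded
```

De-duplication matters here: several paragraphs from the same section often match the same query, and you only want to send the section to the LLM once.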
### Rich Metadata

Every chunk needs rich metadata:
```python
chunk_metadata = {
    "source": "architecture-guide-v3.pdf",
    "page": 42,
    "section": "Security Requirements",
    "author": "Platform Team",
    "last_updated": "2026-01-15",
    "document_type": "technical_spec",
    "access_level": "internal",
    "chunk_index": 7,
    "total_chunks": 23,
}
```
## Vector Database Selection

| Database | Best For | Kubernetes-Native | Hybrid Search |
|---|---|---|---|
| Milvus | Large scale (100M+ vectors) | ✅ (Helm chart) | ✅ |
| Qdrant | Mid-scale, filtering | ✅ (Operator) | ✅ |
| Weaviate | Multi-modal, GraphQL | ✅ (Helm) | ✅ |
| pgvector | Small scale, existing Postgres | Via operator | Limited |
| ChromaDB | Prototypes only | ❌ | ❌ |
For enterprise Kubernetes deployments, I recommend Milvus for scale or Qdrant for simplicity.
## Hybrid Search: The Secret Weapon
Pure vector search misses exact matches. Pure keyword search misses semantics. Combine both:
```python
async def hybrid_search(query: str, top_k: int = 20) -> list[Chunk]:
    # Semantic search
    embedding = await embed(query)
    vector_results = await vector_db.search(embedding, top_k=top_k)

    # Keyword search (BM25)
    keyword_results = await search_engine.search(query, top_k=top_k)

    # Reciprocal Rank Fusion
    combined = reciprocal_rank_fusion(
        [vector_results, keyword_results],
        k=60,
    )
    return combined[:top_k]


def reciprocal_rank_fusion(result_lists, k=60):
    scores = {}
    docs = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            docs[doc.id] = doc
            scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)
    # Return the documents themselves, highest fused score first
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]
```
## Reranking
Initial retrieval casts a wide net. A reranker picks the best results:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, chunks: list[Chunk], top_k: int = 5) -> list[Chunk]:
    pairs = [(query, chunk.text) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, score in ranked[:top_k]]
```
## Scaling on Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
        - name: rag-service
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "4Gi"
          env:
            - name: VECTOR_DB_HOST
              value: "milvus.vector-db.svc.cluster.local"
            - name: RERANKER_MODEL
              value: "cross-encoder/ms-marco-MiniLM-L-12-v2"
            - name: CACHE_TTL_SECONDS
              value: "3600"
---
# HPA for automatic scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
## Evaluation: How to Know Your RAG Works

You can't improve what you don't measure:
```python
class RAGEvaluator:
    async def evaluate(self, test_set: list[QAPair]):
        results = []
        for qa in test_set:
            response = await rag_pipeline.query(qa.question)
            results.append({
                "faithfulness": self.check_faithfulness(response, qa.context),
                "relevance": self.check_relevance(response, qa.question),
                "correctness": self.check_correctness(response, qa.answer),
                "has_citations": bool(response.citations),
                "latency_ms": response.latency_ms,
            })
        return aggregate_metrics(results)
```
Track these metrics weekly. RAG quality degrades as your document corpus changes.
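The `aggregate_metrics` helper above is left undefined; a minimal version, assuming each result row is a dict of numeric scores (booleans counting as 0/1), just reports the per-metric mean:

```python
def aggregate_metrics(results: list[dict]) -> dict[str, float]:
    """Average each metric across all evaluated question/answer pairs."""
    totals: dict[str, float] = {}
    for row in results:
        for metric, value in row.items():
            totals[metric] = totals.get(metric, 0.0) + float(value)
    n = len(results) or 1  # avoid division by zero on an empty test set
    return {metric: total / n for metric, total in totals.items()}
```

Means are a starting point; in practice you also want percentiles for latency, since a p99 regression hides easily inside an average.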
## Key Takeaways

- Chunking is everything: invest time in semantic, hierarchical chunking with rich metadata
- Hybrid search beats vector-only: combine BM25 and embeddings with reciprocal rank fusion
- Always rerank: a lightweight cross-encoder dramatically improves precision
- Evaluate continuously: build a test set and track faithfulness, relevance, and correctness
- Cache aggressively: similar queries get similar results; cache at the embedding and response level
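A response cache keyed on the normalized query is the simplest version of that last point; a minimal TTL sketch, assuming exact-match lookups (caching for merely *similar* queries means matching on embedding distance instead of a hash key):

```python
import hashlib
import time


class TTLCache:
    """Exact-match response cache with a time-to-live per entry."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def _key(self, query: str) -> str:
        # Normalize so trivially different phrasings hit the same entry
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]
        return None

    def set(self, query: str, value) -> None:
        self._store[self._key(query)] = (time.monotonic(), value)
```

Keep the TTL short for corpora that change daily; a stale cached answer with confident citations is worse than a cache miss.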
Building an enterprise RAG system? I help organizations design production-grade retrieval architectures. Get in touch.