
RAG Architecture for Enterprise Knowledge

Production RAG is nothing like the tutorials. Handle document versioning, access control, hybrid search, re-ranking, and hallucination detection.

Luca Berton · 2 min read

Retrieval-Augmented Generation (RAG) bridges the gap between general-purpose LLMs and your organization's proprietary knowledge. Instead of fine-tuning a model on your data (expensive, slow, goes stale), RAG retrieves relevant context at query time and feeds it to the LLM.

How RAG Works

User Query → Embedding → Vector Search → Top-K Documents → LLM + Context → Answer
  1. User asks a question: "What is our SLA for tier-1 customers?"
  2. Query embedding: Convert the question to a vector (1536 dimensions with OpenAI text-embedding-3-small; commonly 384 to 1024 with open models)
  3. Vector search: Find the most similar document chunks in the vector database
  4. Context assembly: Combine the top-K results into a prompt
  5. LLM generation: The model answers from the retrieved context rather than relying solely on its training data
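
The five steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the sample chunks are hypothetical. The final LLM call is left out; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Steps 2-3: embed the query and return the top_k most similar chunks."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Step 4: assemble the retrieved chunks and the question into one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"

# Hypothetical document chunks, for illustration only
chunks = [
    "Tier-1 customers have an SLA of 99.9% uptime.",
    "The cafeteria opens at 8am.",
    "Support tickets for tier-1 customers are answered within one hour.",
]
question = "What is our SLA for tier-1 customers?"
top = retrieve(question, chunks, top_k=2)
prompt = build_prompt(question, top)  # Step 5 would send this prompt to the LLM
```

Swapping `embed` for a real embedding model and the list for a vector database changes the quality of retrieval, but not the shape of the flow.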

Architecture Components

Document Ingestion Pipeline

Raw Documents → Chunking → Embedding → Vector Store
     ↓              ↓           ↓           ↓
  PDF, Docs,    Split into   Convert to   Store in
  Confluence,   overlapping  vectors via  Qdrant,
  Slack, Git    chunks       embedding    Weaviate,
                (512-1024    model        Milvus,
                 tokens)                  pgvector

Chunking Strategy

Chunking is arguably the most important decision in a RAG pipeline: poor chunking leads to poor retrieval, which leads to poor answers.

# Bad: Fixed-size chunks break mid-sentence
chunks = [text[i:i+500] for i in range(0, len(text), 500)]

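# Better: Recursive splitting on natural boundaries, with overlap
# (chunk_size and chunk_overlap are measured in characters by default, not tokens)
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Older LangChain releases: from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]  # try paragraphs first, then lines, sentences, words
)
chunks = splitter.split_text(text)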

# Best: Document-structure-aware chunking
# Split on headings, preserve sections, keep tables intact
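
One way the structure-aware option could look is sketched below, assuming the source documents are Markdown. The helper `split_by_headings` and the sample document are hypothetical; the sketch simply starts a new chunk at every heading line, so a table stays in the same chunk as the section that contains it.

```python
import re

def split_by_headings(markdown: str) -> list[dict]:
    """Return one chunk per heading-delimited section of a Markdown document."""
    sections = []
    heading, lines = "", []

    def flush() -> None:
        # Keep a section only if it has a heading or some non-blank text
        if heading or any(ln.strip() for ln in lines):
            sections.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):  # a Markdown heading starts a new section
            flush()
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return sections

# Hypothetical document, for illustration only
doc = """# SLA Policy
Tier-1 customers get 99.9% uptime.

## Response Times
| Tier | First response |
|------|----------------|
| 1    | 1 hour         |

## Escalation
Page the on-call engineer.
"""
sections = split_by_headings(doc)
```

The sketch does not handle `#` characters inside fenced code, and very long sections would still need a second pass with a size-based splitter.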

Vector Database Selection

| Database | Type | Strengths | Weaknesses |
|----------|------|-----------|------------|
| pgvector | Extension | Uses existing Postgres, simple | Limited scale |
| Qdrant | Purpose-built | Fast, rich filtering | Another service to manage |
| Weaviate | Purpose-built | Multi-modal, GraphQL API | Complex setup |
| Milvus | Purpose-built | Massive scale, GPU-accelerated | Operational overhead |
| Pinecone | Managed SaaS | Zero ops, fast | Vendor lock-in, cost at scale |
| ChromaDB | Embedded | Simple, great for prototyping | Not for production scale |

Embedding Models

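| Model | Dimensions | Speed / hosting | Quality |
|-------|------------|-----------------|---------|
| OpenAI text-embedding-3-large | 3072 | API latency | Excellent |
| OpenAI text-embedding-3-small | 1536 | API latency | Very good |
| BGE-large-en-v1.5 | 1024 | Self-hosted | Very good |
| E5-large-v2 | 1024 | Self-hosted | Good |
| all-MiniLM-L6-v2 | 384 | Very fast | Acceptable |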

For enterprise use, self-hosted models (BGE, E5) avoid sending data to external APIs.
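
Dimension count also drives storage. As a rough worked example, the sketch below estimates the raw size of the stored vectors, assuming float32 values (4 bytes each) and a hypothetical corpus of one million chunks. It ignores index structures and metadata, which add overhead that varies by database.

```python
def raw_vector_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in GB (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value / 1e9

# Dimensions taken from the table above; the corpus size is an assumption
NUM_CHUNKS = 1_000_000
sizes = {
    name: raw_vector_size_gb(NUM_CHUNKS, dims)
    for name, dims in [
        ("text-embedding-3-large", 3072),
        ("text-embedding-3-small", 1536),
        ("BGE-large-en-v1.5", 1024),
        ("all-MiniLM-L6-v2", 384),
    ]
}
for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB")
```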

Advanced RAG Patterns

Hybrid Search

Combine vector search with keyword search for better retrieval:

# Vector search finds semantically similar documents
vector_results = vector_db.search(query_embedding, top_k=10)

# BM25 keyword search finds exact term matches
keyword_results = bm25_index.search(query_text, top_k=10)

# Reciprocal Rank Fusion combines both
final_results = reciprocal_rank_fusion(vector_results, keyword_results)
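
The snippet above leaves `reciprocal_rank_fusion` undefined. A minimal sketch of it follows, assuming each result list is simply an ordered list of document IDs (best first). Each document scores `1 / (k + rank)` per list it appears in, and the scores are summed; `k=60` is the constant commonly used for this technique. The document IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(*result_lists: list[str], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; higher fused score comes first."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked results from each retriever
vector_results = ["doc_a", "doc_b", "doc_c"]
keyword_results = ["doc_c", "doc_a", "doc_d"]
final_results = reciprocal_rank_fusion(vector_results, keyword_results)
```

Because it uses ranks rather than raw scores, this fusion does not require cosine similarities and BM25 scores to be on a comparable scale.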

Multi-Query RAG

Generate multiple search queries from a single user question:

# User asks: "How does our authentication work?"
# Generate search queries:
queries = [
    "authentication system architecture",
    "login flow and OAuth implementation",
    "user session management",
    "SSO and identity provider integration"
]
# Search with all queries, deduplicate results
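
The search-and-deduplicate step might look like the sketch below. It assumes a `search` callable that returns `(doc_id, score)` pairs; `fake_search` and its canned results are hypothetical stand-ins for a real vector search, and the LLM call that generates the query variants is not shown. When a document is returned by several queries, the sketch keeps its best score.

```python
from typing import Callable

SearchFn = Callable[[str], list[tuple[str, float]]]

def multi_query_search(queries: list[str], search: SearchFn, top_k: int = 5) -> list[tuple[str, float]]:
    """Run every query, keep the best score per document, return the top_k."""
    best: dict[str, float] = {}
    for query in queries:
        for doc_id, score in search(query):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Hypothetical canned results standing in for a real vector search
FAKE_INDEX = {
    "authentication system architecture": [("auth-overview", 0.91), ("sso-guide", 0.62)],
    "login flow and OAuth implementation": [("oauth-flow", 0.88), ("auth-overview", 0.75)],
    "user session management": [("session-mgmt", 0.84)],
    "SSO and identity provider integration": [("sso-guide", 0.93), ("oauth-flow", 0.60)],
}

def fake_search(query: str) -> list[tuple[str, float]]:
    return FAKE_INDEX.get(query, [])

results = multi_query_search(list(FAKE_INDEX), fake_search)
```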

Parent Document Retrieval

Store small chunks for precise retrieval, but return the full parent document for context:

# Index small chunks (256 tokens) for precise matching
# But return the parent section (2000 tokens) for full context
# This gives the LLM enough surrounding context to answer well
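
A small sketch of this pattern follows. It is deliberately simplified: chunks are sized in words rather than tokens, matching is plain word overlap instead of vector similarity, and the parent documents are hypothetical. The point it illustrates is the mapping from each small chunk back to its parent.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set used for toy overlap scoring."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def build_child_index(parents: dict[str, str], chunk_words: int = 8) -> list[tuple[str, str]]:
    """Split each parent into small chunks and remember which parent each came from."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for i in range(0, len(words), chunk_words):
            children.append((" ".join(words[i:i + chunk_words]), parent_id))
    return children

def retrieve_parents(query: str, children: list[tuple[str, str]],
                     parents: dict[str, str], top_k: int = 1) -> list[str]:
    """Match against small chunks, but return the deduplicated parent texts."""
    q = tokens(query)
    ranked = sorted(children, key=lambda c: len(q & tokens(c[0])), reverse=True)
    parent_ids: list[str] = []
    for _, parent_id in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
        if len(parent_ids) == top_k:
            break
    return [parents[pid] for pid in parent_ids]

# Hypothetical parent sections, for illustration only
parents = {
    "sla-policy": "Tier-1 customers receive 99.9% uptime. Credits apply when monthly "
                  "uptime falls below the target. Claims must be filed within 30 days.",
    "onboarding": "New hires get a laptop on day one. Badge photos are taken in the lobby.",
}
children = build_child_index(parents)
context = retrieve_parents("uptime credits", children, parents)
```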

Production Considerations

  1. Freshness: Schedule re-ingestion for frequently updated sources (Confluence, Git)
  2. Access control: Filter results based on user permissions; do not let RAG bypass document ACLs
  3. Evaluation: Measure retrieval quality (recall@k) and answer quality (faithfulness, relevance)
  4. Caching: Cache common query embeddings and their results
  5. Monitoring: Track retrieval latency, cache hit rate, LLM token usage, and answer quality scores
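
Two of the items above, access control and evaluation, can be illustrated briefly. The sketch assumes each retrieved result carries an `allowed_groups` metadata field (a hypothetical schema) and that a labeled set of relevant document IDs exists for the evaluation query.

```python
def filter_by_acl(results: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only results the user may read (any shared group grants access)."""
    return [r for r in results if user_groups & set(r["allowed_groups"])]

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

# Hypothetical search results with ACL metadata
results = [
    {"id": "hr-salaries", "allowed_groups": ["hr"]},
    {"id": "sla-policy", "allowed_groups": ["support", "sales"]},
    {"id": "runbook", "allowed_groups": ["engineering", "support"]},
]
visible = filter_by_acl(results, user_groups={"support"})
visible_ids = [r["id"] for r in visible]

# Hypothetical ground truth for one evaluation query
recall = recall_at_k(visible_ids, relevant_ids={"sla-policy", "pricing"}, k=2)
```

Filtering after retrieval, as here, can leave fewer than top-K results. Where the vector database supports metadata filters, applying the permission filter inside the search query is usually preferable.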

Cost Estimation

Monthly RAG cost for 10K documents, 1K queries/day:

Embedding (one-time): 10K docs × avg 2K tokens = 20M tokens → $2.60
Vector DB: Qdrant cloud small instance → $25/month
LLM inference: 1K queries/day × 2K context tokens × 30 days = 60M tokens → ~$90/month
Re-embedding (weekly): $0.65/week × 4 weeks → $2.60/month

Total: ~$120/month for 30K queries
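
The arithmetic behind this estimate can be reproduced with a short script. The per-token prices are not stated in the article; they are the rates implied by its figures ($2.60 for 20M embedding tokens, about $90 for 60M LLM tokens) and should be treated as assumptions to replace with current pricing. Output tokens are not counted, matching the estimate above.

```python
# Rates implied by the figures above (assumptions, not quoted prices)
EMBED_PRICE_PER_M_TOKENS = 0.13   # $2.60 / 20M tokens
LLM_PRICE_PER_M_TOKENS = 1.50     # ~$90 / 60M tokens
VECTOR_DB_MONTHLY = 25.00
REEMBED_WEEKLY = 0.65

def estimate_costs(docs: int = 10_000, tokens_per_doc: int = 2_000,
                   queries_per_day: int = 1_000, context_tokens: int = 2_000,
                   days: int = 30, weeks: int = 4) -> dict[str, float]:
    """Break the estimate into one-time and recurring monthly costs (USD)."""
    embed_once = docs * tokens_per_doc / 1e6 * EMBED_PRICE_PER_M_TOKENS
    llm = queries_per_day * context_tokens * days / 1e6 * LLM_PRICE_PER_M_TOKENS
    reembed = REEMBED_WEEKLY * weeks
    recurring = VECTOR_DB_MONTHLY + llm + reembed
    return {
        "embedding_one_time": embed_once,
        "llm_inference": llm,
        "re_embedding": reembed,
        "recurring_monthly": recurring,
        "first_month_total": recurring + embed_once,
    }

costs = estimate_costs()
for item, usd in costs.items():
    print(f"{item}: ${usd:.2f}")
```

With these inputs the recurring cost is about $117.60 per month, and the first month, including the one-time embedding, comes to about $120.20. LLM inference dominates, so context size per query is the main lever.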
