Retrieval-Augmented Generation (RAG) bridges the gap between general-purpose LLMs and your organization's proprietary knowledge. Instead of fine-tuning a model on your data (expensive, slow, goes stale), RAG retrieves relevant context at query time and feeds it to the LLM.
How RAG Works
User Query → Embedding → Vector Search → Top-K Documents → LLM + Context → Answer
- User asks a question: "What is our SLA for tier-1 customers?"
- Query embedding: Convert the question to a vector (for example, 1536 dimensions with OpenAI text-embedding-3-small, or 384 to 1024 with common open models)
- Vector search: Find the most similar document chunks in the vector database
- Context assembly: Combine the top-K results into a prompt
- LLM generation: The model answers from the retrieved context rather than relying only on its training data
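The steps above can be sketched end to end in a few lines. This is a toy illustration, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the sample chunks (including the SLA figures) are invented for the example. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Vector search: rank chunks by similarity to the query, keep the top_k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Context assembly: put the retrieved chunks in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {query}"
    )

# Invented sample data, for illustration only.
chunks = [
    "Tier-1 customers have a 99.9% uptime SLA with a 1-hour response time.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "Tier-2 customers have a 99.5% uptime SLA.",
]
query = "What is our SLA for tier-1 customers?"
top_chunks = retrieve(query, chunks, top_k=2)
prompt = build_prompt(query, top_chunks)
# The prompt would then be sent to whichever LLM client you use.
```

Swapping in a real embedding model and a vector store changes `embed` and `retrieve`, but the shape of the flow stays the same.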
Architecture Components
Document Ingestion Pipeline
Raw Documents → Chunking → Embedding → Vector Store
- Raw documents: PDF, Docs, Confluence, Slack, Git
- Chunking: Split into overlapping chunks (512-1024 tokens)
- Embedding: Convert to vectors via an embedding model
- Vector store: Store in Qdrant, Weaviate, Milvus, or pgvector
Chunking Strategy
The most important decision in RAG. Poor chunking = poor retrieval = poor answers.
# Bad: Fixed-size chunks break mid-sentence
chunks = [text[i:i+500] for i in range(0, len(text), 500)]
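# Better: Recursive splitting on natural boundaries, with overlap
# Note: chunk_size and chunk_overlap are counted in characters by default, not tokens
# (newer LangChain releases also ship this class in the langchain_text_splitters package)
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    # Try paragraph breaks first, then line breaks, then sentences, then words
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(text)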
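Going one step further, chunking can follow the document's own structure. Below is a minimal sketch, assuming the source is Markdown with `#`-style headings; the function name `split_by_headings` and the `max_chars` budget are hypothetical. It starts a new chunk at every heading and only splits an oversized section on blank lines, so a table (which contains no blank lines) stays in one piece.

```python
import re

def split_by_headings(markdown_text: str, max_chars: int = 2000) -> list[dict]:
    """Split Markdown into chunks that follow its heading structure.

    Each chunk records the heading it belongs to. Sections longer than
    max_chars are split on blank lines only; a single oversized paragraph
    or table is kept whole rather than cut in the middle.
    """
    chunks: list[dict] = []
    heading = ""
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if not text:
            return
        current = ""
        # Pack whole paragraphs into a chunk until the size budget is reached.
        for para in re.split(r"\n\s*\n", text):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"heading": heading, "text": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        chunks.append({"heading": heading, "text": current})

    for line in markdown_text.splitlines():
        if re.match(r"#{1,6}\s", line):
            flush()  # close the previous section before starting a new one
            heading = line.lstrip("#").strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks

# Invented sample document, for illustration only.
sample = (
    "# Intro\n"
    "RAG retrieves context at query time.\n\n"
    "## SLA\n"
    "| Tier | Uptime |\n"
    "|---|---|\n"
    "| 1 | 99.9% |\n"
)
sections = split_by_headings(sample)
```

One known limitation of this sketch: a line starting with `#` inside a fenced code block would be mistaken for a heading, so real documents may need a proper Markdown parser.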
# Best: Document-structure-aware chunking
# Split on headings, preserve sections, keep tables intact
Vector Database Selection
| Database | Type | Strengths | Weaknesses |
|---|---|---|---|
| pgvector | Extension | Uses existing Postgres, simple | Limited scale |
| Qdrant | Purpose-built | Fast, rich filtering | Another service to manage |
| Weaviate | Purpose-built | Multi-modal, GraphQL API | Complex setup |
| Milvus | Purpose-built | Massive scale, GPU-accelerated | Operational overhead |
| Pinecone | Managed SaaS | Zero ops, fast | Vendor lock-in, cost at scale |
| ChromaDB | Embedded | Simple, great for prototyping | Not for production scale |
Embedding Models
| Model | Dimensions | Hosting / speed | Quality |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | API latency | Excellent |
| OpenAI text-embedding-3-small | 1536 | API latency | Very good |
| BGE-large-en-v1.5 | 1024 | Self-hosted | Very good |
| E5-large-v2 | 1024 | Self-hosted | Good |
| all-MiniLM-L6-v2 | 384 | Self-hosted, very fast | Acceptable |
For enterprise use, self-hosted models (BGE, E5) avoid sending data to external APIs.
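Dimension count also drives storage. As a rough worked example, assuming vectors are stored as 32-bit floats and ignoring index overhead and metadata (both of which vary by database), the raw footprint per million chunks can be estimated like this. The helper name `index_size_gb` is hypothetical.

```python
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in gigabytes (10**9 bytes), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9

# Dimensions taken from the table above.
models = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "BGE-large-en-v1.5": 1024,
    "all-MiniLM-L6-v2": 384,
}
for name, dims in models.items():
    print(f"{name}: {index_size_gb(1_000_000, dims):.2f} GB per million chunks")
```

Under these assumptions, a million chunks take roughly 12.3 GB at 3072 dimensions and about 1.5 GB at 384, which is worth weighing against the quality column.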
Advanced RAG Patterns
Hybrid Search
Combine vector search with keyword search for better retrieval:
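One common way to do the combining is Reciprocal Rank Fusion (RRF), which scores each document by summing 1 / (k + rank) over every result list it appears in. RRF is not a built-in of any particular vector database, so the helper below is a minimal sketch. It assumes each result list is an ordered list of document IDs (best first) and uses the conventional constant k = 60.

```python
from collections import defaultdict

def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each appearance scores 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Invented document IDs, for illustration only.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion(vector_hits, keyword_hits)
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

Documents that rank well in both lists rise to the top, and because only ranks are used, the two search backends do not need comparable score scales. In a pipeline, the two input lists would come from the vector index and a BM25 index, along these lines: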
# Vector search finds semantically similar documents
vector_results = vector_db.search(query_embedding, top_k=10)
# BM25 keyword search finds exact term matches
keyword_results = bm25_index.search(query_text, top_k=10)
# Reciprocal Rank Fusion combines both
final_results = reciprocal_rank_fusion(vector_results, keyword_results)
Multi-Query RAG
Generate multiple search queries from a single user question:
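The retrieval half of this pattern, running every generated query and deduplicating the hits, can be sketched as follows. This assumes a hypothetical `search` callable that returns `(doc_id, score)` pairs with higher scores being better; `multi_query_search` and the sample index are illustrative names and data, not part of any library.

```python
from typing import Callable

def multi_query_search(
    queries: list[str],
    search: Callable[[str], list[tuple[str, float]]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Run every query, keep each document's best score, return the top_k."""
    best: dict[str, float] = {}
    for q in queries:
        for doc_id, score in search(q):
            # Deduplicate: a document found by several queries is kept once.
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Hypothetical stand-in for a vector store lookup, for demonstration only.
fake_index = {
    "authentication system architecture": [("auth-overview", 0.91), ("sso-guide", 0.70)],
    "user session management": [("sessions", 0.88), ("auth-overview", 0.64)],
}
results = multi_query_search(list(fake_index), lambda q: fake_index[q], top_k=3)
```

The query-generation half is usually a single LLM call that rewrites the user's question; its output might look like this: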
# User asks: "How does our authentication work?"
# Generate search queries:
queries = [
    "authentication system architecture",
    "login flow and OAuth implementation",
    "user session management",
    "SSO and identity provider integration",
]
# Search with all queries, deduplicate results
Parent Document Retrieval
Store small chunks for precise retrieval, but return the full parent document for context:
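A minimal sketch of the lookup step, assuming each indexed chunk carries the ID of the section it was cut from and that parent sections are kept in a separate key-value store (a plain dict here). The `Chunk` class and `expand_to_parents` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    parent_id: str  # ID of the section this chunk was cut from
    text: str

def expand_to_parents(
    hits: list[Chunk], parents: dict[str, str], max_parents: int = 3
) -> list[str]:
    """Map ranked child chunks to their parent sections, without duplicates."""
    seen: set[str] = set()
    sections: list[str] = []
    for hit in hits:  # hits are assumed to be ordered best-first
        if hit.parent_id in seen:
            continue  # several chunks from one section yield it only once
        seen.add(hit.parent_id)
        sections.append(parents[hit.parent_id])
        if len(sections) == max_parents:
            break
    return sections

# Invented sample data, for illustration only.
parents = {"sec-1": "Full text of section 1 ...", "sec-2": "Full text of section 2 ..."}
hits = [
    Chunk("c1", "sec-1", "small chunk"),
    Chunk("c2", "sec-1", "another small chunk"),
    Chunk("c3", "sec-2", "small chunk"),
]
context = expand_to_parents(hits, parents)
```

Capping the number of parents matters because parent sections are much larger than the chunks that matched, so the prompt can grow quickly.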
# Index small chunks (256 tokens) for precise matching
# But return the parent section (2000 tokens) for full context
# This gives the LLM enough surrounding context to answer well
Production Considerations
- Freshness: Schedule re-ingestion for frequently updated sources (Confluence, Git)
- Access control: Filter results based on user permissions; do not let RAG bypass document ACLs
- Evaluation: Measure retrieval quality (recall@k) and answer quality (faithfulness, relevance)
- Caching: Cache common query embeddings and their results
- Monitoring: Track retrieval latency, cache hit rate, LLM token usage, and answer quality scores
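Two of the points above, access control and retrieval evaluation, are easy to illustrate. The sketch below assumes each hit carries an `allowed_groups` metadata field copied from the source system's ACL at ingestion time, and that you have a small labeled set of relevant document IDs per test query. The field name, function names, and sample data are all hypothetical.

```python
def filter_by_acl(hits: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only hits whose allowed_groups overlap the user's groups."""
    return [h for h in hits if user_groups & set(h["allowed_groups"])]

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Invented sample data, for illustration only.
hits = [
    {"id": "hr-policy", "allowed_groups": ["hr"]},
    {"id": "sla-doc", "allowed_groups": ["support", "sales"]},
    {"id": "runbook", "allowed_groups": ["engineering", "support"]},
]
visible = filter_by_acl(hits, user_groups={"support"})
score = recall_at_k(
    [h["id"] for h in visible],
    relevant={"sla-doc", "runbook", "faq"},
    k=2,
)
```

Filtering after retrieval, as done here, can leave fewer than K usable results. Where the vector database supports metadata filters, applying the permission check inside the query itself avoids that.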
Cost Estimation
Monthly RAG cost for 10K documents, 1K queries/day:
Embedding (one-time): 10K docs × avg 2K tokens = 20M tokens → $2.60
Vector DB: Qdrant cloud small instance → $25/month
LLM inference: 1K queries × 2K context tokens × 30 days = 60M tokens → ~$90/month
Re-embedding (weekly): $0.65/week → $2.60/month
Total: ~$120/month for 30K queries
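The arithmetic behind this estimate can be reproduced with a small script, which makes it easy to plug in your own volumes. The per-token rates below are not quoted prices: they are simply the rates implied by the figures above ($2.60 for 20M embedding tokens, $90 for 60M LLM tokens). The 25% weekly refresh fraction is likewise an assumption chosen because it reproduces the $0.65/week figure.

```python
EMBED_PRICE_PER_M = 0.13   # USD per 1M tokens, implied by $2.60 / 20M tokens
LLM_PRICE_PER_M = 1.50     # USD per 1M tokens, implied by $90 / 60M tokens
VECTOR_DB_MONTHLY = 25.00  # USD per month, taken from the estimate above

docs, tokens_per_doc = 10_000, 2_000
queries_per_day, context_tokens, days = 1_000, 2_000, 30
weekly_refresh_fraction = 0.25  # assumption that reproduces $0.65/week

# One-time cost of embedding the whole corpus.
initial_embedding = docs * tokens_per_doc / 1e6 * EMBED_PRICE_PER_M
# Recurring costs.
llm_monthly = queries_per_day * context_tokens * days / 1e6 * LLM_PRICE_PER_M
reembed_monthly = initial_embedding * weekly_refresh_fraction * 4
recurring = VECTOR_DB_MONTHLY + llm_monthly + reembed_monthly
first_month = recurring + initial_embedding

print(f"One-time embedding: ${initial_embedding:.2f}")
print(f"Recurring per month: ${recurring:.2f}")
print(f"First month total: ${first_month:.2f}")
```

With these inputs the recurring cost comes to $117.60 per month and the first month to $120.20, which matches the rounded total. Note that the LLM line counts only context tokens, so question and answer tokens would add to it.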