Everyone says context matters. Here are the numbers.
There is a lot of talk about "context architecture" for AI agents. But how much does context actually matter? And which context moves the needle?
A team ran a rigorous experiment with a real AI analytics agent built for a healthcare company operating multiple clinics. No synthetic datasets. Real user questions. Same LLM, same data, zero prompt engineering tricks. They added context one layer at a time, measuring accuracy after every change.
The results are striking, and they validate something that experienced data engineers already know intuitively.
The experiment: 6 iterations
| Iteration | What Changed | SQL Generation | Accuracy |
|---|---|---|---|
| 1 | Raw tables only | 0% | 0% |
| 2 | Modeled table (no context) | 38.5% | 0% |
| 3 | Column descriptions added | 100% | 15% |
| 4 | Business rules and instructions | 100% | 77% |
| 5-6 | Metrics, verified queries, eval refinement | 100% | 92% |
Let that sink in. The same LLM went from 0% accuracy to 92% accuracy: not by switching to a better model, not by clever prompt engineering, but by progressively adding the context that a good analyst accumulates over months on the job.
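To make the size of each step explicit, the short sketch below computes the gain in percentage points contributed by each iteration. The figures are copied from the results table; the variable and function names are only illustrative.

```python
# Accuracy after each iteration, copied from the results table.
accuracy_by_iteration = [
    ("Raw tables only", 0.0),
    ("Modeled table (no context)", 0.0),
    ("Column descriptions added", 15.0),
    ("Business rules and instructions", 77.0),
    ("Metrics, verified queries, eval refinement", 92.0),
]


def accuracy_gains(rows):
    """Return (label, gain in percentage points) for every step after the first."""
    gains = []
    for (_, previous), (label, current) in zip(rows, rows[1:]):
        gains.append((label, current - previous))
    return gains


gains = accuracy_gains(accuracy_by_iteration)
for label, gain in gains:
    print(f"{label}: +{gain:.0f} points")
```

The gains sum to the full 92 points, and no step after the modeled table is free: every layer of context adds something.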
The boring stuff had the biggest impact
The single most impactful change was not some sophisticated RAG pipeline or multi-agent orchestration. It was column descriptions and a clean data model.
Going from raw tables to a properly modeled table with column descriptions took SQL generation from 0% to 100%. From that point on, the agent could always generate SQL that ran. What it could not yet do was generate SQL that gave trustworthy results. Accuracy at that stage was only 15%, and it climbed only as the agent learned what the columns actually meant in business terms.
This should not be surprising to anyone who has onboarded a new analyst. You do not hand them raw database access and say "figure it out." You give them:
- A data dictionary explaining what each column means
- Business rules about how metrics are calculated
- Context about edge cases and data quirks
- Verified queries they can use as reference
The AI agent needs exactly the same thing.
Context layers ranked by impact
Based on the experiment results and my experience deploying AI agents against enterprise data, here is how I rank the context layers:
Tier 1: Foundation (0% → 15% accuracy)
Data modeling and column descriptions. This is the single most important investment. A well-modeled table with clear column names and descriptions gives the LLM enough to generate syntactically correct SQL that actually references the right data.
Without this, the agent is guessing. It might generate valid SQL, but against the wrong columns, with wrong join conditions, and wrong aggregation logic.
What to include:
```yaml
# Example column description format
tables:
  - name: appointments
    description: "Patient appointments across all clinic locations"
    columns:
      - name: appointment_date
        description: "Date of the appointment (UTC). NULL for cancelled appointments that were never rescheduled."
        type: date
      - name: provider_id
        description: "Foreign key to providers table. Maps to the attending physician, not the referring physician."
        type: integer
      - name: status
        description: "Current appointment status. Values: scheduled, completed, cancelled, no_show. Note: 'completed' means the patient was seen, not that billing is finalized."
        type: varchar
```

The specificity matters. "appointment_date" is not enough. "Date of the appointment (UTC). NULL for cancelled appointments that were never rescheduled" is the context that prevents wrong answers.
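As a rough illustration of how a data dictionary like this might reach the model, the sketch below renders one table's descriptions into plain text for a prompt and flags columns that lack a description. The dictionary layout mirrors the example format; `render_schema_context` and `undocumented_columns` are hypothetical helpers, not part of the experiment or of any library.

```python
# Hypothetical in-memory form of the data dictionary (two columns shown).
DATA_DICTIONARY = {
    "name": "appointments",
    "description": "Patient appointments across all clinic locations",
    "columns": [
        {
            "name": "appointment_date",
            "type": "date",
            "description": "Date of the appointment (UTC). NULL for cancelled "
                           "appointments that were never rescheduled.",
        },
        {
            "name": "provider_id",
            "type": "integer",
            "description": "Foreign key to providers table. Maps to the attending "
                           "physician, not the referring physician.",
        },
    ],
}


def render_schema_context(table: dict) -> str:
    """Format one table's descriptions as plain text for an LLM prompt."""
    lines = [f"Table {table['name']}: {table['description']}"]
    for col in table["columns"]:
        lines.append(f"- {col['name']} ({col['type']}): {col['description']}")
    return "\n".join(lines)


def undocumented_columns(table: dict) -> list:
    """Names of columns whose description is missing or blank."""
    return [
        col["name"]
        for col in table["columns"]
        if not col.get("description", "").strip()
    ]


schema_context = render_schema_context(DATA_DICTIONARY)
print(schema_context)
```

A check like `undocumented_columns` can run in CI so that new columns do not reach the agent without a description.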
Tier 2: Business logic (15% → 77% accuracy)
Business rules and calculation instructions. This is where domain knowledge lives. Every organization has implicit rules that are not encoded in the schema:
- "Active patients" means patients with at least one completed appointment in the last 12 months
- Revenue calculations exclude write-offs and adjustments
- "New patient" is defined by the first completed appointment at any location, not per-clinic
- Clinic performance metrics use a rolling 90-day window, not calendar quarter
Without these rules, the agent will generate technically correct SQL that answers the wrong question. It will count all patients instead of active ones. It will include write-offs in revenue. It will define "new" differently than the business does.
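The gap between "valid SQL" and "the right answer" is easy to reproduce. The sketch below uses an in-memory SQLite table with made-up rows and a fixed reference date (both assumptions for illustration) and compares a schema-only count of patients with a count that applies the active-patient rule. SQLite's `date(?, '-12 months')` stands in for the `INTERVAL` syntax other databases use.

```python
import sqlite3

AS_OF = "2024-06-30"  # fixed reference date so the example is reproducible

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appointments (patient_id INTEGER, appointment_date TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO appointments VALUES (?, ?, ?)",
    [
        (1, "2024-05-10", "completed"),  # seen within the last 12 months
        (1, "2023-11-02", "completed"),
        (2, "2022-01-15", "completed"),  # last visit more than 12 months ago
        (3, "2024-06-01", "cancelled"),  # booked but never seen
    ],
)

# Schema-only reading: every patient in the table counts.
naive_sql = "SELECT COUNT(DISTINCT patient_id) FROM appointments"

# With the business rule: a completed appointment in the trailing 12 months.
rule_sql = """
    SELECT COUNT(DISTINCT patient_id)
    FROM appointments
    WHERE status = 'completed'
      AND appointment_date >= date(?, '-12 months')
"""

naive_count = conn.execute(naive_sql).fetchone()[0]
active_count = conn.execute(rule_sql, (AS_OF,)).fetchone()[0]
print(f"all patients: {naive_count}, active patients: {active_count}")
```

Both queries run without error; only the second one answers the question the business is actually asking.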
```yaml
business_rules:
  - rule: "Active patient definition"
    description: "A patient is considered active if they have at least one completed appointment in the trailing 12 months from the query date."
    sql_hint: "WHERE status = 'completed' AND appointment_date >= CURRENT_DATE - INTERVAL '12 months'"
  - rule: "Revenue calculation"
    description: "Revenue = sum of payment_amount where payment_status = 'posted'. Exclude adjustments (type = 'adjustment') and write-offs (type = 'writeoff')."
    sql_hint: "SUM(payment_amount) WHERE payment_status = 'posted' AND type NOT IN ('adjustment', 'writeoff')"
```

Tier 3: Verification (77% → 92% accuracy)
Verified queries, metrics definitions, and evaluation refinement. This is the layer that turns a decent agent into a reliable one:
- Golden queries: known-correct SQL for common questions, used as few-shot examples
- Metric definitions: exact formulas for KPIs with test cases
- Edge case documentation: what happens with NULL values, timezone boundaries, fiscal year vs calendar year
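One simple way to use golden queries as few-shot examples is to pick the verified questions most similar to the incoming one. The sketch below ranks them by word overlap (Jaccard similarity), a deliberately basic stand-in for whatever retrieval a production agent would use, such as embeddings. The questions and SQL placeholders in `GOLDEN_QUERIES` are invented for illustration.

```python
import re

# Hypothetical store of verified question/SQL pairs (SQL bodies are placeholders).
GOLDEN_QUERIES = [
    {"question": "How many new patients did we see last month?",
     "sql": "-- verified SQL for new patients"},
    {"question": "What was total revenue by clinic last quarter?",
     "sql": "-- verified SQL for revenue by clinic"},
    {"question": "Which providers had the most no-show appointments?",
     "sql": "-- verified SQL for no-shows by provider"},
]


def tokens(text: str) -> set:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Share of words the two sets have in common (0.0 when either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_few_shot(question: str, golden: list, k: int = 2) -> list:
    """Return the k verified queries whose questions best overlap with `question`."""
    q = tokens(question)
    ranked = sorted(
        golden,
        key=lambda g: jaccard(q, tokens(g["question"])),
        reverse=True,
    )
    return ranked[:k]


examples = select_few_shot("How many new patients came in last week?", GOLDEN_QUERIES, k=1)
print(examples[0]["question"])
```

The selected pairs would then be placed in the prompt ahead of the user's question, so the model sees how similar questions were answered correctly.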
```yaml
verified_queries:
  - question: "How many new patients did we see last month?"
    sql: |
      SELECT COUNT(DISTINCT patient_id)
      FROM appointments
      WHERE status = 'completed'
        AND appointment_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
        AND appointment_date < DATE_TRUNC('month', CURRENT_DATE)
        AND patient_id NOT IN (
          SELECT DISTINCT patient_id
          FROM appointments
          WHERE status = 'completed'
            AND appointment_date < DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
        )
    expected_result_range: "Typically 50-200 per clinic per month"
    notes: "New = first completed appointment ever, not first at a specific clinic"
```

What this means for agent builders
1. Invest in data foundations before agent sophistication
The biggest accuracy gains came from the "boring" work: clean data models, column descriptions, and documented business rules. If you are building an AI agent that queries data, spend 80% of your effort on the data layer and 20% on the agent layer.
2. Context is not just retrieval
RAG gets all the attention, but this experiment shows that structured context (schemas, rules, verified queries) matters more than unstructured document retrieval for data agents. The context needs to be precise, not just relevant.
3. Model choice is secondary to context quality
The experiment used the same LLM throughout. The accuracy difference came entirely from context. Upgrading from GPT-4 to GPT-5 will not fix an agent that does not understand your data model. Fixing your column descriptions will.
4. Build context iteratively
Do not try to capture all business rules on day one. The experiment shows clear returns at each layer:
- Week 1: Model your tables and write column descriptions
- Week 2: Document the top 10 business rules that affect query results
- Week 3: Create verified queries for the 20 most common questions
- Week 4: Run evaluations and refine based on failure cases
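The week 4 evaluation step can start very small. The sketch below scores an agent on the two axes reported in the experiment's table: whether the generated SQL runs, and whether it returns the same rows as a verified query. It assumes a SQLite database and a `generate_sql` callable that wraps the agent; `stub_agent`, the sample table, and the test cases are hypothetical stand-ins, not the experiment's actual harness.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows in a stable order, or None if the SQL fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def evaluate(conn, cases, generate_sql):
    """Score an agent on SQL that runs and SQL that matches the verified result."""
    ran = correct = 0
    failures = []
    for case in cases:
        expected = run_query(conn, case["verified_sql"])
        actual = run_query(conn, generate_sql(case["question"]))
        if actual is not None:
            ran += 1
        if actual is not None and actual == expected:
            correct += 1
        else:
            failures.append(case["question"])
    total = len(cases)
    return {
        "sql_generation": ran / total,
        "accuracy": correct / total,
        "failures": failures,
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appointments (patient_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO appointments VALUES (?, ?)",
    [(1, "completed"), (2, "cancelled"), (3, "completed")],
)

cases = [
    {"question": "How many patients were seen?",
     "verified_sql": "SELECT COUNT(DISTINCT patient_id) FROM appointments "
                     "WHERE status = 'completed'"},
    {"question": "How many appointments were cancelled?",
     "verified_sql": "SELECT COUNT(*) FROM appointments WHERE status = 'cancelled'"},
]


def stub_agent(question):
    """Stand-in for the real agent: handles 'cancelled', otherwise counts every row."""
    if "cancelled" in question:
        return "SELECT COUNT(*) FROM appointments WHERE status = 'cancelled'"
    return "SELECT COUNT(*) FROM appointments"


report = evaluate(conn, cases, stub_agent)
print(report)
```

The `failures` list is the useful output: each failing question points at a missing column description, business rule, or verified query to add in the next iteration.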
5. This applies beyond SQL agents
The same principle applies to any agent that operates on domain-specific data:
- Infrastructure agents need context about your environment topology, naming conventions, and runbook procedures
- Code review agents need context about your team's coding standards, architecture decisions, and tech debt areas
- Customer support agents need context about product features, known issues, and escalation rules
The pattern is universal: the agent needs the context that an experienced human accumulates over time.
The context architecture stack
Based on these results and production deployments, here is the context architecture I recommend:
```
┌─────────────────────────────────────┐
│  Verified Queries                   │  ← Few-shot examples
│  (Golden SQLs)                      │  77% → 92%
├─────────────────────────────────────┤
│  Business Rules                     │  ← Domain logic
│  (Metrics, Definitions)             │  15% → 77%
├─────────────────────────────────────┤
│  Column Descriptions                │  ← Schema context
│  (Data Dictionary)                  │  0% → 15%
├─────────────────────────────────────┤
│  Data Model                         │  ← Clean tables
│  (Normalized, Well-Named)           │  Foundation
└─────────────────────────────────────┘
```

Each layer multiplies the value of the layers below it. Skip the foundation and no amount of business rules will help. Skip the business rules and your verified queries will not generalize to new questions.
Key takeaway
Context was the difference between 0% and 92% accuracy. Not the model. Not the prompt. Not the agent framework.
The "boring" data engineering work (modeling tables, writing column descriptions, documenting business rules) had more impact on agent accuracy than any other factor.
If you are building AI agents that need to work with real data, stop optimizing your prompts and start investing in your context layers.