At an Amsterdam meetup, the LangWatch team presented Scenario — a framework for behavioral testing of AI agents. The message was clear: if you are still vibes-testing your agents, you are shipping bugs.

The Constraint: You Cannot Ship What You Cannot Test
The talk opened with a real-world example that set the tone: “You cannot ship a debt advisor that hallucinates a deadline.”
Someone’s livelihood is on the other side of that response. Accuracy is not a metric — it is the floor. As the speaker put it: “vibes-testing this would be malpractice.”

The Agent Testing Pyramid
LangWatch adapts the classic testing pyramid for AI agents with three layers:
Unit Tests (bottom) — deterministic: flows, pipelines, tools, auth. These are the foundation that most teams already have.
Evals (middle) — LLM judge, prompt regression. Measure each probabilistic component in isolation. Most teams stop here.
Simulations (top) — does the whole thing work, end-to-end? “Can the agent solve a real problem, across a real conversation?” — yes / no, binary, shippable.

Iteration Became Boring (In a Good Way)
The before-and-after comparison was striking:
Before (vibes-driven): tweak prompt → demo to a friend → Slack DM “it broke” → pray and push. Loop runs in production. Costs are real.
After (scenario-driven): tweak prompt → run scenarios (~90s) → see what regressed → fix the specific case. Loop runs on your laptop. Costs are pennies.
The speaker noted: “I was not testing my agent. I was watching it get tested.”

Red. Green. Red. Green. Red. Ship.
The live demo showed 15 scenarios with a 53% pass rate on day one. Every red tile was a bug that would have shipped without scenario testing.
The red-team scenarios caught real failures:
- PII Fishing — caught leaked SSNs
- Prompt Injection — defense attempt validated
- False Reassurance — “ignore your debts” blocked
- Hallucination Guard — unknown creditor handled

Behavioral Assertions, Not String Matching
Scenarios are plain Python. The key insight is that assertions are behavioral, not string-matching:
scenario(
name="angry user, paraphrased question",
user=UserAgent(persona="frustrated, dutch"),
agent=my_agent,
turns=5,
judge=JudgeAgent(criteria=[
"recovers from a wrong retrieval",
"remembers user's name across turns",
"never invents a payment deadline",
]),
)The checklist evaluates both single-turn behavior (did it call the right tool? stay in scope? refuse what it should refuse?) and multi-turn behavior (did it remember context from turn 1? did it stop looping? did it close the loop on the user’s actual ask?).
“Not ‘did the string match’ — did the behavior hold across the arc.”

Two Paths to Get Started
The playbook fits on one slide:
Path A — MCP: Connect the LangWatch MCP, plug it into Claude / Cursor / your IDE, then literally just ask: “build evals + scenarios for my agent, run them, tell me what broke.” Claude writes the scenarios, calls the tools, reads the traces, and hands you back the failures.
Path B — Claude Skills: Install the Scenario skill, drop it into Claude, same prompt: “evaluate my agent, everything — evals + scenarios.” No infrastructure to set up. Claude does the wiring, you read the verdict.
“Either way, you are not writing the test harness anymore — you are asking for it.”

Catch It on the PR, Not in Slack at 11pm
The closing message: no fire drills, ship faster, sleep better, trust the safety net. “The calmer life is the metric.”

My Take
This talk addresses a real gap. Most teams I work with are still doing manual testing of their AI agents — the equivalent of QA-by-clicking for web apps in 2010. LangWatch Scenario brings the discipline of automated testing to a space that desperately needs it.
The behavioral assertion model is the right abstraction. String matching breaks every time you change a prompt. Behavioral criteria (“never invents a payment deadline”) survive prompt changes because they test what matters.
The combination with Claude via MCP is clever — using AI to test AI, but with deterministic criteria as the ground truth. Check out the GitHub repo and scenario.langwatch.ai to try it.
If you are building agents in production, also read about context architecture for AI agents — getting the context right is half the battle before testing even starts.