LangWatch Scenario: Stop Vibes-Testing Your AI Agents

At an Amsterdam meetup, the LangWatch team presented Scenario — a framework for behavioral testing of AI agents. The message was clear: if you are still vibes-testing your agents, you are shipping bugs.

LangWatch Scenario meetup panel with three speakers in Amsterdam

The Constraint: You Cannot Ship What You Cannot Test

The talk opened with a real-world example that set the tone: “You cannot ship a debt advisor that hallucinates a deadline.”

Someone’s livelihood is on the other side of that response. Accuracy is not a metric — it is the floor. As the speaker put it: “vibes-testing this would be malpractice.”

The Agent Testing Pyramid

LangWatch adapts the classic testing pyramid for AI agents with three layers:

Unit Tests (bottom) — deterministic: flows, pipelines, tools, auth. These are the foundation that most teams already have.

Evals (middle) — LLM judge, prompt regression. Measure each probabilistic component in isolation. Most teams stop here.

Simulations (top) — does the whole thing work, end-to-end? “Can the agent solve a real problem, across a real conversation?” — yes / no, binary, shippable.

Agent Testing Stack: three layers stacked — Unit Tests, Evals, Simulations

Iteration Became Boring (In a Good Way)

The before-and-after comparison was striking:

Before (vibes-driven): tweak prompt → demo to a friend → Slack DM “it broke” → pray and push. Loop runs in production. Costs are real.

After (scenario-driven): tweak prompt → run scenarios (~90s) → see what regressed → fix the specific case. Loop runs on your laptop. Costs are pennies.

The speaker noted: “I was not testing my agent. I was watching it get tested.”

Iteration became boring: vibes-driven vs scenario-driven development

Red. Green. Red. Green. Red. Ship.

The live demo showed 15 scenarios with a 53% pass rate on day one. Every red tile was a bug that would have shipped without scenario testing.

The red-team scenarios caught real failures:

PII Fishing — caught leaked SSNs
Prompt Injection — defense attempt validated
False Reassurance — “ignore your debts” blocked
Hallucination Guard — unknown creditor handled

Red green red green red ship: 15 scenarios with red-team failures caught

Behavioral Assertions, Not String Matching

Scenarios are plain Python. The key insight is that assertions are behavioral, not string-matching:

scenario(
    name="angry user, paraphrased question",
    user=UserAgent(persona="frustrated, dutch"),
    agent=my_agent,
    turns=5,
    judge=JudgeAgent(criteria=[
        "recovers from a wrong retrieval",
        "remembers user's name across turns",
        "never invents a payment deadline",
    ]),
)

The checklist evaluates both single-turn behavior (did it call the right tool? stay in scope? refuse what it should refuse?) and multi-turn behavior (did it remember context from turn 1? did it stop looping? did it close the loop on the user’s actual ask?).

“Not ‘did the string match’ — did the behavior hold across the arc.”

Scenario code example with behavioral assertions and JudgeAgent criteria

Two Paths to Get Started

The playbook fits on one slide:

Path A — MCP: Connect the LangWatch MCP, plug it into Claude / Cursor / your IDE, then literally just ask: “build evals + scenarios for my agent, run them, tell me what broke.” Claude writes the scenarios, calls the tools, reads the traces, and hands you back the failures.

Path B — Claude Skills: Install the Scenario skill, drop it into Claude, same prompt: “evaluate my agent, everything — evals + scenarios.” No infrastructure to set up. Claude does the wiring, you read the verdict.

“Either way, you are not writing the test harness anymore — you are asking for it.”

Two paths: LangWatch MCP or Claude Skills for agent testing

Catch It on the PR, Not in Slack at 11pm

The closing message: no fire drills, ship faster, sleep better, trust the safety net. “The calmer life is the metric.”

My Take

This talk addresses a real gap. Most teams I work with are still doing manual testing of their AI agents — the equivalent of QA-by-clicking for web apps in 2010. LangWatch Scenario brings the discipline of automated testing to a space that desperately needs it.

The behavioral assertion model is the right abstraction. String matching breaks every time you change a prompt. Behavioral criteria (“never invents a payment deadline”) survive prompt changes because they test what matters.

The combination with Claude via MCP is clever — using AI to test AI, but with deterministic criteria as the ground truth. Check out the GitHub repo and scenario.langwatch.ai to try it.

If you are building agents in production, also read about context architecture for AI agents — getting the context right is half the battle before testing even starts.

LangWatch Scenario: Stop Vibes-Testing Your AI Agents

The Constraint: You Cannot Ship What You Cannot Test

The Agent Testing Pyramid

Iteration Became Boring (In a Good Way)

Red. Green. Red. Green. Red. Ship.

Behavioral Assertions, Not String Matching

Two Paths to Get Started

Catch It on the PR, Not in Slack at 11pm

My Take

Related Articles

AI Is Making the Biggest Platforms Even Bigger

Embodied AI Infrastructure for the Physical World

Is Your Website Ready for AI Agents?

AI Governance in Practice: Findings Remediation and Agent Identity