What are Agent Simulations?

Agent simulations are a powerful approach to testing AI agents that goes beyond traditional evaluation methods. Unlike static input-output testing, simulations test your agent’s behavior in realistic, multi-turn conversations that mimic how real users would interact with your system.

The Three Levels of Agent Quality

For comprehensive agent testing, you need all three levels:

  • Level 1: Unit tests
    Traditional unit and integration tests that guarantee the agent's building blocks, such as its tools, work correctly from a software engineering standpoint (see the sketch after this list)

  • Level 2: Evals, Fine-tuning and Prompt Optimization
    Measuring the performance of individual non-deterministic components of the agent, for example maximizing RAG accuracy with evals, or approximating human preference with GRPO

  • Level 3: Agent Simulations
    End-to-end testing of the agent across a wide range of scenarios and edge cases, guaranteeing the whole agent achieves more than the sum of its parts
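
For a sense of what Level 1 looks like in practice, here is a minimal pytest sketch; my_agent.tools and calculate_shipping_cost are hypothetical names standing in for one of your agent's deterministic tools:

# test_tools.py -- Level 1: a plain unit test for one agent tool.
# calculate_shipping_cost is a hypothetical deterministic tool in your codebase.
from my_agent.tools import calculate_shipping_cost

def test_free_shipping_above_threshold():
    # orders above the free-shipping threshold should cost nothing to ship
    assert calculate_shipping_cost(order_total=120.0, country="US") == 0.0

def test_flat_rate_below_threshold():
    # smaller orders pay the flat domestic rate
    assert calculate_shipping_cost(order_total=20.0, country="US") == 5.99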

Simulations complement evaluations by testing the agent as a whole system rather than isolated parts.

Why Traditional Evaluation Isn’t Enough for Agents

Most evaluations are dataset-based, built on a static set of cases. Those datasets are hard to get, especially when you are just getting started: they usually need a large number of examples to be valuable, and each example needs an expected answer. More than anything, they are static, mapping input to output, or query to expected_contexts.

Agents, however, aren’t simple input-output functions. They are processes. An agent behaves like a program, executing a sequence of operations, using tools, and maintaining state.
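
To make the "agent as a process" point concrete, here is a minimal, framework-agnostic sketch of a tool-calling loop; the llm client object and its response shape are assumptions for illustration, not any specific SDK:

# A generic agent loop: the final answer depends on the whole trajectory of
# LLM calls, tool invocations and accumulated state, not on a single
# static input-output pair.
from typing import Callable

def run_agent(user_message: str, llm, tools: dict[str, Callable]) -> str:
    messages = [{"role": "user", "content": user_message}]   # conversation state
    while True:
        response = llm.complete(messages)                    # hypothetical LLM client
        if not response.tool_calls:
            return response.content                          # final answer for the user
        for call in response.tool_calls:                     # the agent decided to use tools
            result = tools[call.name](**call.arguments)
            messages.append({"role": "tool", "content": str(result)})  # feed result back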

Evaluation dataset (single input-output pairs):

| query | expected_answer |
| --- | --- |
| What is your refund policy? | We offer a 30-day money-back guarantee on all purchases. |
| How do I cancel my subscription? | You can cancel your subscription by logging into your account and clicking the “Cancel Subscription” button. |
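
In code, this style of evaluation reduces to scoring one static row at a time; a sketch, where agent_answer is a placeholder for your agent's single-shot call:

# Dataset-style evaluation: one input, one output, one score per row.
import csv

def exact_match(answer: str, expected: str) -> float:
    # simplest possible scorer; real setups use similarity metrics or an LLM judge
    return 1.0 if answer.strip() == expected.strip() else 0.0

with open("support_qa.csv") as f:             # columns: query, expected_answer
    for row in csv.DictReader(f):
        answer = agent_answer(row["query"])   # placeholder for your agent call
        print(row["query"], exact_match(answer, row["expected_answer"]))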

❌ Doesn’t consider the conversational flow
❌ Can’t specify how intermediate steps should be evaluated
❌ Hard to interpret and debug
❌ Ignores user experience aspects
❌ Hard to come up with a good dataset

Agent simulation (full multi-turn descriptions):

script=[
  scenario.user("hey I have a problem with my order"),
  scenario.agent(),
  expect_ticket_created(),
  expect_ticket_label("ecommerce"),
  scenario.user("i want a refund!"),
  scenario.agent(),
  expect_tool_call("search_policy"),
  scenario.user("this is ridiculous! let me talk to a human being"),
  scenario.agent(),
  expect_tool_call("escalate_to_human"),
]

✅ Describes the entire conversation
✅ Explicitly evaluates in-between steps
✅ Easy to interpret and debug
✅ Easy to replicate and reproduce an issue found in production
✅ Can run in autopilot for simulating a variety of inputs
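
The expect_* helpers in the script above are illustrative rather than built-ins. In the Scenario Python SDK, an intermediate check can be written as a plain callable that receives the scenario state; a minimal sketch, noting that the exact state-inspection methods may differ by SDK version:

import scenario

# Any callable placed in the script receives the current scenario state,
# so intermediate expectations can assert on tool calls, messages, etc.
def expect_tool_call(tool_name: str):
    def check(state: scenario.ScenarioState) -> None:
        assert state.has_tool_call(tool_name), f"expected a call to {tool_name}"
    return check

Callables like this are placed directly in the script passed to scenario.run, which is how the in-between expectations are evaluated at the right point of the conversation.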

This doesn’t mean you should stop doing evaluations. In fact, evaluations and simulations together compose your full agent test suite:

  • Use evaluations to test the smaller parts that compose the agent, where a more “machine learning” approach is required, for example to optimize a specific LLM call or a retrieval step.

  • Use simulation-based testing to prove the agent’s behavior is correct end-to-end, replicate specific edge cases, and guide your agent’s development without regressions.

Why Use LangWatch Scenario?

Scenario is the most advanced agent testing framework available. It provides:

  • Powerful simulations - Test real agent behavior by simulating users in different scenarios and edge cases
  • Flexible evaluations - Judge agent behavior at any point in a conversation, combine with evals, and test error recovery and complex workflows
  • Framework agnostic - Works with any AI agent framework
  • Simple integration - Just implement one call() method (see the sketch after this list)
  • Multi-language support - Python, TypeScript, and Go
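
As a rough sketch of that one-method integration, assuming the Scenario Python SDK's AgentAdapter interface; my_existing_agent is a placeholder for however you already invoke your agent:

import scenario

class MyAgentAdapter(scenario.AgentAdapter):
    # The single call() method: receives the conversation so far and
    # returns the agent's reply, regardless of which framework built it.
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # my_existing_agent is a placeholder for your own agent invocation
        return await my_existing_agent.respond(input.last_new_user_message_str())

The adapter is then passed to scenario.run together with a simulated user and a script like the one shown earlier.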

Visualizing Simulations in LangWatch

Once you’ve set up your agent tests with Scenario, LangWatch provides powerful visualization tools to:

  • Organize simulations into sets and batches
  • Debug agent behavior by stepping through conversations
  • Track performance over time with run history
  • Collaborate with your team on agent improvements

The rest of this documentation will show you how to use LangWatch’s simulation visualizer to get the most out of your agent testing.

Next Steps