In this cookbook, we’ll explore a more effective approach to evaluating multi-turn customer support agents. Traditional evaluation methods that use a single input-output pair are insufficient for agents that need to adapt their tool usage as conversations evolve. Instead, we’ll implement a simulation-based approach where an LLM evaluates our agent against specific success criteria.
The Problem with Traditional Evaluation
Traditional evaluation methods for customer support agents often use a dataset where:
- Input: Customer ticket/query
- Output: Expected sequence of tool calls
This approach has significant limitations:
- It assumes a fixed, predetermined path to resolution
- It doesn’t account for new information discovered during the conversation
- It focuses on the exact sequence of tools rather than achieving the desired outcome
A Better Approach: Simulation-Based Evaluation
Instead of predicting exact tool sequences, we’ll define success criteria that focus on what the agent must accomplish, regardless of the specific path taken. For example:
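Success criteria can be written as plain statements that an evaluator marks true or false. The criteria below are illustrative; define your own per scenario:

success_criteria = [
    "The agent looked up the status of the order before answering",
    "The agent correctly told the customer whether cancellation is possible",
    "The agent did not invent order details that are not in the database",
]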
This approach:
- Focuses on outcomes rather than specific steps
- Allows for multiple valid solution paths
- Better reflects real-world customer support scenarios
Requirements
Before we start, make sure you have the necessary packages installed:
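This cookbook uses the openai and langwatch packages:

pip install openai langwatch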
Let’s implement this simulation-based evaluation approach using mock tools for an e-commerce customer support scenario.
import json
from typing import Dict, Any, List, Tuple
from openai import AsyncOpenAI
import getpass
import langwatch
api_key = getpass.getpass("Enter your OpenAI API key: ")
client = AsyncOpenAI(api_key=api_key)
langwatch.login()
ORDERS_DB = {
"ORD123": {"status": "processing", "customer_id": "CUST456", "items": ["Product A", "Product B"]},
"ORD456": {"status": "shipped", "customer_id": "CUST789", "items": ["Product C"]},
"ORD789": {"status": "delivered", "customer_id": "CUST456", "items": ["Product D"]}
}
CUSTOMERS_DB = {
"CUST456": {"email": "[email protected]", "name": "John Doe"},
"CUST789": {"email": "[email protected]", "name": "Jane Smith"}
}
async def find_customer_by_email(email: str) -> Dict[str, Any]:
"""Find a customer by their email address."""
for customer_id, customer in CUSTOMERS_DB.items():
if customer["email"] == email:
return {"customer_id": customer_id, **customer}
return {"error": "Customer not found"}
async def get_orders_by_customer_id(customer_id: str) -> Dict[str, Any]:
"""Get all orders for a specific customer."""
orders = []
for order_id, order in ORDERS_DB.items():
if order["customer_id"] == customer_id:
orders.append({"order_id": order_id, **order})
return {"orders": orders}
async def get_order_status(order_id: str) -> Dict[str, Any]:
"""Get the status of a specific order."""
if order_id in ORDERS_DB:
return {"order_id": order_id, "status": ORDERS_DB[order_id]["status"]}
return {"error": "Order not found"}
async def update_ticket_status(ticket_id: str, status: str) -> Dict[str, Any]:
"""Update the status of a support ticket."""
return {"ticket_id": ticket_id, "status": status, "updated": True}
async def escalate_to_human() -> Dict[str, Any]:
"""Escalate the current issue to a human agent."""
return {
"status": "escalated",
"message": "A human agent has been notified and will follow up shortly."
}
TOOL_MAP = {
"find_customer_by_email": find_customer_by_email,
"get_orders_by_customer_id": get_orders_by_customer_id,
"get_order_status": get_order_status,
"update_ticket_status": update_ticket_status,
"escalate_to_human": escalate_to_human
}
TOOL_SCHEMAS = [
{
"type": "function",
"function": {
"name": "find_customer_by_email",
"description": "Find a customer by their email address.",
"parameters": {
"type": "object",
"properties": {
"email": {"type": "string", "description": "Customer email address"}
},
"required": ["email"]
}
}
},
{
"type": "function",
"function": {
"name": "get_orders_by_customer_id",
"description": "Get all orders for a specific customer.",
"parameters": {
"type": "object",
"properties": {
"customer_id": {"type": "string", "description": "Customer ID"}
},
"required": ["customer_id"]
}
}
},
{
"type": "function",
"function": {
"name": "get_order_status",
"description": "Get the status of a specific order.",
"parameters": {
"type": "object",
"properties": {
"order_id": {"type": "string", "description": "Order ID"}
},
"required": ["order_id"]
}
}
},
{
"type": "function",
"function": {
"name": "update_ticket_status",
"description": "Update the status of a support ticket.",
"parameters": {
"type": "object",
"properties": {
"ticket_id": {"type": "string", "description": "Ticket ID"},
"status": {"type": "string", "description": "New status"}
},
"required": ["ticket_id", "status"]
}
}
},
{
"type": "function",
"function": {
"name": "escalate_to_human",
"description": "Escalate the current issue to a human agent.",
"parameters": {
"type": "object",
"properties": {},
"required": []
}
}
}
]
Define Agents
Now we’ll define our agents: a Planner and an Executor. The Planner decides which tools to call to address the customer’s request, and the Executor runs those calls and records the results in the conversation history. We also define a helper function that turns the tool outputs into a customer-facing reply.
class PlannerAgent:
def __init__(self, model: str = "gpt-4o"):
self.model = model
self.client = AsyncOpenAI(api_key=api_key)
async def run(self, task_history: List[Dict[str, Any]]) -> Tuple[List, str]:
"""Create a tool execution plan based on user input"""
response = await self.client.chat.completions.create(
model=self.model,
messages=task_history,
tools=TOOL_SCHEMAS,
tool_choice="auto"
)
message = response.choices[0].message
tool_calls = message.tool_calls or []
return tool_calls, message.content or ""
def initialize_history(self, ticket: Dict[str, Any]) -> List[Dict[str, Any]]:
"""Start conversation history from a ticket."""
system_prompt = """You are a helpful customer support agent for an e-commerce company.
Your job is to help customers with their inquiries about orders, products, and returns.
Use the available tools to gather information and take actions on behalf of the customer.
Always be polite, professional, and helpful."""
return [
{"role": "system", "content": system_prompt},
{"role": "user", "content": str(ticket)}
]
class ExecutorAgent:
async def run(self, tool_calls: List, task_history: List[Dict]) -> Dict[str, Any]:
"""Execute tool calls and update conversation history"""
tool_outputs = []
for call in tool_calls:
tool_name = call.function.name
args = json.loads(call.function.arguments)
func = TOOL_MAP.get(tool_name)
            # Record the error instead of skipping, so every tool call
            # still gets a matching tool message in the history.
            if func is None:
                output = {"error": f"Tool '{tool_name}' not found"}
            else:
                try:
                    output = await func(**args)
                except Exception as e:
                    output = {"error": str(e)}
task_history.append({
"role": "assistant",
"content": None,
"tool_calls": [{
"id": call.id,
"type": "function",
"function": {
"name": tool_name,
"arguments": call.function.arguments
}
}]
})
task_history.append({
"role": "tool",
"tool_call_id": call.id,
"content": json.dumps(output)
})
tool_outputs.append({"tool_name": tool_name, "output": output})
return {"task_history": task_history, "tool_outputs": tool_outputs}
async def generate_response(tool_outputs: List[Dict], model: str = "gpt-4o") -> str:
"""Generate a human-readable response based on tool outputs"""
client = AsyncOpenAI(api_key=api_key)
system_prompt = """You are a helpful customer support agent. IMPORTANT GUIDELINES:
1. When a customer asks about cancellation, ALWAYS check the order status first
2. EXPLICITLY inform the customer if cancellation is possible based on the status:
- If status is 'processing' or 'pending', tell them cancellation IS possible
- If status is 'shipped' or 'delivered', tell them cancellation is NOT possible
3. Always be polite, professional, and helpful"""
prompt = "Based on the tool outputs, generate a helpful response to the customer:\n\n"
for output in tool_outputs:
prompt += f"{output['tool_name']} result: {json.dumps(output['output'])}\n"
response = await client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
]
)
return response.choices[0].message.content
Evaluator Agent
The Evaluator Agent scores our multi-turn agent’s behavior against binary success criteria over the full simulated conversation. This moves beyond traditional input/output (I/O) pair evaluation and accommodates the stochastic, flexible nature of agent workflows.
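Here is a minimal sketch of such an evaluator. The judge prompt and the JSON verdict schema are our own conventions, not a fixed API, so adapt them freely:

class EvaluatorAgent:
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.client = AsyncOpenAI(api_key=api_key)

    async def run(self, conversation: List[Dict[str, Any]], criteria: List[str]) -> Dict[str, Any]:
        """Judge a full conversation transcript against binary success criteria."""
        judge_prompt = (
            "You are evaluating a customer support agent.\n\n"
            "Conversation:\n" + json.dumps(conversation, indent=2) + "\n\n"
            "For each criterion below, decide if the conversation satisfies it.\n"
            "Criteria:\n" + "\n".join(f"- {c}" for c in criteria) + "\n\n"
            'Answer in JSON: {"results": [{"criterion": str, "passed": bool, "reason": str}]}'
        )
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": judge_prompt}],
            response_format={"type": "json_object"},  # request a machine-readable verdict
        )
        verdict = json.loads(response.choices[0].message.content)
        verdict["passed_all"] = all(r["passed"] for r in verdict.get("results", []))
        return verdict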
Simulation Function
Below we define a function that simulates a conversation between our agent and a user; the resulting transcript is then scored by the Evaluator Agent.
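A sketch of the simulation loop: each turn, the Planner proposes tool calls, the Executor runs them, and a customer-facing reply is generated. The stopping rule (stop once the planner issues no more tool calls) and the turn cap are our assumptions; a fuller harness would also include an LLM-simulated customer replying each turn, while this sketch drives the loop from the initial ticket alone:

async def simulate(ticket: Dict[str, Any], criteria: List[str], max_turns: int = 5) -> Dict[str, Any]:
    planner = PlannerAgent()
    executor = ExecutorAgent()
    evaluator = EvaluatorAgent()

    history = planner.initialize_history(ticket)
    for _ in range(max_turns):
        tool_calls, content = await planner.run(history)
        if not tool_calls:
            # The planner answered directly, so the conversation is over.
            history.append({"role": "assistant", "content": content})
            break
        result = await executor.run(tool_calls, history)
        reply = await generate_response(result["tool_outputs"])
        history.append({"role": "assistant", "content": reply})

    return await evaluator.run(history, criteria)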
Running the Simulation
Now, let’s define a test ticket and our success criteria, then run the simulation:
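The ticket and criteria below are made up to match the mock databases: ORD123 belongs to [email protected] and is still "processing", so cancellation should be offered:

test_ticket = {
    "ticket_id": "TICKET001",
    "from": "[email protected]",
    "message": "Hi, I'd like to cancel my order ORD123. Is that still possible?",
}

success_criteria = [
    "The agent checked the status of order ORD123",
    "The agent told the customer that cancellation IS possible, since the order is still processing",
]

verdict = await simulate(test_ticket, success_criteria)  # use asyncio.run(...) outside a notebook
print(json.dumps(verdict, indent=2))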
Simulation Output
Here’s an example of the kind of output you might see when running this simulation (LLM outputs are stochastic, so the exact wording varies between runs):
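{
  "results": [
    {
      "criterion": "The agent checked the status of order ORD123",
      "passed": true,
      "reason": "get_order_status was called and returned 'processing'."
    },
    {
      "criterion": "The agent told the customer that cancellation IS possible, since the order is still processing",
      "passed": true,
      "reason": "The final reply explicitly offers to cancel the order."
    }
  ],
  "passed_all": true
}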
Conclusion
Traditional evaluation methods that rely on fixed input-output pairs are insufficient for multi-turn conversational agents. By simulating complete conversations and evaluating against outcome-based criteria, we can better assess an agent’s ability to handle real-world customer support scenarios.
Key benefits of this approach include:
- Flexibility in solution paths: The agent can take different valid approaches to solve the same problem
- Focus on outcomes: Evaluation is based on what the agent accomplishes, not how it gets there
- Adaptability to new information: The agent can adjust its strategy based on information discovered during the conversation
- Realistic assessment: The evaluation better reflects how agents would perform in real-world scenarios
As you develop your own multi-turn agents, consider implementing this simulation-based evaluation approach to get a more accurate picture of their performance and to identify specific areas for improvement.
For the full notebook, check it out on GitHub.