Evaluating Tool Selection
Understand how to evaluate tool selection in your LLM application, covering tool definitions, evaluation datasets, and precision and recall metrics for tool calls.
In this cookbook, we demonstrate how to evaluate tool calling capabilities in LLM applications using objective metrics. As always, we’ll focus on data-driven approaches to measure and improve tool selection performance.
When building AI assistants, we often need them to use external tools - searching databases, calling APIs, or processing data. But how do we know if our model is selecting the right tools at the right time? Traditional evaluation methods don’t capture this well.
Imagine you’re building a customer service bot. A user asks “What’s my account balance?” Your assistant needs to decide: should it query the account database, ask for authentication, or simply respond with general information? Selecting the wrong tool leads to either frustrated users (if important tools are missed) or wasted resources (if unnecessary tools are called).
The key insight is that tool selection quality is distinct from text generation quality. You can have a model that writes beautiful responses but consistently fails to take appropriate actions. By measuring precision and recall of tool selection decisions, we can systematically improve how our models interact with the world around them.
Requirements
Before starting, ensure you have the following packages installed:
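At a minimum, the code in this cookbook assumes the OpenAI Python client, the LangWatch SDK, and pandas for tabulating results:

```bash
pip install openai langwatch pandas
```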
Setup
Start by setting up LangWatch to monitor your tool-calling application:
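A minimal setup sketch, assuming the LangWatch Python SDK exposes `langwatch.setup()` and a `@langwatch.trace()` decorator and reads its API key from the environment; the model name is just a placeholder, and you should check the LangWatch docs for the exact initialization in your SDK version:

```python
import os

import langwatch
from openai import OpenAI

# Assumption: the LangWatch SDK reads LANGWATCH_API_KEY from the environment;
# see the LangWatch docs for the exact initialization call in your SDK version.
os.environ.setdefault("LANGWATCH_API_KEY", "your-langwatch-api-key")
langwatch.setup()

client = OpenAI()  # reads OPENAI_API_KEY from the environment


@langwatch.trace()  # traces each call so tool selections show up in LangWatch
def call_model(messages, tools, model="gpt-4o-mini"):
    return client.chat.completions.create(model=model, messages=messages, tools=tools)
```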
Metrics
To start evaluating, you need to do 3 things:
- Define the tools that your model can call
- Define an evaluation dataset of queries and corresponding expected tool calls
- Define a function to calculate precision and recall
Before defining our tools, let’s take a look at the metrics we will be working with. In contrast to RAG, we will be using a different set of metrics for evaluating tool calling, namely precision and recall.
Remember:
- Precision: The ratio of correct tool calls to total tool calls
- Recall: The ratio of correct tool calls to total expected tool calls
In RAG, precision was less important because we could rely on the model to ignore irrelevant retrieved documents. In tool calling, precision matters a lot. For example, suppose the model calls the following tools: get calendar events, create reminder, and send email about the event. If all we really cared about was what time an event is, the reminder and the email are unnecessary. As opposed to RAG, the model won’t filter these tools out for us (technically you could chain it with another LLM to do this, but that is not standard practice); it will call them, leading to increased latency and cost. Recall is, just like in standard RAG, important: if we’re not calling the right tools, we miss actions the user actually needs.
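A minimal sketch of such a function: it treats expected and actual tool calls as sets of tool names, ignoring arguments and call order. With this convention an empty set of actual calls scores 0.0 precision, which matches the first row of the results table below.

```python
def tool_selection_metrics(expected: list[str], actual: list[str]) -> tuple[float, float]:
    """Precision and recall over tool names, ignoring arguments and order."""
    expected_set, actual_set = set(expected), set(actual)
    correct = expected_set & actual_set

    # Precision: of the tools the model called, how many were expected?
    precision = len(correct) / len(actual_set) if actual_set else 0.0
    # Recall: of the tools we expected, how many did the model actually call?
    recall = len(correct) / len(expected_set) if expected_set else 1.0
    return precision, recall
```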
Defining Tools
Let’s start by defining our tools. When starting out, you can define a small set of 3-4 tools to evaluate. Once the evaluation framework is set in place, you can scale the number of tools to evaluate. For this application, I’ll be looking at 3 tools: get calendar events, create reminder, and send email about the event.
We’ll use OpenAI’s API to call tools. Note that OpenAI’s tools parameter expects the functions to be defined in a specific schema. In the utils folder, we define a function that takes a Python function as input and returns a schema in the format that OpenAI expects.
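Here is an illustrative sketch: the tool bodies are stubs (what matters for selection is the name, description, and parameters the model sees), and `function_to_schema` stands in for the utils helper mentioned above.

```python
def get_calendar_events(date_range: str) -> str:
    """Retrieve calendar events for the given date range."""
    return f"Events for {date_range}: ..."  # stub; real logic isn't needed for selection eval


def create_reminder(title: str, due_date: str) -> str:
    """Create a reminder with a title and a due date."""
    return f"Reminder '{title}' set for {due_date}"  # stub


def send_email(to: str, subject: str, body: str) -> str:
    """Send an email about an event to the given recipient."""
    return f"Email sent to {to}"  # stub


# function_to_schema (from the utils folder) returns entries shaped like:
# {"type": "function", "function": {"name": ..., "description": ..., "parameters": {...}}}
tools = [function_to_schema(f) for f in (get_calendar_events, create_reminder, send_email)]
```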
Define an Eval Set
Now that we have our tools defined, we can define an eval set. I’ll test the model on its ability to call a single tool as well as combinations of two tools.
Note that you don’t need a lot of examples to begin with. The first few tests are used to set up an evaluation framework that can scale with you.
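A sketch of what this can look like; the queries mirror the results table below, with each entry pairing a query with the tool names we expect to be called (the recipient address from the original email query is left out here):

```python
eval_set = [
    {"query": "Send an email to the team about the project update",
     "expected_tools": ["send_email"]},
    {"query": "What meetings do I have scheduled for tomorrow?",
     "expected_tools": ["get_calendar_events"]},
    {"query": "Set a reminder for my dentist appointment next week",
     "expected_tools": ["create_reminder"]},
    {"query": "Check my calendar for next week's meetings and set reminders for each one",
     "expected_tools": ["get_calendar_events", "create_reminder"]},
    {"query": "Look up my team meeting schedule and send the agenda to all participants",
     "expected_tools": ["get_calendar_events", "send_email"]},
    {"query": "Set a reminder for the client call and send a confirmation email to the team",
     "expected_tools": ["create_reminder", "send_email"]},
]
```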
Run the Tests
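A sketch of the evaluation loop, assuming the `call_model`, `tools`, `eval_set`, and `tool_selection_metrics` pieces defined above; it records which tools the model chose, how long each call took, and the per-query precision and recall:

```python
import time

import pandas as pd

rows = []
for example in eval_set:
    start = time.time()
    response = call_model(
        messages=[{"role": "user", "content": example["query"]}],
        tools=tools,
    )
    elapsed = time.time() - start

    # Extract the names of the tools the model chose to call (if any)
    tool_calls = response.choices[0].message.tool_calls or []
    actual = [call.function.name for call in tool_calls]
    precision, recall = tool_selection_metrics(example["expected_tools"], actual)

    rows.append({
        "query": example["query"],
        "expected": example["expected_tools"],
        "actual": actual,
        "time": round(elapsed, 2),
        "precision": precision,
        "recall": recall,
    })

results = pd.DataFrame(rows)
results
```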
| query | expected | actual | time (s) | precision | recall |
|---|---|---|---|---|---|
| Send an email to [email protected] about the project update | [send_email] | [] | 0.90 | 0.0 | 0.0 |
| What meetings do I have scheduled for tomorrow? | [get_calendar_events] | [get_calendar_events] | 0.88 | 1.0 | 1.0 |
| Set a reminder for my dentist appointment next week | [create_reminder] | [create_reminder] | 1.37 | 1.0 | 1.0 |
| Check my calendar for next week’s meetings and set reminders for each one | [get_calendar_events, create_reminder] | [get_calendar_events] | 1.06 | 1.0 | 0.5 |
| Look up my team meeting schedule and send the agenda to all participants | [get_calendar_events, send_email] | [get_calendar_events] | 1.19 | 1.0 | 0.5 |
| Set a reminder for the client call and send a confirmation email to the team | [create_reminder, send_email] | [create_reminder, send_email] | 1.97 | 1.0 | 1.0 |
Our evaluation reveals interesting patterns in the model’s tool selection behavior: The model demonstrates good precision in tool selection - when it chooses to invoke a tool, it’s typically the right one for the task. This suggests the model has a strong understanding of each tool’s use cases. However, we observe lower recall scores in scenarios requiring multiple tool coordination. The model sometimes fails to recognize when a complex query necessitates multiple tools working together.
Consider the query: “Look up my team meeting schedule and send the agenda to all participants.” This requires:
- Retrieving calendar information (`get_calendar_events`)
- Composing and sending an email (`send_email`)

Yet the model only called `get_calendar_events`, missing the email step entirely.
We should also break down recall by tool category to identify which types of tools the model handles well and where it struggles. This can guide improvements like refining tool descriptions, renaming functions for clarity, or even removing tools that aren’t adding value.
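One way to compute this breakdown from the per-query rows collected in the evaluation loop above:

```python
from collections import defaultdict

expected_calls = defaultdict(int)  # how often each tool should have been called
correct_calls = defaultdict(int)   # how often it was actually called when expected

for row in rows:
    for tool in row["expected"]:
        expected_calls[tool] += 1
        if tool in row["actual"]:
            correct_calls[tool] += 1

per_tool_recall = {
    tool: correct_calls[tool] / expected_calls[tool] for tool in expected_calls
}
```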
| tool | correct_calls | expected_calls | recall |
|---|---|---|---|
| get_calendar_events | 3 | 3 | 1.00 |
| create_reminder | 2 | 3 | 0.67 |
| send_email | 1 | 3 | 0.33 |
The model shows a clear preference hierarchy, with calendar queries being handled most reliably, followed by reminders, and then emails. This suggests that:
- The `send_email` tool may need improved descriptions or examples to better match user query patterns
- Multi-tool coordination needs enhancement, particularly for action-oriented tools
This tool-specific analysis helps us target improvements where they’ll have the most impact, rather than making general changes to the entire system.
Conclusion
In this cookbook, we’ve demonstrated how to evaluate tool calling capabilities using objective metrics like precision and recall. By systematically analyzing tool selection performance, we’ve gained valuable insights into where our model excels and where it needs improvement.
Our evaluation revealed that the model achieves high precision (consistently selecting appropriate tools when it does make a selection) but struggles with recall for certain tools, particularly when multiple tools need to be coordinated. The `send_email` tool showed the lowest recall (0.33), indicating it’s frequently overlooked even when needed.
This data-driven approach to tool evaluation offers several advantages over traditional methods:
- It provides objective metrics that can be tracked over time
- It identifies specific tools that need improvement rather than general system issues
- It highlights patterns in the model’s decision-making process that might not be obvious from manual testing
When building your own tool-enabled AI systems, remember that tool selection is as critical as the quality of the generated text. A model that writes beautifully but fails to take appropriate actions will ultimately disappoint users. By measuring precision and recall at both the query and tool level, you can systematically improve your system’s ability to take the right actions at the right time.
For the full notebook, check it out on GitHub.