Requirements
Before starting, ensure you have the following packages installed:
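The exact dependencies depend on your stack; for the examples in this cookbook, the LangWatch SDK and the OpenAI client should suffice (these two packages are an assumption, adjust to your setup):

```bash
pip install langwatch openai
```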
Setup
Start by setting up LangWatch to monitor your tool-calling application:
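A minimal sketch of the setup, assuming a recent LangWatch Python SDK that reads the API key from the `LANGWATCH_API_KEY` environment variable (the exact entry point may differ across SDK versions):

```python
import os
import langwatch

# Assumes an API key created in the LangWatch dashboard
os.environ["LANGWATCH_API_KEY"] = "your-api-key"

# Initialize the SDK so traces are sent to LangWatch
langwatch.setup()
```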
Metrics
To start evaluating, you need to do 3 things:
- Define the tools that your model can call
- Define an evaluation dataset of queries and corresponding expected tool calls
- Define a function to calculate precision and recall (a sketch follows this list):
  - Precision: the ratio of correct tool calls to the total tool calls the model made
  - Recall: the ratio of correct tool calls to the total expected tool calls
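Here is a minimal sketch of such a scoring function (the name `score_tool_calls` is illustrative). A call counts as correct when the called tool appears in the expected list, and empty lists are guarded against:

```python
from typing import List, Tuple

def score_tool_calls(expected: List[str], actual: List[str]) -> Tuple[float, float]:
    """Compute precision and recall for one query's tool calls."""
    correct = [tool for tool in actual if tool in expected]
    # Precision: of the calls the model made, how many were expected?
    precision = len(correct) / len(actual) if actual else 0.0
    # Recall: of the calls we expected, how many did the model make?
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall
```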
Defining Tools
Let’s start by defining our tools. When starting out, you can define a small set of 3-4 tools to evaluate; once the evaluation framework is in place, you can scale up the number of tools. For this application, I’ll be looking at 3 tools: get calendar events, create reminder, and send email about the event.
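A sketch of what these definitions might look like in the OpenAI function-calling format; only the tool names come from this cookbook, the descriptions and parameter schemas are illustrative assumptions:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_calendar_events",
            "description": "Retrieve calendar events for a given date range.",
            "parameters": {
                "type": "object",
                "properties": {
                    "start_date": {"type": "string", "description": "ISO 8601 date"},
                    "end_date": {"type": "string", "description": "ISO 8601 date"},
                },
                "required": ["start_date", "end_date"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "create_reminder",
            "description": "Create a reminder with a title and a due time.",
            "parameters": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "due": {"type": "string", "description": "ISO 8601 timestamp"},
                },
                "required": ["title", "due"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "send_email",
            "description": "Compose and send an email about an event.",
            "parameters": {
                "type": "object",
                "properties": {
                    "to": {"type": "array", "items": {"type": "string"}},
                    "subject": {"type": "string"},
                    "body": {"type": "string"},
                },
                "required": ["to", "subject", "body"],
            },
        },
    },
]
```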
Define an Eval Set
Now that we have our tools defined, we can define an eval set. I’ll test the model’s ability to call a single tool as well as a combination of two tools.
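A simple list of dicts is enough; the queries and expected tool calls below mirror the results table in the next section (the email address is redacted in the source and kept as-is):

```python
eval_set = [
    {"query": "Send an email to [email protected] about the project update",
     "expected": ["send_email"]},
    {"query": "What meetings do I have scheduled for tomorrow?",
     "expected": ["get_calendar_events"]},
    {"query": "Set a reminder for my dentist appointment next week",
     "expected": ["create_reminder"]},
    {"query": "Check my calendar for next week's meetings and set reminders for each one",
     "expected": ["get_calendar_events", "create_reminder"]},
    {"query": "Look up my team meeting schedule and send the agenda to all participants",
     "expected": ["get_calendar_events", "send_email"]},
    {"query": "Set a reminder for the client call and send a confirmation email to the team",
     "expected": ["create_reminder", "send_email"]},
]
```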
Run the Tests
Running each query through the model and scoring its tool calls gives the following results:

| query | expected | actual | time (s) | precision | recall |
|---|---|---|---|---|---|
| Send an email to [email protected] about the project update | [send_email] | [] | 0.90 | 0.0 | 0.0 |
| What meetings do I have scheduled for tomorrow? | [get_calendar_events] | [get_calendar_events] | 0.88 | 1.0 | 1.0 |
| Set a reminder for my dentist appointment next week | [create_reminder] | [create_reminder] | 1.37 | 1.0 | 1.0 |
| Check my calendar for next week’s meetings and set reminders for each one | [get_calendar_events, create_reminder] | [get_calendar_events] | 1.06 | 1.0 | 0.5 |
| Look up my team meeting schedule and send the agenda to all participants | [get_calendar_events, send_email] | [get_calendar_events] | 1.19 | 1.0 | 0.5 |
| Set a reminder for the client call and send a confirmation email to the team | [create_reminder, send_email] | [create_reminder, send_email] | 1.97 | 1.0 | 1.0 |
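A table like the one above can be produced with a loop along these lines; the sketch assumes the OpenAI chat completions API, the `tools` and `eval_set` lists from earlier, and the `score_tool_calls` helper (the model name is a placeholder):

```python
import time
from openai import OpenAI

client = OpenAI()
results = []

for case in eval_set:
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": case["query"]}],
        tools=tools,
    )
    elapsed = time.time() - start

    # Names of the tools the model chose to call (tool_calls may be None)
    message = response.choices[0].message
    actual = [tc.function.name for tc in (message.tool_calls or [])]

    precision, recall = score_tool_calls(case["expected"], actual)
    results.append({**case, "actual": actual, "time": elapsed,
                    "precision": precision, "recall": recall})
```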
The multi-tool failures show a consistent pattern. Take the agenda query: answering it fully requires two steps:

- Retrieving calendar information (`get_calendar_events`)
- Composing and sending an email (`send_email`)

but the model only performed the first step and never called `send_email`.
Aggregating recall by tool makes the pattern explicit:

| tool | correct_calls | expected_calls | recall |
|---|---|---|---|
| get_calendar_events | 3 | 3 | 1.00 |
| create_reminder | 2 | 3 | 0.67 |
| send_email | 1 | 3 | 0.33 |
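This breakdown can be derived from the per-query results; a minimal sketch, assuming the `results` list built in the run loop above:

```python
from collections import defaultdict

correct = defaultdict(int)
expected_total = defaultdict(int)

# Count, per tool, how often it was expected and how often it was actually called
for row in results:
    for tool in row["expected"]:
        expected_total[tool] += 1
        if tool in row["actual"]:
            correct[tool] += 1

for tool, total in expected_total.items():
    print(f"{tool}: {correct[tool]}/{total} correct, recall = {correct[tool] / total:.2f}")
```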
Two takeaways follow from this breakdown:

- The `send_email` tool may need improved descriptions or examples to better match user query patterns
- Multi-tool coordination needs enhancement, particularly for action-oriented tools
Conclusion
In this cookbook, we’ve demonstrated how to evaluate tool calling capabilities using objective metrics like precision and recall. By systematically analyzing tool selection performance, we’ve gained valuable insights into where our model excels and where it needs improvement. Our evaluation revealed that the model achieves high precision (consistently selecting appropriate tools when it does make a selection) but struggles with recall for certain tools, particularly when multiple tools need to be coordinated. The `send_email` tool showed the lowest recall (0.33), indicating it’s frequently overlooked even when needed.
This data-driven approach to tool evaluation offers several advantages over traditional methods:
- It provides objective metrics that can be tracked over time
- It identifies specific tools that need improvement rather than general system issues
- It highlights patterns in the model’s decision-making process that might not be obvious from manual testing