Understand how to evaluate tools and components in your RAG pipeline—covering retrievers, embedding models, chunking strategies, and vector stores.
query | expected | actual | time | precision | recall |
---|---|---|---|---|---|
Send an email to [email protected] about the project update | [send_email] | [] | 0.90 | 0.0 | 0.0 |
What meetings do I have scheduled for tomorrow? | [get_calendar_events] | [get_calendar_events] | 0.88 | 1.0 | 1.0 |
Set a reminder for my dentist appointment next week | [create_reminder] | [create_reminder] | 1.37 | 1.0 | 1.0 |
Check my calendar for next week’s meetings and set reminders for each one | [get_calendar_events, create_reminder] | [get_calendar_events] | 1.06 | 1.0 | 0.5 |
Look up my team meeting schedule and send the agenda to all participants | [get_calendar_events, send_email] | [get_calendar_events] | 1.19 | 1.0 | 0.5 |
Set a reminder for the client call and send a confirmation email to the team | [create_reminder, send_email] | [create_reminder, send_email] | 1.97 | 1.0 | 1.0 |
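
The precision and recall columns above can be reproduced with a set comparison between the expected and actual tool lists. The following is a minimal sketch, not the original evaluation code: the function name `precision_recall` is hypothetical, and scoring an empty denominator as 0.0 is an assumption chosen to match the 0.0 precision shown for the row where no tool was called.

```python
def precision_recall(expected, actual):
    """Set-based precision and recall for one query's tool calls."""
    expected_set, actual_set = set(expected), set(actual)
    correct = expected_set & actual_set
    # Assumed convention: an empty denominator scores 0.0 rather than raising.
    precision = len(correct) / len(actual_set) if actual_set else 0.0
    recall = len(correct) / len(expected_set) if expected_set else 0.0
    return precision, recall


# Fourth row of the table: two tools expected, only one called.
print(precision_recall(
    ["get_calendar_events", "create_reminder"],
    ["get_calendar_events"],
))  # (1.0, 0.5)
```

Precision stays at 1.0 in that row because the single tool that was called was correct; recall drops to 0.5 because one of the two expected tools was missed.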

Per-tool recall across the six queries:

tool | correct_calls | expected_calls | recall |
---|---|---|---|
get_calendar_events | 3 | 3 | 1.00 |
create_reminder | 2 | 3 | 0.67 |
send_email | 1 | 3 | 0.33 |
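
The per-tool figures can be derived by counting, for each tool, how many queries expected it and how many of those actually called it. This sketch assumes the per-query results are available as `(expected, actual)` pairs; the `results` list is copied from the per-query table, and `per_tool_recall` is a hypothetical helper name rather than part of any library.

```python
from collections import Counter

# (expected, actual) tool lists, copied from the per-query table.
results = [
    (["send_email"], []),
    (["get_calendar_events"], ["get_calendar_events"]),
    (["create_reminder"], ["create_reminder"]),
    (["get_calendar_events", "create_reminder"], ["get_calendar_events"]),
    (["get_calendar_events", "send_email"], ["get_calendar_events"]),
    (["create_reminder", "send_email"], ["create_reminder", "send_email"]),
]


def per_tool_recall(results):
    """Return {tool: (correct_calls, expected_calls, recall)}."""
    expected_calls = Counter()
    correct_calls = Counter()
    for expected, actual in results:
        for tool in set(expected):
            expected_calls[tool] += 1
            if tool in actual:
                correct_calls[tool] += 1
    return {
        tool: (correct_calls[tool], n, round(correct_calls[tool] / n, 2))
        for tool, n in expected_calls.items()
    }


tool_stats = per_tool_recall(results)
for tool, (correct, expected, recall) in tool_stats.items():
    print(f"{tool} | {correct} | {expected} | {recall:.2f}")
```

Aggregating by tool rather than by query makes it easier to see which tool is being missed, which a per-query average would hide.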

- The `send_email` tool showed the lowest recall (0.33), indicating it’s frequently overlooked even when needed.
- The `send_email` tool may need improved descriptions or examples to better match user query patterns.

This data-driven approach to tool evaluation offers several advantages over traditional methods: