1. The Problem
Our LLM’s job is to interpret short chat messages and extract key details for a ride booking.
- Input: A vague user message like "Schiphol, 2 people" or "Herengracht 500 now".
- Output: A structured JSON object with the booking details (`pickup_address`, `airport_found`, and `passenger_count`), even when the input is incomplete; an example is shown below.
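For instance, for "Schiphol, 2 people", the target output might look like the sketch below. This is purely illustrative: we read Schiphol as the pickup point here, and the exact schema is whatever your pipeline defines.

```python
# Illustrative target output for "Schiphol, 2 people".
# Assumption: Schiphol is the pickup point and no destination was given.
expected_output = {
    "pickup_address": "Schiphol Airport",  # normalized from "Schiphol"
    "airport_found": True,                 # Schiphol is an airport
    "passenger_count": 2,                  # parsed from "2 people"
}
```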
2. Setup and Data Preparation
First, let’s set up our environment and create a simple dataset for the evaluation. Our dataset will be a pandas DataFrame with a `user_message` column and a `ground_truth` column containing the expected JSON output.
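A minimal sketch of such a dataset, using the two example messages from above; the expected values are hand-written and purely illustrative:

```python
import pandas as pd

# Each row pairs a raw user message with the structured output we expect
# the extraction pipeline to produce (stored as a plain dict here).
df = pd.DataFrame(
    [
        {
            "user_message": "Schiphol, 2 people",
            "ground_truth": {
                "pickup_address": "Schiphol Airport",
                "airport_found": True,
                "passenger_count": 2,
            },
        },
        {
            "user_message": "Herengracht 500 now",
            "ground_truth": {
                "pickup_address": "Herengracht 500",
                "airport_found": False,
                "passenger_count": None,  # not mentioned in the message
            },
        },
    ]
)
```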
3. Define the Extraction Logic
Next, we’ll define a placeholder function, `extract_booking_details()`, that simulates our LLM pipeline. This function takes a user message and returns a JSON object with the extracted details.
This is where you would integrate your actual LLM calls (e.g., using OpenAI, Anthropic, or a local model).
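Here is a stub that keeps the walkthrough runnable end to end. The commented-out lines sketch what a real OpenAI call could look like; the model name and prompt are placeholders, not recommendations:

```python
import json  # used once you swap in a real model call


def extract_booking_details(user_message: str) -> dict:
    """Placeholder for the real LLM extraction pipeline."""
    # A real implementation might look roughly like this (untested sketch):
    # response = client.chat.completions.create(
    #     model="gpt-4o-mini",  # placeholder model name
    #     response_format={"type": "json_object"},
    #     messages=[
    #         {"role": "system", "content": "Extract pickup_address, "
    #          "airport_found and passenger_count from the message as JSON."},
    #         {"role": "user", "content": user_message},
    #     ],
    # )
    # return json.loads(response.choices[0].message.content)

    # Hard-coded heuristic stub so the evaluation script below can run.
    return {
        "pickup_address": None,
        "airport_found": "schiphol" in user_message.lower(),
        "passenger_count": 2 if "2 people" in user_message else None,
    }
```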
4. Implementing the Evaluation Script
Now, let’s tie it all together with LangWatch. We’ll initialize an evaluation, loop through our dataset, call our model, and log custom metrics to track the accuracy of each extracted field. This script gives us a precise, field-by-field view of our model’s performance.
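A sketch of that script follows. The LangWatch calls (`langwatch.evaluation.init`, `evaluation.loop`, `evaluation.log`) follow the shape of LangWatch’s evaluation-tracking SDK, but check names and signatures against the current docs; the per-field metrics and the simple `hallucination_check` heuristic are choices made for this example:

```python
import langwatch

# Assumes your LangWatch API key is configured in the environment.
evaluation = langwatch.evaluation.init("booking-extraction-eval")

FIELDS = ["pickup_address", "airport_found", "passenger_count"]

for index, row in evaluation.loop(df.iterrows()):
    predicted = extract_booking_details(row["user_message"])
    expected = row["ground_truth"]

    # One pass/fail metric per field gives a field-by-field accuracy view
    # in the dashboard instead of a single aggregate score.
    for field in FIELDS:
        evaluation.log(
            f"{field}_correct",
            index=index,
            passed=predicted.get(field) == expected.get(field),
        )

    # Simple hallucination check: fail if the model invents a
    # destination_address that is not in the ground truth.
    evaluation.log(
        "hallucination_check",
        index=index,
        passed=not (
            predicted.get("destination_address")
            and not expected.get("destination_address")
        ),
    )
```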
5. Analyzing the Results
After running the script, you can navigate to the LangWatch dashboard to see your results. You’ll get:
- High-Level Metrics: An overview of correctness scores across your dataset.
- Sample-by-Sample Breakdown: The ability to inspect each user message, see the model’s output vs. the expected output, and identify exactly where it failed.
- Historical Tracking: A record of all your evaluation runs, so you can easily compare model versions and track improvements over time.
For example, you can drill into a sample where the `hallucination_check` failed to debug why your model is inventing a `destination_address`. This level of detail is crucial for iterating on your prompts and improving model reliability.