Evaluating
Measure the quality of your LLM workflows
The Importance of Evaluation
Evaluators are essential tools for measuring LLM output quality. When you have a reliable way to measure quality, it becomes much easier to:
Compare different LLM models
Test prompt variations
Validate feature additions
Ensure quality remains consistent during upgrades
Types of Evaluators
In the video, a few evaluators are introduced:
Exact Match Evaluator (0:56)
The simplest form of evaluation, perfect for classification tasks:
Compares LLM output directly with expected output
Uses straightforward string matching
Ideal for categorical outputs where precision is crucial
Works well when you need strict matching
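As a rough illustration of what this evaluator checks, here is a minimal Python sketch of an exact match comparison. It is a conceptual stand-in, not LangWatch's implementation; whether the built-in evaluator normalizes case and whitespace is an assumption made here.
```python
# Conceptual sketch of an exact match check (not the LangWatch implementation).
# Normalization (strip/lowercase) is an assumption for readability of the example.
def exact_match(output: str, expected_output: str) -> bool:
    return output.strip().lower() == expected_output.strip().lower()

# Example: a classification task where the model must answer with a category label.
print(exact_match("Billing", "billing"))        # True
print(exact_match("Billing issue", "billing"))  # False
```
Because the comparison is strict, any deviation from the expected label counts as a failure, which is exactly what you want for categorical outputs.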
Answer Correctness Evaluator (4:44)
Comparison with golden answers for factual accuracy:
Uses another LLM to assess if answers are factually equivalent
Looks beyond exact wording to evaluate semantic meaning
Particularly useful for QA systems and knowledge-based tasks
Can handle variations in phrasing while maintaining accuracy checking
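The idea can be sketched as a second LLM call that judges whether two answers are factually equivalent. The snippet below is only a conceptual sketch, not LangWatch's evaluator or its prompt; it assumes the `openai` Python package, an `OPENAI_API_KEY` in the environment, and the model name `gpt-4o-mini`.
```python
from openai import OpenAI  # assumes an OpenAI-compatible client; any provider works

client = OpenAI()

def answer_correctness(question: str, output: str, golden_answer: str) -> bool:
    """Ask a second LLM whether the output is factually equivalent to the golden answer."""
    prompt = (
        f"Question: {question}\n"
        f"Golden answer: {golden_answer}\n"
        f"Model answer: {output}\n\n"
        "Is the model answer factually equivalent to the golden answer, "
        "ignoring differences in wording? Reply with only 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name for the sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```
The judge looks past wording, so "Paris" and "The capital of France is Paris" can both pass against the same golden answer.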
LLM as Judge Evaluator (7:01)
Flexible evaluation for custom criteria:
Allows custom prompts to define evaluation criteria
Useful when you don’t have expected outputs
Can evaluate subjective qualities (conciseness, tone, style)
Returns boolean (true/false) or scored (0-1) results
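A custom-criteria judge follows the same pattern, but scores the output against a free-form prompt instead of a golden answer. Again, this is a sketch under the same assumptions (the `openai` package, an assumed model name, and a hypothetical criteria string), not the studio's built-in evaluator.
```python
from openai import OpenAI  # assumes an OpenAI-compatible client

client = OpenAI()

def llm_judge(output: str, criteria: str) -> float:
    """Score an output from 0 to 1 against free-form criteria (no expected output needed)."""
    prompt = (
        f"Criteria: {criteria}\n\n"
        f"Response to evaluate:\n{output}\n\n"
        "On a scale from 0 to 1, how well does the response meet the criteria? "
        "Reply with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name for the sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())

# Hypothetical usage: judging tone and conciseness, qualities with no single "correct" answer.
score = llm_judge(
    "Sure! Your refund was processed today and should arrive within 3-5 business days.",
    "The reply is concise and keeps a friendly, professional tone.",
)
```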
Working with Evaluators
Setting Up Evaluators (1:32)
To implement an evaluator:
Drag and drop the desired evaluator onto your workflow
Connect appropriate inputs (output from LLM, expected output from dataset)
Configure any additional parameters or criteria
Run evaluation on individual examples or full test sets
Running Evaluations (2:28)
The evaluation process:
Select your test dataset
Choose appropriate evaluator
Run evaluation across all test examples
Review accuracy scores and individual results
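Conceptually, the dataset-level accuracy score is the evaluator's pass/fail results averaged over the test set. The sketch below illustrates that using the `exact_match` helper from the earlier example; `run_workflow` is a hypothetical stand-in for whatever LLM workflow you built in the studio.
```python
# Conceptual sketch of how a dataset-level accuracy score comes together:
# run the evaluator on every test example and average the pass/fail results.
test_set = [
    {"input": "Where is my order?", "expected_output": "order_status"},
    {"input": "I want my money back", "expected_output": "refund_request"},
]

results = []
for example in test_set:
    output = run_workflow(example["input"])  # hypothetical: your LLM workflow
    results.append(exact_match(output, example["expected_output"]))

accuracy = sum(results) / len(results)
print(f"Accuracy: {accuracy:.0%} ({sum(results)}/{len(results)} correct)")
```
The studio reports this aggregate score alongside the per-example results, so you can drill into individual failures after each run.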
Improving Results (9:14)
After setting up evaluation:
Make incremental changes to your workflow
Test impact immediately through re-evaluation
Track improvements in accuracy scores
Iterate on prompts and parameters based on results
Summary
Choose evaluators that match your quality criteria
Use multiple evaluators for different aspects of quality
Start with simple evaluators before moving to complex ones
Consider both strict and semantic matching depending on your use case
Use evaluation results to guide optimization efforts
The ability to properly evaluate LLM outputs sets the foundation for automated optimization, which will be covered in the next tutorial.