Evaluating
Measure the quality of your LLM workflows
The Importance of Evaluation
Evaluators are essential tools for measuring LLM output quality. When you have a reliable way to measure quality, it becomes much easier to:
Compare different LLM models
Test prompt variations
Validate feature additions
Ensure quality remains consistent during upgrades
Types of Evaluators
In the video, a few evaluators are introduced:
Exact Match Evaluator (0:56)
The simplest form of evaluation, perfect for classification tasks:
Compares LLM output directly with expected output
Uses straightforward string matching
Ideal for categorical outputs where precision is crucial
Works well when you need strict matching
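As a rough illustration of what this evaluator checks, here is a minimal Python sketch of an exact match comparison. It is a conceptual stand-in, not LangWatch's implementation; whether the built-in evaluator normalizes case and whitespace is an assumption made here.
```python
# Conceptual sketch of an exact match check (not the LangWatch implementation).
# Normalization (strip/lowercase) is an assumption for readability of the example.
def exact_match(output: str, expected_output: str) -> bool:
    return output.strip().lower() == expected_output.strip().lower()

# Example: a classification task where the model must answer with a category label.
print(exact_match("Billing", "billing"))        # True
print(exact_match("Billing issue", "billing"))  # False
```
Because the comparison is strict, any deviation from the expected label counts as a failure, which is exactly what you want for categorical outputs.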
Answer Correctness Evaluator (4:44)
Comparison with golden answers for factual accuracy:
Uses another LLM to assess if answers are factually equivalent
Looks beyond exact wording to evaluate semantic meaning
Particularly useful for QA systems and knowledge-based tasks
Can handle variations in phrasing while maintaining accuracy checking
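The idea can be sketched as a second LLM call that judges whether two answers are factually equivalent. The snippet below is only a conceptual sketch, not LangWatch's evaluator or its prompt; it assumes the `openai` Python package, an `OPENAI_API_KEY` in the environment, and the model name `gpt-4o-mini`.
```python
from openai import OpenAI  # assumes an OpenAI-compatible client; any provider works

client = OpenAI()

def answer_correctness(question: str, output: str, golden_answer: str) -> bool:
    """Ask a second LLM whether the output is factually equivalent to the golden answer."""
    prompt = (
        f"Question: {question}\n"
        f"Golden answer: {golden_answer}\n"
        f"Model answer: {output}\n\n"
        "Is the model answer factually equivalent to the golden answer, "
        "ignoring differences in wording? Reply with only 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name for the sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```
The judge looks past wording, so "Paris" and "The capital of France is Paris" can both pass against the same golden answer.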
LLM as Judge Evaluator (7:01)
Flexible evaluation for custom criteria:
Allows custom prompts to define evaluation criteria
Useful when you don’t have expected outputs
Can evaluate subjective qualities (conciseness, tone, style)
Returns boolean (true/false) or scored (0-1) results
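A custom-criteria judge follows the same pattern, but scores the output against a free-form prompt instead of a golden answer. Again, this is a sketch under the same assumptions (the `openai` package, an assumed model name, and a hypothetical criteria string), not the studio's built-in evaluator.
```python
from openai import OpenAI  # assumes an OpenAI-compatible client

client = OpenAI()

def llm_judge(output: str, criteria: str) -> float:
    """Score an output from 0 to 1 against free-form criteria (no expected output needed)."""
    prompt = (
        f"Criteria: {criteria}\n\n"
        f"Response to evaluate:\n{output}\n\n"
        "On a scale from 0 to 1, how well does the response meet the criteria? "
        "Reply with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name for the sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())

# Hypothetical usage: judging tone and conciseness, qualities with no single "correct" answer.
score = llm_judge(
    "Sure! Your refund was processed today and should arrive within 3-5 business days.",
    "The reply is concise and keeps a friendly, professional tone.",
)
```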
Working with Evaluators
Setting Up Evaluators (1:32)
To implement an evaluator:
Drag and drop the desired evaluator onto your workflow
Connect appropriate inputs (output from LLM, expected output from dataset)
Configure any additional parameters or criteria
Run evaluation on individual examples or full test sets
Running Evaluations (2:28)
The evaluation process:
Select your test dataset
Choose appropriate evaluator
Run evaluation across all test examples
Review accuracy scores and individual results
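Conceptually, the dataset-level accuracy score is the evaluator's pass/fail results averaged over the test set. The sketch below illustrates that using the `exact_match` helper from the earlier example; `run_workflow` is a hypothetical stand-in for whatever LLM workflow you built in the studio.
```python
# Conceptual sketch of how a dataset-level accuracy score comes together:
# run the evaluator on every test example and average the pass/fail results.
test_set = [
    {"input": "Where is my order?", "expected_output": "order_status"},
    {"input": "I want my money back", "expected_output": "refund_request"},
]

results = []
for example in test_set:
    output = run_workflow(example["input"])  # hypothetical: your LLM workflow
    results.append(exact_match(output, example["expected_output"]))

accuracy = sum(results) / len(results)
print(f"Accuracy: {accuracy:.0%} ({sum(results)}/{len(results)} correct)")
```
The studio reports this aggregate score alongside the per-example results, so you can drill into individual failures after each run.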
Improving Results (9:14)
After setting up evaluation:
Make incremental changes to your workflow
Test impact immediately through re-evaluation
Track improvements in accuracy scores
Iterate on prompts and parameters based on results
Summary
Choose evaluators that match your quality criteria
Use multiple evaluators for different aspects of quality
Start with simple evaluators before moving to complex ones
Consider both strict and semantic matching depending on your use case
Use evaluation results to guide optimization efforts
The ability to properly evaluate LLM outputs sets the foundation for automated optimization, which will be covered in the next tutorial.