Evaluation
List of Evaluators
Find the evaluator for your use case
LangWatch offers an extensive library of evaluators to help you evaluate the quality and guarantee the safety of your LLM apps.
Evaluators List
Evaluator | Description |
---|---|
Exact Match Evaluator | A simple evaluator that checks if the output matches the expected_output exactly. |
LLM Answer Match | Uses an LLM to check if the generated output answers a question correctly the same way as the expected output, even if their style is different. |
LLM Factual Match | Computes with an LLM how factually similar the generated answer is to the expected output. |
SQL Query Equivalence | Checks if the SQL query is equivalent to a reference one by using an LLM to infer if it would generate the same results given the table schemas. |
ROUGE Score | Traditional NLP metric. ROUGE score for evaluating the similarity between two strings. |
BLEU Score | Traditional NLP metric. BLEU score for evaluating the similarity between two strings. |
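
To make concrete what the simpler evaluators in this group compute, here is a minimal, self-contained sketch of an exact match check and a naive token-overlap score. It is illustrative only; the hosted ROUGE and BLEU evaluators use the standard metric implementations rather than this simplification.

```python
def exact_match(output: str, expected_output: str) -> bool:
    """Passes only if the generated output matches the expected output exactly."""
    return output.strip() == expected_output.strip()


def token_overlap(output: str, expected_output: str) -> float:
    """Naive unigram-overlap score in [0, 1], a rough stand-in for ROUGE/BLEU-style similarity."""
    out_tokens = output.lower().split()
    exp_tokens = expected_output.lower().split()
    if not out_tokens or not exp_tokens:
        return 0.0
    overlap = sum(1 for token in out_tokens if token in exp_tokens)
    return overlap / len(out_tokens)


print(exact_match("Paris", "Paris"))  # True
print(token_overlap("The capital is Paris", "Paris is the capital of France"))
```
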
Evaluator | Description |
---|---|
LLM-as-a-Judge Boolean Evaluator | Use an LLM as a judge with a custom prompt to do a true/false boolean evaluation of the message. |
LLM-as-a-Judge Score Evaluator | Use an LLM as a judge with a custom prompt to do a numeric score evaluation of the message. |
LLM-as-a-Judge Category Evaluator | Use an LLM as a judge with a custom prompt to classify the message into custom defined categories. |
Rubrics Based Scoring | Rubric-based evaluation metric used to score responses. The rubric consists of descriptions for each score, typically ranging from 1 to 5. |
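
The LLM-as-a-judge evaluators all follow the same pattern: a custom prompt describing the criteria is sent to a judge model, and its verdict is parsed into a boolean, score, or category. Below is a minimal sketch of the boolean variant, assuming an OpenAI-compatible client and a hypothetical politeness criterion; the hosted evaluators handle the prompting, parsing, and model configuration for you.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a customer support reply.
Answer strictly with "true" if the reply is polite and addresses the question, "false" otherwise.

Question: {question}
Reply: {reply}"""


def llm_judge_boolean(question: str, reply: str) -> bool:
    """Boolean LLM-as-a-judge: ask the model for a true/false verdict with a custom prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, reply=reply)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("true")


print(llm_judge_boolean("How do I reset my password?", "Just click 'Forgot password' on the login page."))
```
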
Evaluator | Description |
---|---|
Ragas Faithfulness | This evaluator assesses the extent to which the generated answer is consistent with the provided context. Higher scores indicate better faithfulness to the context, useful for detecting hallucinations. |
Ragas Response Relevancy | Evaluates how pertinent the generated answer is to the given prompt. Higher scores indicate better relevancy. |
Ragas Response Context Recall | Uses an LLM to measure how many of the claims in the expected output can be attributed to the retrieved context, indicating whether all the documents needed to generate the expected output were successfully retrieved. |
Ragas Response Context Precision | Uses an LLM to measure the proportion of chunks in the retrieved context that were relevant to generating the output or the expected output. |
Context F1 | Balances precision and recall for context retrieval; a higher score means a better signal-to-noise ratio. Uses traditional string distance metrics. |
Context Precision | Measures how accurate the retrieval is compared to the expected contexts; a higher score means less noise in the retrieval. Uses traditional string distance metrics. |
Context Recall | Measures how many of the expected contexts were actually retrieved; a higher score means more signal in the retrieval. Uses traditional string distance metrics. |
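
For the traditional (non-LLM) context metrics above, the intuition is easiest to see with a simplified sketch that uses exact matching between retrieved and expected contexts; the actual evaluators use string distance metrics rather than exact equality.

```python
def context_precision(retrieved: list[str], expected: list[str]) -> float:
    """Fraction of retrieved contexts that are relevant (i.e. match an expected context)."""
    if not retrieved:
        return 0.0
    return sum(1 for ctx in retrieved if ctx in expected) / len(retrieved)


def context_recall(retrieved: list[str], expected: list[str]) -> float:
    """Fraction of expected contexts that were actually retrieved."""
    if not expected:
        return 0.0
    return sum(1 for ctx in expected if ctx in retrieved) / len(expected)


def context_f1(retrieved: list[str], expected: list[str]) -> float:
    """Harmonic mean of context precision and context recall."""
    p, r = context_precision(retrieved, expected), context_recall(retrieved, expected)
    return 2 * p * r / (p + r) if (p + r) else 0.0


retrieved = ["doc_a", "doc_b", "doc_x"]
expected = ["doc_a", "doc_b", "doc_c"]
print(context_precision(retrieved, expected), context_recall(retrieved, expected), context_f1(retrieved, expected))
```
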
Evaluator | Description |
---|---|
Lingua Language Detection | This evaluator detects the language of the input and output text to check, for example, whether the generated answer is in the same language as the prompt, or whether it is in a specific expected language. |
Summarization Score | Measures how well the summary captures important information from the retrieved contexts. |
Valid Format Evaluator | Allows you to check if the output is valid JSON, Markdown, Python, SQL, etc. For JSON, it can optionally validate against a provided schema. |
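
As an illustration of what the Valid Format Evaluator checks for JSON, here is a rough sketch using the standard `json` module and the `jsonschema` package; the function name and result shape are ours, not the evaluator's actual implementation.

```python
import json

import jsonschema  # pip install jsonschema


def is_valid_json(output: str, schema: dict | None = None) -> bool:
    """Checks that the output parses as JSON and, optionally, conforms to a JSON Schema."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    if schema is not None:
        try:
            jsonschema.validate(parsed, schema)
        except jsonschema.ValidationError:
            return False
    return True


schema = {"type": "object", "properties": {"name": {"type": "string"}}, "required": ["name"]}
print(is_valid_json('{"name": "Ada"}', schema))  # True
print(is_valid_json('{"name": 42}', schema))     # False
```
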
Evaluator | Description |
---|---|
PII Detection | Detects personally identifiable information in text, including phone numbers, email addresses, and social security numbers. It allows customization of the detection threshold and the specific types of PII to check. |
Prompt Injection / Jailbreak Detection | Detects prompt injection and jailbreak attempts in the input. |
Content Safety | Detects potentially unsafe content in text, including hate speech, self-harm, sexual content, and violence. It allows customization of the severity threshold and the specific categories to check. |
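
As a rough illustration of the PII Detection idea, a naive regex-based check might look like the sketch below; the hosted evaluator relies on a proper PII detection service with configurable thresholds and entity types rather than hand-written patterns.

```python
import re

# Naive illustrative patterns only; the hosted evaluator uses a dedicated PII detection service.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_pii(text: str, types: list[str] | None = None) -> dict[str, list[str]]:
    """Returns any matches for the selected PII types found in the text."""
    selected = types or list(PII_PATTERNS)
    return {t: PII_PATTERNS[t].findall(text) for t in selected if PII_PATTERNS[t].search(text)}


print(detect_pii("Contact me at jane@example.com or 555-123-4567."))
```
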
Running Evaluations
Set up your first evaluation using the Evaluation Wizard:

Instrumenting Custom Evaluator
If you have a custom evaluator built in-house, you can follow the guide below to integrate it.
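
As a point of reference, an in-house evaluator is typically just a function that takes the generated output (plus any other inputs you need) and returns a pass/fail or score result. The sketch below uses illustrative names and a hypothetical word-budget check, not LangWatch's API; how to report such results to LangWatch is covered in the guide.

```python
from dataclasses import dataclass


@dataclass
class EvaluationResult:
    """Illustrative result shape for an in-house evaluator: a verdict, a score, and details."""
    passed: bool
    score: float
    details: str


def evaluate_answer_length(output: str, max_words: int = 120) -> EvaluationResult:
    """Example in-house check: fail answers that exceed a word budget."""
    words = len(output.split())
    return EvaluationResult(
        passed=words <= max_words,
        score=min(1.0, max_words / max(words, 1)),
        details=f"{words} words (limit {max_words})",
    )


print(evaluate_answer_length("A short, on-budget answer."))
```
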