Evaluations
LangWatch offers an extensive library of evaluators to help you evaluate the quality and guarantee the safety of your LLM apps. They are easy to set up from the LangWatch dashboard.
Evaluators List
| Evaluator | Description |
|---|---|
| Azure Jailbreak Detection | This evaluator checks for jailbreak attempts in the input using Azure’s Content Safety API. |
| Azure Content Safety | This evaluator detects potentially unsafe content in text, including hate speech, self-harm, sexual content, and violence. It allows customization of the severity threshold and the specific categories to check. |
| Google Cloud DLP PII Detection | This evaluator uses Google DLP to detect personally identifiable information in text, including phone numbers, email addresses, and social security numbers. It allows customization of the detection threshold and the specific types of PII to check. |
| Llama Guard | This evaluator is a special version of Llama trained strictly to act as a guardrail, following customizable guidelines. It can work both as a safety evaluator and for policy enforcement. |
| OpenAI Moderation | This evaluator uses OpenAI’s moderation API to detect potentially harmful content in text, including harassment, hate speech, self-harm, sexual content, and violence. |
| Evaluator | Description |
|---|---|
| Competitor LLM Check | This evaluator uses an LLM-as-a-judge to check whether the conversation is related to competitors, without having to name them explicitly. |
| Off Topic Evaluator | This evaluator checks whether the user message concerns one of the allowed topics of the chatbot. |
| Competitor Blocklist | This evaluator checks whether any of the specified competitors is mentioned. |
| Product Sentiment Polarity | For messages about products, this evaluator checks the nuanced sentiment direction of the LLM output: very positive, subtly positive, subtly negative, or very negative. |
| Evaluator | Description |
|---|---|
| Lingua Language Detection | This evaluator detects the language of the input and output text, for example to check whether the generated answer is in the same language as the prompt, or in a specific expected language. |
| Query Resolution | This evaluator checks whether all the user queries in the conversation were resolved. Useful to detect when the bot doesn’t know how to answer or can’t help the user. |
| Ragas Context Recall | This evaluator measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. Higher values indicate better performance. |
| Ragas Faithfulness | This evaluator assesses the extent to which the generated answer is consistent with the provided context. Higher scores indicate better faithfulness to the context. |
| Ragas Context Utilization | This metric evaluates whether all of the answer-relevant items present in the contexts are ranked higher or not. Higher scores indicate better utilization. |
| Ragas Context Relevancy | This metric gauges the relevancy of the retrieved context, calculated based on both the question and the contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy. |
| Ragas Context Precision | This metric evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Higher scores indicate better precision. |
| Ragas Answer Relevancy | This evaluator focuses on assessing how pertinent the generated answer is to the given prompt. Higher scores indicate better relevancy. |
| Evaluator | Description |
|---|---|
| Semantic Similarity Evaluator | Allows you to check for semantic similarity or dissimilarity between the input or output and a target value, so you can avoid sentences that you don’t want to be present without having to match the exact text. |
| Custom Basic Evaluator | Allows you to check for simple text matches or regex evaluation. |
| Custom LLM Boolean Evaluator | Use an LLM as a judge with a custom prompt to do a true/false boolean evaluation of the message. |
| Custom LLM Score Evaluator | Use an LLM as a judge with a custom prompt to do a numeric score evaluation of the message. |
Custom Evaluator Integration
If you have a custom evaluator built in-house, you can follow the guide below to integrate it and record its results alongside the built-in evaluators. As a rough illustration, see the sketch after this paragraph.
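The sketch below is a minimal example of what that integration can look like with the Python SDK: it runs an in-house check and attaches the result to the current trace. It assumes the `langwatch.trace()` decorator and an `add_evaluation` helper on the current trace, and the `contains_pii` check is a hypothetical placeholder; refer to the integration guide for the exact method names and parameters.

```python
import langwatch  # assumes LANGWATCH_API_KEY is set in the environment


def contains_pii(text: str) -> bool:
    # Placeholder for your in-house evaluator logic (regexes, an internal
    # service call, etc.); hardcoded here only to keep the sketch runnable.
    return "@" in text


@langwatch.trace()
def handle_message(user_input: str) -> str:
    # Your LLM call would go here; a fixed reply keeps the example self-contained.
    reply = "Thanks, we will get back to you shortly."

    # Report the in-house evaluation result on the current trace so it shows up
    # next to the built-in evaluators. The method name and parameters here are
    # assumptions for this sketch; check the guide for the actual API.
    langwatch.get_current_trace().add_evaluation(
        name="in-house PII check",
        passed=not contains_pii(reply),
    )

    return reply


handle_message("Can you email me at jane@example.com?")
```

The key idea is that your evaluator runs wherever you like (in-process, as a separate service, or in a batch job) and only the resulting verdict or score is sent to LangWatch, attached to the trace it evaluated.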