Importing the Library

Creating Evaluation Dataset

Run Evaluations

Access the Results

Batch Evaluation

This evaluator detects the language of the input and output text to check, for example, whether the generated answer is in the same language as the prompt
or in a specific expected language.


__Docs:__ https://github.com/pemistahl/lingua-py

Language Detection

This evaluator uses OpenAI's moderation API to detect potentially harmful content in text,
including harassment, hate speech, self-harm, sexual content, and violence.


__Env vars:__ OPENAI_API_KEY

__Docs:__ https://platform.openai.com/docs/guides/moderation/overview

OpenAI Moderation

Google DLP PII detects personally identifiable information in text, including phone numbers, email addresses, and
social security numbers. It allows customization of the detection threshold and the specific types of PII to check.


__Env vars:__ GOOGLE_APPLICATION_CREDENTIALS

__Docs:__ https://cloud.google.com/sensitive-data-protection/docs/apis

PII Detection

This evaluator detects potentially unsafe content in text, including hate speech,
self-harm, sexual content, and violence. It allows customization of the severity
threshold and the specific categories to check.


__Env vars:__ AZURE_CONTENT_SAFETY_ENDPOINT, AZURE_CONTENT_SAFETY_KEY

__Docs:__ https://learn.microsoft.com/en-us/azure/ai-services/content-safety/quickstart-text

Content Safety

This evaluator checks for jailbreak attempts in the input using Azure's Content Safety API.


__Env vars:__ AZURE_CONTENT_SAFETY_ENDPOINT, AZURE_CONTENT_SAFETY_KEY

Jailbreak Detection

This evaluator checks for prompt injection attempts in the input and the contexts using Azure's Content Safety API.


__Env vars:__ AZURE_CONTENT_SAFETY_ENDPOINT, AZURE_CONTENT_SAFETY_KEY

__Docs:__ https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection

Prompt Injection Detection

This evaluator uses an LLM-as-judge to check whether the conversation is related to competitors, without having to name them explicitly.


__Env vars:__ OPENAI_API_KEY, AZURE_API_KEY, AZURE_API_BASE

Competitor Detection with LLM

This evaluator checks whether any of the specified competitors was mentioned.
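A blocklist check of this kind can be sketched with plain string matching. The snippet below is a minimal illustration, not the LangEvals implementation: the function name `competitor_blocklist`, its arguments, and the whole-word, case-insensitive matching are assumptions. Only the `Competitors mentioned: ...` detail format is taken from the results shown in this document.

```python
import re


def competitor_blocklist(text, competitors):
    """Return (passed, details) for a hypothetical blocklist check.

    passed is True when no competitor name appears in the text.
    """
    mentioned = [
        name
        for name in competitors
        # Assumption: whole-word, case-insensitive match on each name.
        if re.search(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE)
    ]
    passed = not mentioned
    details = f"Competitors mentioned: {', '.join(mentioned)}" if mentioned else None
    return passed, details


clean = competitor_blocklist("hi", ["Bob"])
blocked = competitor_blocklist("My name is Bob", ["Bob"])
```

Here `clean` passes with no details, while `blocked` fails and names the competitor that was found.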

This evaluator checks whether the user message concerns one of the allowed topics of the chatbot.


__Env vars:__ OPENAI_API_KEY, AZURE_API_KEY, AZURE_API_BASE

Off-Topic Detection

This evaluator focuses on assessing how pertinent the generated answer is to the given prompt. Higher scores indicate better relevancy.


__Docs:__ https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html

Answer Relevancy

This metric evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Higher scores indicate better precision.


__Docs:__ https://docs.ragas.io/en/latest/concepts/metrics/context_precision.html

Context Precision
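To make "ranked higher" concrete, the sketch below averages precision@k over the ranks that hold a relevant context, which is the idea the Ragas documentation describes for this metric. It assumes the relevance of each retrieved context has already been decided (Ragas uses an LLM judge for that step); the function name and the boolean-list input are illustrative, not part of Ragas or LangEvals.

```python
def context_precision(relevance_flags):
    """Average precision@k over the ranks k that contain a relevant context.

    relevance_flags: list of booleans in retrieval order (rank 1 first).
    """
    hits = 0
    weighted = 0.0
    for k, relevant in enumerate(relevance_flags, start=1):
        if relevant:
            hits += 1
            weighted += hits / k  # precision@k at this relevant rank
    return weighted / hits if hits else 0.0


# Same two relevant contexts, ranked first versus ranked last.
good_ranking = context_precision([True, True, False])
poor_ranking = context_precision([False, True, True])
```

With the relevant contexts ranked first the score is 1.0; pushing them below an irrelevant one lowers it, even though the same contexts were retrieved.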

This evaluator measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. Higher values indicate better performance.


__Docs:__ https://docs.ragas.io/en/latest/concepts/metrics/context_recall.html

Context Recall

This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy.


__Docs:__ https://docs.ragas.io/en/latest/concepts/metrics/context_relevancy.html

Context Relevancy

This metric evaluates whether all of the output relevant items present in the contexts are ranked higher or not. Higher scores indicate better utilization.


__Docs:__ https://docs.ragas.io/en/latest/concepts/metrics/context_precision.html

Context Utilization

This evaluator assesses the extent to which the generated answer is consistent with the provided context. Higher scores indicate better faithfulness to the context, useful for detecting hallucinations.


__Docs:__ https://docs.ragas.io/en/latest/concepts/metrics/faithfulness.html

Faithfulness

This evaluator assesses the extent to which the generated answer is consistent with the provided context. Higher scores indicate better faithfulness to the context, useful for detecting hallucinations.


__Docs:__ https://docs.haystack.deepset.ai/docs/faithfulnessevaluator

Haystack Faithfulness

Use an LLM as a judge with a custom prompt to do a true/false boolean evaluation of the message.

LLM Boolean Evaluator
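The LLM-as-judge pattern behind a boolean evaluator can be sketched as follows. This is an assumption-laden illustration rather than the LangEvals code: `llm_boolean_evaluate`, the prompt layout, and the injected `judge` callable are all hypothetical, and `stub_judge` stands in for a real model call so the example runs offline.

```python
from typing import Callable


def llm_boolean_evaluate(message: str, prompt: str, judge: Callable[[str], str]) -> bool:
    """Ask a judge model a yes/no question about a message and parse the reply."""
    full_prompt = (
        f"{prompt}\n\nMessage:\n{message}\n\n"
        "Answer with exactly 'true' or 'false'."
    )
    reply = judge(full_prompt).strip().lower()
    if reply.startswith("true"):
        return True
    if reply.startswith("false"):
        return False
    # Refuse to guess when the judge does not follow the output format.
    raise ValueError(f"Judge returned a non-boolean reply: {reply!r}")


def stub_judge(full_prompt: str) -> str:
    # Stand-in for an LLM call: looks only at the message part of the prompt.
    message_part = full_prompt.split("Message:\n", 1)[1]
    return "True" if "refund" in message_part.lower() else "False"


asks_refund = llm_boolean_evaluate("I want a refund", "Is the user asking about refunds?", stub_judge)
small_talk = llm_boolean_evaluate("Hello there", "Is the user asking about refunds?", stub_judge)
```

Passing the judge in as a callable keeps the parsing logic testable without network access; a real judge would wrap whichever LLM client is in use.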

Use an LLM as a judge with a custom prompt to do a numeric score evaluation of the message.

LLM Score Evaluator

Allows you to check for simple text matches or regex evaluation.

LLM Basic Evaluator
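A text-match or regex check of this kind needs nothing beyond the standard library. The rule names (`contains`, `not_contains`, `matches_regex`, `not_matches_regex`) and the rule dictionary shape below are assumptions made for illustration; they are not confirmed settings of the evaluator.

```python
import re


def basic_evaluate(text, rules):
    """Return True only if the text satisfies every rule in the list."""
    for rule in rules:
        kind, value = rule["rule"], rule["value"]
        if kind == "contains":
            ok = value in text
        elif kind == "not_contains":
            ok = value not in text
        elif kind == "matches_regex":
            ok = re.search(value, text) is not None
        elif kind == "not_matches_regex":
            ok = re.search(value, text) is None
        else:
            raise ValueError(f"Unknown rule type: {kind}")
        if not ok:
            return False
    return True


# Hypothetical rules: forbid a stock phrase, require an ISO-style date.
rules = [
    {"rule": "not_contains", "value": "as an AI"},
    {"rule": "matches_regex", "value": r"\d{4}-\d{2}-\d{2}"},
]
accepted = basic_evaluate("Your order ships on 2024-05-01", rules)
rejected = basic_evaluate("Sorry, as an AI I cannot say. Try 2024-05-01", rules)
```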

Allows you to check for semantic similarity or dissimilarity between input and output and a
target value, so you can avoid sentences that you don't want to be present without having to
match on the exact text.


__Env vars:__ OPENAI_API_KEY, AZURE_API_KEY, AZURE_API_BASE

LLM Similarity Evaluator
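Semantic similarity checks are commonly built on cosine similarity between embedding vectors, and the sketch below assumes that approach. The vectors are toy values standing in for real embeddings (which would come from an embedding model), and the function names, rule names, and default threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_check(text_vec, target_vec, threshold=0.3, rule="is_not_similar_to"):
    """Return (passed, score) for a similarity or dissimilarity rule."""
    score = cosine_similarity(text_vec, target_vec)
    if rule == "is_similar_to":
        return score >= threshold, score
    # Default: pass only when the text stays below the similarity threshold.
    return score < threshold, score


# Toy vectors: identical direction versus orthogonal direction.
too_close = similarity_check([1.0, 0.0], [1.0, 0.0])
far_enough = similarity_check([1.0, 0.0], [0.0, 1.0])
```

Because the comparison is on vectors rather than strings, a paraphrase of an unwanted sentence can still be caught, which is the point of not matching on the exact text.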

This evaluator is a special version of Llama trained strictly
for acting as a guardrail, following customizable guidelines.
It can work both as a safety evaluator and as policy enforcement.


__Env vars:__ CLOUDFLARE_ACCOUNT_ID, CLOUDFLARE_API_KEY

__Docs:__ https://huggingface.co/meta-llama/LlamaGuard-7b

| input | output | answer_relevancy | competitor_blocklist | competitor_blocklist_details |
| --- | --- | --- | --- | --- |
| hello | hi | 0.800714 | True | None |
| how are you? | I am a chatbot, no feelings | 0.813168 | True | None |
| what is your name? | My name is Bob | 0.971663 | False | Competitors mentioned: Bob |
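Once results are in tabular form, they can be filtered and aggregated with ordinary tooling. The sketch below re-enters the three rows shown above as tab-separated text and uses only the standard library; how the results object is actually exported from the library is not covered here, so the `RESULTS_TSV` string is a stand-in for that step.

```python
import csv
import io

# The rows from the results table, re-entered as tab-separated text.
ROWS = [
    ("input", "output", "answer_relevancy", "competitor_blocklist", "competitor_blocklist_details"),
    ("hello", "hi", "0.800714", "True", "None"),
    ("how are you?", "I am a chatbot, no feelings", "0.813168", "True", "None"),
    ("what is your name?", "My name is Bob", "0.971663", "False", "Competitors mentioned: Bob"),
]
RESULTS_TSV = "\n".join("\t".join(row) for row in ROWS)

rows = list(csv.DictReader(io.StringIO(RESULTS_TSV), delimiter="\t"))

# Entries that failed the competitor blocklist check.
failed = [r for r in rows if r["competitor_blocklist"] == "False"]
failed_inputs = [r["input"] for r in failed]

# Average answer relevancy across all entries.
mean_relevancy = sum(float(r["answer_relevancy"]) for r in rows) / len(rows)
```

Here `failed_inputs` holds the single entry that mentioned a blocked competitor, and `mean_relevancy` summarizes the relevancy column for the whole dataset.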
