LangWatch offers an extensive library of evaluators to help you evaluate the quality of your LLM apps and guarantee their safety. This page is a reference list; to get the execution code, use the Evaluation Wizard on the LangWatch platform.

Evaluators List

Expected Answer Evaluation

For when you have the golden answer and want to measure how correctly the LLM answers it
| Evaluator | Description |
| --- | --- |
| Exact Match Evaluator | A simple evaluator that checks if the output matches the expected_output exactly. |
| LLM Answer Match | Uses an LLM to check if the generated output answers a question correctly in the same way as the expected output, even if their style is different. |
| BLEU Score | Traditional NLP metric. BLEU score for evaluating the similarity between two strings. |
| LLM Factual Match | Computes with an LLM how factually similar the generated answer is to the expected output. |
| ROUGE Score | Traditional NLP metric. ROUGE score for evaluating the similarity between two strings. |
| SQL Query Equivalence | Checks if the SQL query is equivalent to a reference one by using an LLM to infer whether it would generate the same results given the table schemas. |
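
For intuition on what the simplest of these checks compute, here is a minimal sketch of an exact-match comparison and a BLEU score, using the `sacrebleu` package locally. This is not the LangWatch execution code (the Evaluation Wizard generates that); it only approximates the underlying metrics.

```python
# Sketch of the ideas behind the Exact Match and BLEU Score evaluators.
# Not LangWatch's implementation; just the underlying comparisons, for intuition.
import sacrebleu  # pip install sacrebleu

def exact_match(output: str, expected_output: str) -> bool:
    """Passes only if the output matches the expected output exactly."""
    return output == expected_output

def bleu_score(output: str, expected_output: str) -> float:
    """BLEU similarity between output and expected output, normalized to 0-1."""
    return sacrebleu.sentence_bleu(output, [expected_output]).score / 100.0

print(exact_match("Paris", "Paris"))                                # True
print(bleu_score("The capital is Paris", "Paris is the capital"))   # partial overlap, between 0 and 1
```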

LLM-as-Judge

For when you don’t have a golden answer, but have a set of rules for another LLM to evaluate quality
| Evaluator | Description |
| --- | --- |
| LLM-as-a-Judge Boolean Evaluator | Use an LLM as a judge with a custom prompt to do a true/false boolean evaluation of the message. |
| LLM-as-a-Judge Category Evaluator | Use an LLM as a judge with a custom prompt to classify the message into custom defined categories. |
| LLM-as-a-Judge Score Evaluator | Use an LLM as a judge with a custom prompt to do a numeric score evaluation of the message. |
| Rubrics Based Scoring | Rubric-based evaluation metric used to evaluate responses. The rubric consists of descriptions for each score, typically ranging from 1 to 5. |
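
To give a feel for how an LLM-as-a-Judge boolean evaluation works, below is a minimal sketch calling the OpenAI API directly. The judge prompt, model name, and politeness criterion are illustrative assumptions; on LangWatch you configure the evaluator with your own custom prompt instead.

```python
# Illustrative sketch of an LLM-as-a-Judge boolean evaluation (not LangWatch's internal code).
# Assumes OPENAI_API_KEY is set; the judge prompt and model are placeholders.
from openai import OpenAI

client = OpenAI()

def judge_politeness(output: str) -> bool:
    """Asks a judge LLM a yes/no question about the message and returns it as a boolean."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model
        messages=[
            {"role": "system", "content": "You are an evaluator. Answer strictly with 'true' or 'false'."},
            {"role": "user", "content": f"Is the following reply polite and professional?\n\n{output}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("true")

print(judge_politeness("Sure thing, happy to help you with that!"))
```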

RAG Quality

For measuring the quality of your RAG: check for hallucinations with faithfulness, and for retrieval quality with precision and recall
| Evaluator | Description |
| --- | --- |
| Ragas Context Precision | This metric evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Higher scores indicate better precision. |
| Ragas Context Recall | This evaluator measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. Higher values indicate better performance. |
| Ragas Faithfulness | This evaluator assesses the extent to which the generated answer is consistent with the provided context. Higher scores indicate better faithfulness to the context, useful for detecting hallucinations. |
| Context F1 | Balances precision and recall for context retrieval; increasing it means a better signal-to-noise ratio. Uses traditional string distance metrics. |
| Context Precision | Measures how accurate the retrieval is compared to expected contexts; increasing it means less noise in the retrieval. Uses traditional string distance metrics. |
| Context Recall | Measures how many relevant contexts were retrieved compared to expected contexts; increasing it means more signal in the retrieval. Uses traditional string distance metrics. |
| Ragas Response Context Precision | Uses an LLM to measure the proportion of chunks in the retrieved context that were relevant to generate the output or the expected output. |
| Ragas Response Context Recall | Uses an LLM to measure how many of the relevant documents, attributable to the claims in the output, were successfully retrieved in order to generate the expected output. |
| Ragas Response Relevancy | Evaluates how pertinent the generated answer is to the given prompt. Higher scores indicate better relevancy. |
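
The traditional (non-LLM) Context Precision, Context Recall, and Context F1 metrics above are essentially set comparisons between retrieved and expected contexts. The sketch below uses exact string matching for clarity; the actual evaluators use string distance metrics, so treat this as a simplified assumption.

```python
# Simplified sketch of Context Precision / Recall / F1 over retrieved vs. expected contexts.
# Exact matches for clarity; the real evaluators rely on string distance metrics.

def context_precision(retrieved: list[str], expected: list[str]) -> float:
    """Fraction of retrieved contexts that were actually expected (less noise = higher)."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in expected) / len(retrieved)

def context_recall(retrieved: list[str], expected: list[str]) -> float:
    """Fraction of expected contexts that were actually retrieved (more signal = higher)."""
    if not expected:
        return 0.0
    return sum(1 for c in expected if c in retrieved) / len(expected)

def context_f1(retrieved: list[str], expected: list[str]) -> float:
    """Harmonic mean of context precision and context recall."""
    p, r = context_precision(retrieved, expected), context_recall(retrieved, expected)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

retrieved = ["doc_a", "doc_b", "doc_x"]
expected = ["doc_a", "doc_b", "doc_c"]
print(context_precision(retrieved, expected))  # ~0.67 — one retrieved chunk was noise
print(context_recall(retrieved, expected))     # ~0.67 — one expected chunk was missed
print(context_f1(retrieved, expected))         # ~0.67
```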

Quality Aspects Evaluation

For when you want to check the language, structure, style and other general quality metrics
| Evaluator | Description |
| --- | --- |
| Valid Format Evaluator | Allows you to check if the output is valid JSON, Markdown, Python, SQL, etc. For JSON, it can optionally validate against a provided schema. |
| Lingua Language Detection | This evaluator detects the language of the input and output text, to check, for example, if the generated answer is in the same language as the prompt, or if it is in a specific expected language. |
| Summarization Score | Measures how well the summary captures important information from the retrieved contexts. |
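
For intuition on the Valid Format Evaluator's JSON mode, here is a minimal sketch that parses the output as JSON and optionally validates it against a schema with the `jsonschema` package; the schema shown is just an example, not a LangWatch default.

```python
# Sketch of a "valid JSON, optionally matching a schema" check (illustrative, not LangWatch's code).
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

def is_valid_json(output: str, schema: dict | None = None) -> bool:
    """True if the output parses as JSON and, when a schema is given, conforms to it."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if schema is not None:
        try:
            validate(instance=data, schema=schema)
        except ValidationError:
            return False
    return True

example_schema = {"type": "object", "required": ["answer"], "properties": {"answer": {"type": "string"}}}
print(is_valid_json('{"answer": "42"}', example_schema))  # True
print(is_valid_json('{"answer": 42}', example_schema))    # False — wrong type for "answer"
```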

Safety

Check for PII, prompt injection attempts and toxic content
| Evaluator | Description |
| --- | --- |
| Llama Guard | This evaluator is a special version of Llama trained strictly to act as a guardrail, following customizable guidelines. It can work both as a safety evaluator and as policy enforcement. |
| Azure Content Safety | This evaluator detects potentially unsafe content in text, including hate speech, self-harm, sexual content, and violence. It allows customization of the severity threshold and the specific categories to check. |
| Azure Jailbreak Detection | This evaluator checks for jailbreak attempts in the input using Azure's Content Safety API. |
| Azure Prompt Shield | This evaluator checks for prompt injection attempts in the input and the contexts using Azure's Content Safety API. |
| OpenAI Moderation | This evaluator uses OpenAI's moderation API to detect potentially harmful content in text, including harassment, hate speech, self-harm, sexual content, and violence. |
| Presidio PII Detection | Detects personally identifiable information in text, including phone numbers, email addresses, and social security numbers. It allows customization of the detection threshold and the specific types of PII to check. |
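
As an example of what the Presidio PII Detection evaluator checks for, the sketch below runs Presidio's analyzer directly (it needs the `presidio-analyzer` package and a spaCy model such as `en_core_web_lg`). The entity types and score threshold are illustrative choices; the hosted evaluator exposes equivalent settings.

```python
# Illustrative use of Presidio to flag PII, similar in spirit to the Presidio PII Detection evaluator.
from presidio_analyzer import AnalyzerEngine  # pip install presidio-analyzer

analyzer = AnalyzerEngine()

def contains_pii(text: str, threshold: float = 0.5) -> bool:
    """Flags the text if a phone number, email, or SSN is detected above the confidence threshold."""
    findings = analyzer.analyze(
        text=text,
        entities=["PHONE_NUMBER", "EMAIL_ADDRESS", "US_SSN"],
        language="en",
    )
    return any(f.score >= threshold for f in findings)

print(contains_pii("You can reach me at jane.doe@example.com"))  # True
print(contains_pii("The weather is nice today"))                 # False
```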

Other

Miscellaneous evaluators
| Evaluator | Description |
| --- | --- |
| Custom Basic Evaluator | Allows you to check for simple text matches or regex evaluation. |
| Competitor Blocklist | This evaluator checks if any of the specified competitors were mentioned. |
| Competitor Allowlist Check | This evaluator uses an LLM-as-judge to check if the conversation is related to competitors, without having to name them explicitly. |
| Competitor LLM Check | This evaluator implements LLM-as-a-judge with a function call approach to check if the message contains a mention of a competitor. |
| Off Topic Evaluator | This evaluator checks if the user message concerns one of the allowed topics of the chatbot. |
| Query Resolution | This evaluator checks if all the user queries in the conversation were resolved. Useful to detect when the bot doesn't know how to answer or can't help the user. |
| Semantic Similarity Evaluator | Allows you to check for semantic similarity or dissimilarity between the input or output and a target value, so you can avoid sentences you don't want to be present without having to match the exact text. |
| Ragas Answer Correctness | Computes with an LLM a weighted combination of factual and semantic similarity between the generated answer and the expected output. |
| Ragas Answer Relevancy | Evaluates how pertinent the generated answer is to the given prompt. Higher scores indicate better relevancy. |
| Ragas Context Relevancy | This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy. |
| Ragas Context Utilization | This metric evaluates whether all of the output-relevant items present in the contexts are ranked higher or not. Higher scores indicate better utilization. |
| Example Evaluator | This evaluator serves as a boilerplate for creating new evaluators. |
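
The Custom Basic Evaluator and Competitor Blocklist boil down to text and regex matching. A minimal sketch, assuming a hypothetical blocklist of competitor names:

```python
# Sketch of a regex-based blocklist check, in the spirit of the Custom Basic Evaluator
# and Competitor Blocklist. The competitor names below are hypothetical placeholders.
import re

BLOCKLIST = ["AcmeAI", "ExampleCorp"]  # hypothetical competitor names

def mentions_competitor(output: str) -> bool:
    """Fails the check if any blocklisted name appears as a whole word, case-insensitively."""
    pattern = r"\b(" + "|".join(re.escape(name) for name in BLOCKLIST) + r")\b"
    return re.search(pattern, output, flags=re.IGNORECASE) is not None

print(mentions_competitor("Our product is cheaper than AcmeAI."))   # True — blocked
print(mentions_competitor("Our product fits your use case well."))  # False — passes
```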

Running Evaluations

Set up your first evaluation using the Evaluation Wizard.

Instrumenting Custom Evaluator

If you have a custom evaluator built in-house, you can follow the guide below to integrate it.
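
As a rough sketch of what a custom in-house evaluator usually looks like before it is wired into LangWatch, the function below takes the LLM interaction and returns a pass/fail status plus a score and details. The result shape and field names here are assumptions for illustration; follow the integration guide for the exact contract LangWatch expects.

```python
# Hypothetical shape of an in-house evaluator: takes the LLM interaction, returns a structured result.
# Field names are illustrative; see the integration guide for the actual expected format.
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    passed: bool
    score: float
    details: str

def my_custom_evaluator(input: str, output: str, expected_output: str | None = None) -> EvaluationResult:
    """Example rule: the answer must be non-empty and under 500 characters."""
    ok = bool(output.strip()) and len(output) <= 500
    return EvaluationResult(
        passed=ok,
        score=1.0 if ok else 0.0,
        details="within length limits" if ok else "empty or too long",
    )

print(my_custom_evaluator("What is LangWatch?", "An LLM ops platform for monitoring and evaluations."))
```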