Capturing Evaluations & Guardrails
Learn how to log custom evaluations, trigger managed evaluations, and implement guardrails with LangWatch.
LangWatch provides a flexible system for capturing various types of evaluations and implementing guardrails within your LLM applications. This allows you to track performance, ensure quality, and control application flow based on defined criteria.
There are three main ways to work with evaluations and guardrails:
- Client-Side Custom Evaluations (add_evaluation): Log any custom evaluation metric, human feedback, or external system score directly from your Python code. These are primarily for observational purposes.
- Server-Side Managed Evaluations (evaluate, async_evaluate): Trigger predefined or custom evaluation logic that runs on the LangWatch backend. These can return scores, pass/fail results, and other details.
- Guardrails: A special application of evaluations (either client-side or server-side) used to make decisions or enforce policies within your application flow.
1. Client-Side Custom Evaluations (add_evaluation)
You can log custom evaluation data directly from your application code using the add_evaluation() method on a LangWatchSpan or LangWatchTrace object. This is useful for recording metrics specific to your domain, results from external systems, or human feedback.
When you call add_evaluation(), LangWatch typically creates a new child span of type evaluation (or guardrail if is_guardrail=True) under the target span. This child span, named after your custom evaluation, stores its details, primarily in its output attribute.
Here’s an example:
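The sketch below assumes the current span is retrieved with langwatch.get_current_span() inside a function decorated with @langwatch.trace(), and the keyword arguments mirror the fields described in this guide (name, passed, score, label, details); the politeness heuristic and the LLM call are placeholders, and you should check the API reference below for the exact add_evaluation() signature.

```python
import langwatch


def my_llm_call(message: str) -> str:
    # Placeholder for your actual LLM call.
    return f"Here is a helpful answer to: {message}"


@langwatch.trace()
def handle_message(user_message: str) -> str:
    response = my_llm_call(user_message)

    # Naive, domain-specific heuristic: a keyword-based politeness check.
    is_polite = not any(word in response.lower() for word in ["stupid", "shut up"])

    # Log the result as a client-side evaluation on the current span.
    langwatch.get_current_span().add_evaluation(
        name="politeness_check",
        passed=is_polite,
        score=1.0 if is_polite else 0.0,
        label="polite" if is_polite else "impolite",
        details="Keyword-based politeness heuristic.",
    )

    return response
```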
add_evaluation() Parameters
The add_evaluation() method is available on both LangWatchSpan and LangWatchTrace objects (when using it on a trace, you must specify the target span). For detailed parameter descriptions, please refer to the API reference.
2. Server-Side Managed Evaluations (evaluate & async_evaluate)
LangWatch allows you to trigger evaluations that are performed by the LangWatch backend. These can be built-in evaluators (e.g., for faithfulness, relevance) or custom evaluators you define in your LangWatch project settings.
You use the evaluate() (synchronous) or async_evaluate() (asynchronous) functions for this. These functions send the necessary data to the LangWatch API, which then processes the evaluation. These server-side evaluations are a core part of setting up real-time monitoring and evaluations in production.
evaluate() / async_evaluate() Key Parameters
The evaluate() and async_evaluate() methods are available on both LangWatchSpan and LangWatchTrace objects. They can also be imported from langwatch.evaluations and called as langwatch.evaluate() or langwatch.async_evaluate(), where you would then explicitly pass the span or trace argument. For detailed parameter descriptions, refer to the API reference:
- LangWatchSpan.evaluate() and LangWatchSpan.async_evaluate()
- LangWatchTrace.evaluate() and LangWatchTrace.async_evaluate()
Understanding the data Parameter:
The core parameters like slug, data, settings, as_guardrail, span, and trace are generally consistent across these call styles.
For the data parameter specifically: while BasicEvaluateData is commonly used to provide a standardized structure for input, output, and contexts (which many built-in or common evaluators expect), it’s important to know that data can be any dictionary. This flexibility allows you to pass arbitrary data structures tailored to custom server-side evaluators you might define. Using BasicEvaluateData with fields like expected_output is particularly useful when evaluating if the LLM is generating the right answers against a set of expected outputs. For scenarios where a golden answer isn’t available, LangWatch also supports more open-ended evaluations, such as using an LLM-as-a-judge.
The slug parameter refers to the unique identifier of the evaluator configured in your LangWatch project settings. You can find a list of available evaluator types and learn how to configure them in our LLM Evaluation documentation.
The functions return an EvaluationResultModel containing status, passed, score, details, label, and cost.
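As a sketch of how these pieces fit together: the evaluator slug "my-faithfulness-check" below is a placeholder for whichever evaluator you have configured, the import path for BasicEvaluateData is assumed, and the retrieval and generation functions are stubs standing in for your own pipeline; check the API reference above for the exact signatures.

```python
import langwatch
from langwatch.evaluations import BasicEvaluateData  # import path assumed; check the API reference


def retrieve_documents(question: str) -> list[str]:
    # Placeholder for your retrieval step.
    return ["LangWatch records evaluations as spans on the trace."]


def generate_answer(question: str, contexts: list[str]) -> str:
    # Placeholder for your LLM call.
    return "Evaluations show up as spans in LangWatch."


@langwatch.trace()
def answer_question(question: str) -> str:
    contexts = retrieve_documents(question)
    answer = generate_answer(question, contexts)

    # Trigger a server-side evaluator configured in your LangWatch project.
    # "my-faithfulness-check" is a placeholder slug.
    result = langwatch.get_current_span().evaluate(
        slug="my-faithfulness-check",
        data=BasicEvaluateData(
            input=question,
            output=answer,
            contexts=contexts,
        ),
    )

    # EvaluationResultModel fields described above.
    print(result.status, result.passed, result.score, result.details)

    return answer
```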
3. Guardrails
Guardrails are evaluations used to make decisions or enforce policies within your application. They typically result in a boolean passed status that your code can act upon.
Using Server-Side Evaluations as Guardrails:
Set as_guardrail=True when calling evaluate or async_evaluate.
A key behavior of as_guardrail=True for server-side evaluations is that if the evaluation process itself encounters an error (e.g., the evaluator service is down), the result will have status="error" but passed will default to True. This is a fail-safe to prevent your application from breaking due to an issue in the guardrail execution itself, assuming a “pass by default on error” stance is desired. For more on setting up safety-focused real-time evaluations like PII detection or prompt injection monitors, see our guide on Setting up Real-Time Evaluations.
Using Client-Side add_evaluation as Guardrails:
Set is_guardrail=True when calling add_evaluation.
For client-side guardrails added with add_evaluation, your code is fully responsible for interpreting the passed status and handling any errors during the local check.
How Evaluations and Guardrails Appear in LangWatch
Both client-side and server-side evaluations (including those marked as guardrails) are logged as spans in LangWatch.
- add_evaluation: Creates a child span of type evaluation (or guardrail if is_guardrail=True).
- evaluate / async_evaluate: Also create a child span of type evaluation (or guardrail if as_guardrail=True).
These spans will contain the evaluation’s name, result (score, passed, label), details, cost, and any associated metadata, typically within their output attribute. This allows you to:
- See a history of all evaluation outcomes.
- Filter traces by evaluation results.
- Analyze the performance of different evaluators or guardrails.
- Correlate evaluation outcomes with other trace data (e.g., LLM inputs/outputs, latencies).
Use Cases
- Quality Assurance:
  - Client-Side: Log scores from a custom heuristic checking for politeness in responses.
  - Server-Side: Trigger a managed “Toxicity” evaluator on LLM outputs, or use more open-ended approaches like an LLM-as-a-judge for tasks without predefined correct answers.
- Compliance & Safety:
  - Client-Side Guardrail: Perform a regex check for forbidden words and log it with is_guardrail=True.
  - Server-Side Guardrail: Use a managed “PII Detection” evaluator with as_guardrail=True to decide if a response can be shown.
- Performance Monitoring:
  - Client-Side: Log human feedback scores (add_evaluation) for helpfulness.
  - Server-Side: Evaluate RAG system outputs for “Context Relevancy” and “Faithfulness” using managed evaluators.
- A/B Testing: Log custom metrics or trigger standard evaluations for different model versions or prompts to compare their performance.
- Feedback Integration: add_evaluation can be used to pipe scores from an external human review platform directly into the relevant trace.
By combining these methods, you can build a robust evaluation and guardrailing strategy tailored to your application’s needs, all observable within LangWatch.