Learn how to log custom evaluations, trigger managed evaluations, and implement guardrails with LangWatch.
There are two complementary approaches:

- Client-side custom evaluations (`add_evaluation`): Log any custom evaluation metric, human feedback, or external system score directly from your Python code. These are primarily for observational purposes.
- Server-side evaluations (`evaluate`, `async_evaluate`): Trigger predefined or custom evaluation logic that runs on the LangWatch backend. These can return scores, pass/fail results, and other details.

Client-Side Custom Evaluations (add_evaluation)

Use the `add_evaluation()` method on a `LangWatchSpan` or `LangWatchTrace` object to attach custom evaluation results to your traces. This is useful for recording metrics specific to your domain, results from external systems, or human feedback.
When you call `add_evaluation()`, LangWatch typically creates a new child span of type `evaluation` (or `guardrail` if `is_guardrail=True`) under the target span. This child span, named after your custom evaluation, stores its details, primarily in its `output` attribute.
Here’s an example:
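(A minimal sketch, assuming the current span is retrieved with `langwatch.get_current_span()` and that `add_evaluation()` accepts keyword arguments such as `name`, `passed`, `score`, and `details`; `run_llm` is a placeholder for your own LLM call. Check the API reference below for the exact signature.)

```python
import langwatch

@langwatch.trace()
def answer_question(question: str) -> str:
    answer = run_llm(question)  # run_llm is a placeholder for your own LLM call

    # A simple client-side check computed in your own code
    is_long_enough = len(answer) > 20

    # Attach the result to the current span as a custom evaluation.
    # Keyword names below are illustrative; see the add_evaluation() reference.
    langwatch.get_current_span().add_evaluation(
        name="Answer length check",
        passed=is_long_enough,
        score=float(len(answer)),
        details=f"Answer has {len(answer)} characters.",
    )
    return answer
```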
add_evaluation() Parameters

The `add_evaluation()` method is available on both `LangWatchSpan` and `LangWatchTrace` objects (when calling it on a trace, you must specify the target `span`). For detailed parameter descriptions, please refer to the API reference.
Server-Side Evaluations (evaluate & async_evaluate)

To trigger evaluations that run on the LangWatch backend, use the `evaluate()` (synchronous) or `async_evaluate()` (asynchronous) functions. These functions send the necessary data to the LangWatch API, which then processes the evaluation. Server-side evaluations are a core part of setting up real-time monitoring and evaluations in production.
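For instance, a synchronous call might look like the sketch below. The `"toxicity-check"` slug is a placeholder for an evaluator configured in your project, and `run_llm` stands in for your own LLM call.

```python
import langwatch

@langwatch.trace()
def generate_reply(user_message: str) -> str:
    reply = run_llm(user_message)  # placeholder for your own LLM call

    # Ask the LangWatch backend to run the configured evaluator on this span.
    result = langwatch.get_current_span().evaluate(
        slug="toxicity-check",  # placeholder evaluator slug
        data={"input": user_message, "output": reply},
    )

    # The returned result carries status, passed, score, details, label, and cost.
    print(result.status, result.passed, result.score)
    return reply
```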
evaluate() / async_evaluate() Key Parameters

The `evaluate()` and `async_evaluate()` methods are available on both `LangWatchSpan` and `LangWatchTrace` objects. They can also be imported from `langwatch.evaluations` and called as `langwatch.evaluate()` or `langwatch.async_evaluate()`, in which case you pass the `span` or `trace` argument explicitly. For detailed parameter descriptions, refer to the API reference:
- `LangWatchSpan.evaluate()` and `LangWatchSpan.async_evaluate()`
- `LangWatchTrace.evaluate()` and `LangWatchTrace.async_evaluate()`
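The two call styles look roughly like this (a sketch only; `"my-evaluator"` is a placeholder slug, and the exact signatures should be confirmed in the reference above):

```python
import langwatch

span = langwatch.get_current_span()
payload = {"input": "What is LangWatch?", "output": "An LLM observability platform."}

# Style 1: method on the span (or trace) object
result = span.evaluate(slug="my-evaluator", data=payload)

# Style 2: module-level function, passing the span explicitly
result = langwatch.evaluate(slug="my-evaluator", data=payload, span=span)
```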
The data Parameter:

The core parameters, such as `slug`, `data`, `settings`, `as_guardrail`, `span`, and `trace`, are generally consistent across these call styles.

For the `data` parameter specifically: while `BasicEvaluateData` is commonly used to provide a standardized structure for `input`, `output`, and `contexts` (which many built-in or common evaluators expect), it’s important to know that `data` can be any dictionary. This flexibility allows you to pass arbitrary data structures tailored to custom server-side evaluators you might define. Using `BasicEvaluateData` with fields like `expected_output` is particularly useful when evaluating whether the LLM is generating the right answers against a set of expected outputs. For scenarios where a golden answer isn’t available, LangWatch also supports more open-ended evaluations, such as using an LLM-as-a-judge.

The `slug` parameter refers to the unique identifier of the evaluator configured in your LangWatch project settings. You can find a list of available evaluator types and learn how to configure them in our LLM Evaluation documentation.
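A sketch combining `BasicEvaluateData` with `async_evaluate()` (the import path, the `"exact-match"` slug, and `run_llm_async` are assumptions to adapt to your project and the API reference above):

```python
import langwatch
from langwatch.evaluations import BasicEvaluateData  # import path may differ; see the API reference

@langwatch.trace()
async def answer_with_check(question: str, golden_answer: str) -> str:
    answer = await run_llm_async(question)  # placeholder for your async LLM call

    data = BasicEvaluateData(
        input=question,
        output=answer,
        expected_output=golden_answer,  # lets the evaluator compare against a golden answer
    )

    # "exact-match" is a placeholder slug for an evaluator configured in your project.
    result = await langwatch.get_current_span().async_evaluate(
        slug="exact-match",
        data=data,
    )

    print(result.passed, result.score, result.details)
    return answer
```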
The functions return an `EvaluationResultModel` containing `status`, `passed`, `score`, `details`, `label`, and `cost`.
Guardrails are evaluations that produce a `passed` status your code can act upon, for example to decide whether a response can be shown.
Using Server-Side Evaluations as Guardrails:
Set `as_guardrail=True` when calling `evaluate` or `async_evaluate`.

A key behavior of `as_guardrail=True` for server-side evaluations is that if the evaluation process itself encounters an error (e.g., the evaluator service is down), the result will have `status="error"` but `passed` will default to `True`. This is a fail-safe to prevent your application from breaking due to an issue in the guardrail execution itself, assuming a “pass by default on error” stance is desired. For more on setting up safety-focused real-time evaluations like PII detection or prompt injection monitors, see our guide on Setting up Real-Time Evaluations.
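For example (a sketch; `"prompt-injection-detection"` is a placeholder slug for a guardrail evaluator configured in your project, and `run_llm` stands in for your own LLM call):

```python
import langwatch

@langwatch.trace()
def handle_user_message(user_message: str) -> str:
    # Run the guardrail on the backend before calling the LLM.
    result = langwatch.get_current_span().evaluate(
        slug="prompt-injection-detection",  # placeholder guardrail slug
        data={"input": user_message},
        as_guardrail=True,
    )

    # If the evaluator itself errored, status is "error" but passed defaults
    # to True, so the request is allowed through by design.
    if not result.passed:
        return "Sorry, I can't process that request."

    return run_llm(user_message)  # placeholder for your own LLM call
```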
Using Client-Side add_evaluation as Guardrails:
Set `is_guardrail=True` when calling `add_evaluation`.

With `add_evaluation`, your code is fully responsible for interpreting the `passed` status and handling any errors during the local check.
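A sketch of a local, client-side guardrail (the keyword names passed to `add_evaluation()` are assumptions, and `run_llm` is a placeholder for your own LLM call; see the API reference):

```python
import langwatch

BLOCKED_TERMS = {"password", "credit card"}  # illustrative local rule

@langwatch.trace()
def handle_user_message(user_message: str) -> str:
    # Your code performs the check and decides what "passed" means.
    passed = not any(term in user_message.lower() for term in BLOCKED_TERMS)

    # Record the check as a guardrail span; keyword names are illustrative.
    langwatch.get_current_span().add_evaluation(
        name="Blocked terms check",
        passed=passed,
        is_guardrail=True,
        details="Local keyword screen over the user message.",
    )

    if not passed:
        return "Sorry, I can't help with that."

    return run_llm(user_message)  # placeholder for your own LLM call
```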
Both approaches record their results on your trace:

- `add_evaluation`: creates a child span of type `evaluation` (or `guardrail` if `is_guardrail=True`).
- `evaluate`/`async_evaluate`: also create a child span of type `evaluation` (or `guardrail` if `as_guardrail=True`).

In both cases, the evaluation details are stored in the child span’s `output` attribute, so you can review them alongside the rest of the trace in LangWatch. This allows you to:

- Run a local check and record it with `is_guardrail=True`.
- Use a server-side evaluation with `as_guardrail=True` to decide if a response can be shown.
- Log a custom score (via `add_evaluation`) for helpfulness.
- Pipe scores from an external human review platform directly into the relevant trace with `add_evaluation`.