- Client-Side Custom Evaluations (`add_evaluation`): Log any custom evaluation metric, human feedback, or external system score directly from your Python code. These are primarily for observational purposes.
- Server-Side Managed Evaluations (`evaluate`, `async_evaluate`): Trigger predefined or custom evaluation logic that runs on the LangWatch backend. These can return scores, pass/fail results, and other details.
- Guardrails: A special application of evaluations (either client-side or server-side) used to make decisions or enforce policies within your application flow.
1. Client-Side Custom Evaluations (add_evaluation)
You can log custom evaluation data directly from your application code using the add_evaluation() method on a LangWatchSpan or LangWatchTrace object. This is useful for recording metrics specific to your domain, results from external systems, or human feedback.
When you call add_evaluation(), LangWatch typically creates a new child span of type evaluation (or guardrail if is_guardrail=True) under the target span. This child span, named after your custom evaluation, stores its details, primarily in its output attribute.
Here's an example:
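The snippet below is a minimal sketch, assuming the decorator-based setup from the SDK's getting-started docs; the evaluation name, score, and stand-in LLM call are illustrative placeholders.

```python
import langwatch

langwatch.setup()  # assumes LANGWATCH_API_KEY is set in the environment

@langwatch.trace()
def answer_question(question: str) -> str:
    answer = f"Hello! You asked: {question}"  # stand-in for a real LLM call

    # Log a custom, client-side evaluation on the current span.
    # Parameter names follow the fields described in this guide; check the
    # API reference for the complete list.
    langwatch.get_current_span().add_evaluation(
        name="politeness_heuristic",
        passed=True,
        score=0.92,
        details="Response contained a greeting and no flagged phrases.",
    )
    return answer

print(answer_question("How do I reset my password?"))
```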
add_evaluation() Parameters
The add_evaluation() method is available on both LangWatchSpan and LangWatchTrace objects (when calling it on a trace, you must specify the target span). For detailed parameter descriptions, please refer to the API reference.
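For the trace-level form, here is a hypothetical sketch; the `span` argument name for the target span is an assumption based on the note above, so verify it against the API reference.

```python
import langwatch

@langwatch.trace()
def handle_request(question: str) -> str:
    # Capture the generation in its own span so it can be targeted below.
    with langwatch.span(name="generate") as generation_span:
        answer = f"Hello! You asked: {question}"  # stand-in for a real LLM call

    # Trace-level call: the target span must be specified explicitly
    # (shown here as a `span` argument, which is an assumption).
    langwatch.get_current_trace().add_evaluation(
        name="human_feedback",
        score=0.8,
        span=generation_span,
    )
    return answer
```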
2. Server-Side Managed Evaluations (evaluate & async_evaluate)
LangWatch allows you to trigger evaluations that are performed by the LangWatch backend. These can be built-in evaluators (e.g., for faithfulness, relevance) or custom evaluators you define in your LangWatch project settings.
You use the evaluate() (synchronous) or async_evaluate() (asynchronous) functions for this. These functions send the necessary data to the LangWatch API, which then processes the evaluation. These server-side evaluations are a core part of setting up real-time monitoring and evaluations in production.
evaluate() / async_evaluate() Key Parameters
The evaluate() and async_evaluate() methods are available on both LangWatchSpan and LangWatchTrace objects. They can also be imported from langwatch.evaluations and called as langwatch.evaluate() or langwatch.async_evaluate(), where you would then explicitly pass the span or trace argument. For detailed parameter descriptions, refer to the API reference:
- `LangWatchSpan.evaluate()` and `LangWatchSpan.async_evaluate()`
- `LangWatchTrace.evaluate()` and `LangWatchTrace.async_evaluate()`
Understanding the data Parameter

The core parameters like `slug`, `data`, `settings`, `as_guardrail`, `span`, and `trace` are generally consistent across both calling styles.

For the `data` parameter specifically: while `BasicEvaluateData` is commonly used to provide a standardized structure for `input`, `output`, and `contexts` (which many built-in or common evaluators expect), it's important to know that `data` can be any dictionary. This flexibility allows you to pass arbitrary data structures tailored to custom server-side evaluators you might define. Using `BasicEvaluateData` with fields like `expected_output` is particularly useful when evaluating whether the LLM is generating the right answers against a set of expected outputs. For scenarios where a golden answer isn't available, LangWatch also supports more open-ended evaluations, such as using an LLM-as-a-judge.

The `slug` parameter refers to the unique identifier of the evaluator configured in your LangWatch project settings. You can find a list of available evaluator types and learn how to configure them in our LLM Evaluation documentation.
The functions return an `EvaluationResultModel` containing `status`, `passed`, `score`, `details`, `label`, and `cost`.
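As an illustrative sketch (the evaluator slug is a placeholder for one configured in your project, and the `BasicEvaluateData` import path may differ between SDK versions):

```python
import langwatch
from langwatch.evaluations import BasicEvaluateData  # import path may vary by SDK version

@langwatch.trace()
def grade_answer(question: str, answer: str, contexts: list[str]):
    span = langwatch.get_current_span()

    # Method form on the span; "ragas/faithfulness" is a placeholder slug.
    result = span.evaluate(
        slug="ragas/faithfulness",
        data=BasicEvaluateData(input=question, output=answer, contexts=contexts),
    )

    # Equivalent module-level form, passing the target span explicitly:
    # result = langwatch.evaluate(slug="ragas/faithfulness", data=..., span=span)

    # Inspect the EvaluationResultModel fields described above.
    print(result.status, result.passed, result.score, result.details)
    return result
```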
3. Guardrails
Guardrails are evaluations used to make decisions or enforce policies within your application. They typically result in a boolean `passed` status that your code can act upon.
Using Server-Side Evaluations as Guardrails:
Set as_guardrail=True when calling evaluate or async_evaluate.
An important behavior of `as_guardrail=True` for server-side evaluations is that if the evaluation process itself encounters an error (e.g., the evaluator service is down), the result will have `status="error"` but `passed` will default to `True`. This is a fail-safe to prevent your application from breaking due to an issue in the guardrail execution itself, assuming a "pass by default on error" stance is desired. For more on setting up safety-focused real-time evaluations like PII detection or prompt injection monitors, see our guide on Setting up Real-Time Evaluations.
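A sketch of acting on a server-side guardrail result follows; the "pii-detection" slug and the fallback message are placeholders.

```python
import langwatch

@langwatch.trace()
def reply_to_user(question: str) -> str:
    draft = f"Here is what I found about: {question}"  # stand-in for a real LLM call

    # Run a configured evaluator as a guardrail ("pii-detection" is a placeholder slug).
    result = langwatch.get_current_span().evaluate(
        slug="pii-detection",
        data={"output": draft},
        as_guardrail=True,
    )

    # If the evaluator itself errors, status is "error" but passed defaults to
    # True, so the draft is still returned (the fail-open behavior noted above).
    if not result.passed:
        return "Sorry, I can't share that response."
    return draft
```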
Using Client-Side add_evaluation as Guardrails:
Set is_guardrail=True when calling add_evaluation.
With `add_evaluation`, your code is fully responsible for interpreting the `passed` status and handling any errors during the local check.
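For example, a sketch of a local regex guardrail; the pattern, names, and redaction behavior are illustrative.

```python
import re

import langwatch

FORBIDDEN = re.compile(r"\b(password|ssn)\b", re.IGNORECASE)  # illustrative pattern

@langwatch.trace()
def moderate_reply(reply: str) -> str:
    passed = FORBIDDEN.search(reply) is None

    # Record the local check as a guardrail; your own code decides what happens next.
    langwatch.get_current_span().add_evaluation(
        name="forbidden_words_check",
        is_guardrail=True,
        passed=passed,
        details="No forbidden words found." if passed else "Reply matched the forbidden-word pattern.",
    )

    return reply if passed else "[redacted]"
```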
How Evaluations and Guardrails Appear in LangWatch
Both client-side and server-side evaluations (including those marked as guardrails) are logged as spans in LangWatch.
- `add_evaluation`: Creates a child span of type `evaluation` (or `guardrail` if `is_guardrail=True`).
- `evaluate` / `async_evaluate`: Also create a child span of type `evaluation` (or `guardrail` if `as_guardrail=True`).
These spans store the evaluation details in their `output` attribute. This allows you to:
- See a history of all evaluation outcomes.
- Filter traces by evaluation results.
- Analyze the performance of different evaluators or guardrails.
- Correlate evaluation outcomes with other trace data (e.g., LLM inputs/outputs, latencies).
Use Cases
- Quality Assurance:
  - Client-Side: Log scores from a custom heuristic checking for politeness in responses.
  - Server-Side: Trigger a managed "Toxicity" evaluator on LLM outputs, or use more open-ended approaches like an LLM-as-a-judge for tasks without predefined correct answers.
- Compliance & Safety:
  - Client-Side Guardrail: Perform a regex check for forbidden words and log it with `is_guardrail=True`.
  - Server-Side Guardrail: Use a managed "PII Detection" evaluator with `as_guardrail=True` to decide if a response can be shown.
- Performance Monitoring:
  - Client-Side: Log human feedback scores (`add_evaluation`) for helpfulness.
  - Server-Side: Evaluate RAG system outputs for "Context Relevancy" and "Faithfulness" using managed evaluators.
- A/B Testing: Log custom metrics or trigger standard evaluations for different model versions or prompts to compare their performance.
- Feedback Integration: `add_evaluation` can be used to pipe scores from an external human review platform directly into the relevant trace.