LangWatch provides a flexible system for capturing various types of evaluations and implementing guardrails within your LLM applications. This allows you to track performance, ensure quality, and control application flow based on defined criteria.

There are three main ways to work with evaluations and guardrails:

  1. Client-Side Custom Evaluations (add_evaluation): Log any custom evaluation metric, human feedback, or external system score directly from your Python code. These are primarily for observational purposes.
  2. Server-Side Managed Evaluations (evaluate, async_evaluate): Trigger predefined or custom evaluation logic that runs on the LangWatch backend. These can return scores, pass/fail results, and other details.
  3. Guardrails: A special application of evaluations (either client-side or server-side) used to make decisions or enforce policies within your application flow.

1. Client-Side Custom Evaluations (add_evaluation)

You can log custom evaluation data directly from your application code using the add_evaluation() method on a LangWatchSpan or LangWatchTrace object. This is useful for recording metrics specific to your domain, results from external systems, or human feedback.

When you call add_evaluation(), LangWatch typically creates a new child span of type evaluation (or guardrail if is_guardrail=True) under the target span. This child span, named after your custom evaluation, stores its details, primarily in its output attribute.

Here’s an example:

import langwatch

# Assume langwatch.setup() has been called

@langwatch.span(name="Generate Response")
def process_request(user_query: str):
    response_text = f"Response to: {user_query}"
    langwatch.get_current_span().update(output=response_text)

    # Example 1: A simple pass/fail custom evaluation
    contains_keyword = "LangWatch" in response_text
    langwatch.get_current_span().add_evaluation(
        name="Keyword Check: LangWatch",
        passed=contains_keyword,
        details=f"Checked for 'LangWatch'. Found: {contains_keyword}"
    )

    # Example 2: A custom score for response quality
    human_score = 4.5
    langwatch.get_current_span().add_evaluation(
        name="Human Review: Quality Score",
        score=human_score,
        label="Good",
        details="Reviewed by Jane Doe. Response is clear and relevant."
    )
    
    # Example 3: A client-side guardrail check
    is_safe = "unsafe_word" not in response_text
    langwatch.get_current_span().add_evaluation(
        name="Safety Check (Client-Side)",
        passed=is_safe,
        is_guardrail=True, # Mark this as a guardrail
        details=f"Content safety check. Passed: {is_safe}"
    )
    if not is_safe:
        # Potentially alter flow or log a critical warning
        print("Warning: Client-side safety check failed!")


    return response_text

@langwatch.trace(name="Process User Request")
def main():
    user_question = "Tell me about LangWatch."
    generated_response = process_request(user_question)
    print(f"Query: {user_question}")
    print(f"Response: {generated_response}")

if __name__ == "__main__":
    main()

add_evaluation() Parameters

The add_evaluation() method is available on both LangWatchSpan and LangWatchTrace objects (when used on a trace, you must specify the target span). For detailed parameter descriptions, refer to the API reference.
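For illustration, here is a minimal sketch of logging an evaluation through the trace object while targeting the current span. The span keyword argument used below is an assumption about the parameter name; check the API reference for the exact signature.

import langwatch

# Assume langwatch.setup() has been called

@langwatch.span(name="Summarize")
def summarize(text: str) -> str:
    summary = text[:100]
    # Sketch: log via the trace object, explicitly targeting the current span.
    # The `span` keyword is an assumption; see the API reference for the exact name.
    langwatch.get_current_trace().add_evaluation(
        span=langwatch.get_current_span(),
        name="Summary Length OK",
        passed=len(summary) <= 100,
    )
    return summary

@langwatch.trace(name="Trace-Level Evaluation Example")
def main():
    summarize("LangWatch provides evaluations and guardrails for LLM applications.")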

2. Server-Side Managed Evaluations (evaluate & async_evaluate)

LangWatch allows you to trigger evaluations that are performed by the LangWatch backend. These can be built-in evaluators (e.g., for faithfulness, relevance) or custom evaluators you define in your LangWatch project settings.

You use the evaluate() (synchronous) or async_evaluate() (asynchronous) functions for this. These functions send the necessary data to the LangWatch API, which then processes the evaluation. These server-side evaluations are a core part of setting up real-time monitoring and evaluations in production.

import langwatch
from langwatch.evaluations import BasicEvaluateData
# from langwatch.types import RAGChunk # For RAG contexts

# Assume langwatch.setup() has been called

@langwatch.span()
def handle_rag_query(user_query: str):
    retrieved_contexts_str = [
        "LangWatch helps monitor LLM applications.",
        "Evaluations can be run on the server."
    ]
    # For richer context, use RAGChunk
    # retrieved_contexts_rag = [
    #     RAGChunk(content="LangWatch helps monitor LLM applications.", document_id="doc1"),
    #     RAGChunk(content="Evaluations can be run on the server.", document_id="doc2")
    # ]

    # Add the RAG contexts to the current span
    langwatch.get_current_span().update(contexts=retrieved_contexts_str)

    # Simulate LLM call
    llm_output = f"Based on the context, LangWatch is for monitoring and server-side evals."

    # Prepare data for server-side evaluation
    eval_data = BasicEvaluateData(
        input=user_query,
        output=llm_output,
        contexts=retrieved_contexts_str 
    )

    # Trigger a server-side "faithfulness" evaluation
    # The 'faithfulness-evaluator' slug must be configured in your LangWatch project
    try:
        faithfulness_result = langwatch.evaluate(
            slug="faithfulness-evaluator", # Slug of the evaluator in LangWatch
            name="Faithfulness Check (Server)",
            data=eval_data,
        )
        
        print(f"Faithfulness Evaluation Result: {faithfulness_result}")
        # faithfulness_result is an EvaluationResultModel(status, passed, score, details, etc.)

        # Example: Using it as a guardrail
        if faithfulness_result.passed is False:
            print("Warning: Faithfulness check failed!")

    except Exception as e:
        print(f"Error during server-side evaluation: {e}")

    return llm_output

@langwatch.trace()
def main():
    query = "What can LangWatch do with contexts?"
    response = handle_rag_query(query)
    print(f"Query: {query}")
    print(f"Response: {response}")

if __name__ == "__main__":
    main()

evaluate() / async_evaluate() Key Parameters

The evaluate() and async_evaluate() methods are available on both LangWatchSpan and LangWatchTrace objects. They can also be imported from langwatch.evaluations and called as langwatch.evaluate() or langwatch.async_evaluate(), in which case you explicitly pass the span or trace argument. For detailed parameter descriptions, refer to the API reference.
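As an illustration, here is a minimal async sketch, assuming the span decorator supports async functions and that async_evaluate() mirrors evaluate()'s parameters shown above; the "relevance-evaluator" slug is hypothetical and would need to exist in your LangWatch project.

import asyncio
import langwatch
from langwatch.evaluations import BasicEvaluateData

# Assume langwatch.setup() has been called

@langwatch.span(name="Answer Question")
async def answer_question(user_query: str) -> str:
    llm_output = f"An answer to: {user_query}"  # placeholder for a real LLM call
    langwatch.get_current_span().update(output=llm_output)

    # Sketch: async_evaluate() is assumed to take the same parameters as evaluate()
    result = await langwatch.async_evaluate(
        slug="relevance-evaluator",  # hypothetical slug
        name="Relevance Check (Server, Async)",
        data=BasicEvaluateData(input=user_query, output=llm_output),
        span=langwatch.get_current_span(),
    )
    print(f"Relevance evaluation result: {result}")
    return llm_output

@langwatch.trace(name="Async Evaluation Example")
async def main():
    await answer_question("What does LangWatch monitor?")

if __name__ == "__main__":
    asyncio.run(main())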

Understanding the data Parameter:

The core parameters (slug, data, settings, as_guardrail, span, and trace) are consistent across these calling styles. For the data parameter specifically: BasicEvaluateData is commonly used to provide a standardized structure for input, output, and contexts (which many built-in evaluators expect), but data can also be any dictionary. This flexibility lets you pass arbitrary data structures tailored to custom server-side evaluators you define. Using BasicEvaluateData with fields like expected_output is particularly useful when checking whether the LLM generates the right answers against a set of expected outputs. For scenarios where a golden answer isn’t available, LangWatch also supports more open-ended evaluations, such as using an LLM-as-a-judge.
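For example, a short sketch contrasting the two forms of data; the dictionary keys and custom-evaluator scenario are illustrative:

from langwatch.evaluations import BasicEvaluateData

# Standardized structure expected by many built-in evaluators, including an
# expected_output ("golden answer") to compare the LLM's output against.
structured_data = BasicEvaluateData(
    input="What is LangWatch?",
    output="LangWatch is a platform for monitoring and evaluating LLM apps.",
    expected_output="LangWatch helps you monitor and evaluate LLM applications.",
)

# Arbitrary dictionary for a custom server-side evaluator you define yourself;
# these keys are purely illustrative.
custom_data = {
    "conversation_turns": 4,
    "retrieved_doc_ids": ["doc1", "doc2"],
    "user_tier": "enterprise",
}

# Either value can be passed as the data argument to evaluate() / async_evaluate().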

The slug parameter refers to the unique identifier of the evaluator configured in your LangWatch project settings. You can find a list of available evaluator types and learn how to configure them in our LLM Evaluation documentation.

The functions return an EvaluationResultModel containing status, passed, score, details, label, and cost.
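For example, a small sketch inspecting those fields, reusing the faithfulness example from above:

result = langwatch.evaluate(
    slug="faithfulness-evaluator",
    data=eval_data,
)

if result.status == "error":
    print(f"Evaluation could not run: {result.details}")
else:
    print(f"passed={result.passed}, score={result.score}, label={result.label}")
    print(f"details={result.details}, cost={result.cost}")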

3. Guardrails

Guardrails are evaluations used to make decisions or enforce policies within your application. They typically result in a boolean passed status that your code can act upon.

Using Server-Side Evaluations as Guardrails: Set as_guardrail=True when calling evaluate or async_evaluate.

# ... (inside a function with a current span)
eval_data = BasicEvaluateData(output=llm_response)
pii_check_result = langwatch.evaluate(
    slug="pii-detection-guardrail",
    data=eval_data,
    as_guardrail=True,
    span=langwatch.get_current_span()
)

if pii_check_result.passed is False:
    # Take action: sanitize response, return a canned message, etc.
    return "Response redacted due to PII."

A key behavior of as_guardrail=True for server-side evaluations is that if the evaluation process itself encounters an error (e.g., the evaluator service is down), the result will have status="error" but passed will default to True. This is a fail-safe to prevent your application from breaking due to an issue in the guardrail execution itself, assuming a “pass by default on error” stance is desired. For more on setting up safety-focused real-time evaluations like PII detection or prompt injection monitors, see our guide on Setting up Real-Time Evaluations.
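If you prefer a fail-closed stance instead, check the status yourself. A sketch, continuing the PII example above:

# ... (inside a function with a current span)
pii_check_result = langwatch.evaluate(
    slug="pii-detection-guardrail",
    data=eval_data,
    as_guardrail=True,
    span=langwatch.get_current_span()
)

# Treat an errored guardrail run the same as a failed check (fail closed).
if pii_check_result.status == "error" or pii_check_result.passed is False:
    return "Response withheld: PII guardrail failed or could not be evaluated."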

Using Client-Side add_evaluation as Guardrails: Set is_guardrail=True when calling add_evaluation.

# ... (inside a function with a current span)
is_too_long = len(llm_response) > 1000
langwatch.get_current_span().add_evaluation(
    name="Length Guardrail",
    passed=(not is_too_long),
    is_guardrail=True,
    details=f"Length: {len(llm_response)}. Max: 1000"
)
if is_too_long:
    # Take action: truncate response, ask for shorter output, etc.
    return llm_response[:1000] + "..."

For client-side guardrails added with add_evaluation, your code is fully responsible for interpreting the passed status and handling any errors during the local check.
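For example, here is a sketch that catches errors in the local check itself and records the outcome; run_local_policy_check is a hypothetical helper standing in for your own logic, and whether to fail open or closed on error is your choice:

# ... (inside a function with a current span)
try:
    is_compliant = run_local_policy_check(llm_response)  # hypothetical helper
    check_error = None
except Exception as e:
    is_compliant = False  # fail closed; pick the stance that fits your application
    check_error = str(e)

langwatch.get_current_span().add_evaluation(
    name="Local Policy Guardrail",
    passed=is_compliant,
    is_guardrail=True,
    details=f"Error during check: {check_error}" if check_error else "Check completed."
)
if not is_compliant:
    return "Response blocked by local policy guardrail."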

How Evaluations and Guardrails Appear in LangWatch

Both client-side and server-side evaluations (including those marked as guardrails) are logged as spans in LangWatch.

  • add_evaluation: Creates a child span of type evaluation (or guardrail if is_guardrail=True).
  • evaluate/async_evaluate: Also create a child span of type evaluation (or guardrail if as_guardrail=True).

These spans will contain the evaluation’s name, result (score, passed, label), details, cost, and any associated metadata, typically within their output attribute. This allows you to:

  • See a history of all evaluation outcomes.
  • Filter traces by evaluation results.
  • Analyze the performance of different evaluators or guardrails.
  • Correlate evaluation outcomes with other trace data (e.g., LLM inputs/outputs, latencies).

Use Cases

  • Quality Assurance:
    • Client-Side: Log scores from a custom heuristic checking for politeness in responses.
    • Server-Side: Trigger a managed “Toxicity” evaluator on LLM outputs, or use more open-ended approaches like an LLM-as-a-judge for tasks without predefined correct answers.
  • Compliance & Safety:
    • Client-Side Guardrail: Perform a regex check for forbidden words and log it with is_guardrail=True.
    • Server-Side Guardrail: Use a managed “PII Detection” evaluator with as_guardrail=True to decide if a response can be shown.
  • Performance Monitoring: Track evaluation scores and guardrail pass rates over time to spot regressions in response quality.
  • A/B Testing: Log custom metrics or trigger standard evaluations for different model versions or prompts to compare their performance (see the sketch after this list).
  • Feedback Integration: add_evaluation can be used to pipe scores from an external human review platform directly into the relevant trace.
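A minimal A/B sketch using add_evaluation(); the variant names, prompts, and scoring heuristic below are illustrative, not part of the LangWatch API:

import langwatch

# Assume langwatch.setup() has been called

@langwatch.span(name="Generate With Variant")
def generate_with_variant(variant: str, user_query: str) -> str:
    prompt_prefix = "Be brief." if variant == "A" else "Be detailed."
    response = f"[{prompt_prefix}] Answer to: {user_query}"  # placeholder for a real LLM call
    conciseness_score = 1.0 if len(response) < 200 else 0.5  # illustrative heuristic

    # Use the variant as the label so scores can be filtered and compared per variant.
    langwatch.get_current_span().add_evaluation(
        name="Conciseness (A/B)",
        score=conciseness_score,
        label=f"variant-{variant}",
    )
    return response

@langwatch.trace(name="A/B Prompt Comparison")
def main():
    for variant in ("A", "B"):
        generate_with_variant(variant, "Summarize LangWatch in one line.")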

By combining these methods, you can build a robust evaluation and guardrailing strategy tailored to your application’s needs, all observable within LangWatch.