LangWatch enables A/B testing by letting you create multiple versions of a prompt and alternate between them at random. Your application serves a different variant on each request while LangWatch tracks performance metrics for each version.

How It Works

  1. Create variants as different versions of the same prompt
  2. Switch between versions at runtime with an A/B testing strategy
  3. Track performance using LangWatch’s built-in analytics
  4. Compare results to see which version performs better

Implementation

Create Prompt Variants

Create different versions of your prompt for testing:
import { LangWatch } from "langwatch";

const langwatch = new LangWatch({
  apiKey: process.env.LANGWATCH_API_KEY
});

// Create base prompt
const basePrompt = await langwatch.prompts.create({
  handle: "customer-support-bot",
  scope: "PROJECT",
  prompt: "You are a helpful customer support agent. Help with: {{input}}",
  inputs: [{ identifier: "input", type: "str" }],
  outputs: [{ identifier: "response", type: "str" }],
  model: "openai/gpt-4o-mini"
});

// Create variant A (friendly tone): each update creates and returns a new version
const variantA = await langwatch.prompts.update("customer-support-bot", {
  prompt: "You are a friendly and empathetic customer support agent. Use a warm, helpful tone. Help with: {{input}}"
});

// Create variant B (professional tone): each update creates and returns a new version
const variantB = await langwatch.prompts.update("customer-support-bot", {
  prompt: "You are a professional and efficient customer support agent. Be concise and solution-focused. Help with: {{input}}"
});

// Store version numbers for A/B testing
const versions = {
  base: basePrompt.version,
  friendly: variantA.version,
  professional: variantB.version
};

console.log("Version numbers:", versions);

Run A/B Tests

Use the captured version numbers to switch between prompt versions at runtime (random sampling). This sketch assumes the Vercel AI SDK (generateText from "ai", openai from "@ai-sdk/openai") as the LLM client, alongside the langwatch client created above:
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

async function generateResponse(userInput: string) {
  // Version numbers captured in the previous step
  // (example values; substitute the ones logged by the setup script)
  const versions = {
    base: 1,
    friendly: 2,
    professional: 3
  };
  
  // Randomly select a variant
  const variants = [
    { version: versions.base, description: "Base version" },
    { version: versions.friendly, description: "Friendly tone" },
    { version: versions.professional, description: "Professional tone" }
  ];
  
  const randomVariant = variants[Math.floor(Math.random() * variants.length)];
  
  // Fetch the selected prompt version
  const prompt = await langwatch.prompts.get("customer-support-bot", {
    version: randomVariant.version
  });
  
  // Compile and use the prompt
  const compiledPrompt = prompt.compile({ input: userInput });
  
  // Use with your LLM client
  const result = await generateText({
    model: openai(prompt.model.replace("openai/", "")),
    messages: compiledPrompt.messages
  });
  
  return {
    response: result.text,
    version: randomVariant.version,
    description: randomVariant.description
  };
}
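For example, a few calls show the random assignment in action (the input string and log format are just illustrative):
// Each call is randomly assigned to one of the three variants
const { response, version, description } = await generateResponse(
  "I was charged twice for my subscription."
);
console.log(`[v${version}: ${description}]`, response);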

Track Performance

LangWatch automatically tracks performance metrics for each prompt version:
  • Response latency - Which version is faster?
  • Token usage - Which version is more efficient?
  • Cost per request - Which version is more cost-effective?
  • Quality scores - Which version produces better responses?
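The LangWatch dashboard breaks these metrics down per prompt version out of the box. If you also want a quick local readout while iterating, a minimal sketch like the following works; the measuredCall wrapper and in-memory tally are illustrative helpers, not part of the LangWatch SDK:
// Illustrative in-memory latency tally per variant (not a LangWatch API)
const latencies = new Map<number, number[]>();

async function measuredCall(userInput: string) {
  const start = Date.now();
  const result = await generateResponse(userInput);
  const elapsed = Date.now() - start;

  const samples = latencies.get(result.version) ?? [];
  samples.push(elapsed);
  latencies.set(result.version, samples);
  return result;
}

// After a batch of measured calls, print average latency per version
for (const [version, samples] of latencies) {
  const avg = samples.reduce((sum, ms) => sum + ms, 0) / samples.length;
  console.log(`version ${version}: ${avg.toFixed(0)} ms over ${samples.length} calls`);
}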

Analyze Results

Compare metrics between versions in the LangWatch UI to see which variant performs better. Use this data to make informed decisions about which prompt version to use in production.