
Better Agents

Better Agents is a CLI tool and a set of standards for building reliable, testable, production-grade agents, independent of which framework you use. It supercharges your coding assistant (Kilocode, Claude Code, Cursor, etc.), making it an expert in whichever agent framework you choose (Agno, Mastra, LangGraph, Vercel AI, Google ADK, or anything else) and in that framework's best practices. Better Agents doesn't replace your stack, it stabilizes it.
Already have a project? Add evaluations, observability, and scenarios to your existing agent project. See the Integration Guide to get started.

Quick Start

Installation

Install Better Agents globally:
npm install -g @langwatch/better-agents
Or use with npx (no installation required):
npx @langwatch/better-agents init my-agent-project

Initialize a New Project

After installation, create a new Better Agents project:
# In current directory
better-agents init .

# In a new directory
better-agents init my-better-agent
The CLI will guide you through selecting your programming language, agent framework, coding assistant, LLM provider, and API keys.

Create Your First Project

After running the init command, navigate to your project:
cd my-better-agent
You’ll see a structure like this:
my-better-agent/
├── app/                    # Your agent implementation (or src/ for TypeScript)
│   └── agent.py            # Main agent code using your chosen framework
├── tests/
│   ├── scenarios/          # End-to-end conversational tests
│   │   └── example_scenario.test.py
│   └── evaluations/        # Component-level evaluation notebooks
│       └── example_eval.ipynb
├── prompts/                # Versioned prompt files (YAML format)
│   └── sample_prompt.yaml
├── prompts.json            # Prompt registry (syncs with LangWatch)
├── .mcp.json               # MCP configuration for coding assistants
├── AGENTS.md               # Development guidelines and best practices
├── .env                    # Environment variables (API keys, etc.)
└── .gitignore

Run Your First Scenario Test

Better Agents projects come with example scenario tests. Run them to see how agent testing works:
pytest tests/scenarios/ -v
Once you run your first scenario, you’ll see results appear in your LangWatch project dashboard under the Simulations section.
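The example tests are tagged with the agent_test pytest marker (see the scenario example later in this guide), so you can also run just the scenario suite from the project root, assuming the generated examples keep that marker:
pytest -m agent_test -v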

Project Structure

Every Better Agents project follows a tested, scalable, maintainable layout. Here’s what each directory does:

app/ or src/

Your actual agent code, written using your chosen framework. This is where you implement your agent’s logic, tools, and workflows.
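For orientation, here is a minimal, hypothetical app/agent.py. It is only a sketch: it assumes a Python project using the OpenAI SDK directly, and the answer_customer function and model name are placeholders for whatever your chosen framework generates:
# app/agent.py (illustrative sketch, not the generated file)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_customer(message: str) -> str:
    """Answer a single customer message; your framework's agent logic goes here."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content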

tests/scenarios/

The core of real agent reliability. These aren’t unit tests—they’re conversational test cases that simulate real tasks and validate agent behavior across iterations, updates, or model swaps. Scenarios answer the most important question in AI engineering: Does the agent still behave the way we expect? Example scenario structure:
tests/scenarios/example_scenario.test.py
import pytest
import scenario

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_customer_support_refund():
    """Test that agent handles refund requests correctly."""
    
    # Run the scenario
    result = await scenario.run(
        name="refund request",
        description="Customer requests refund for defective product",
        agents=[
            CustomerSupportAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent should acknowledge the issue",
                "Agent should check order status",
                "Agent should initiate refund process",
            ])
        ],
        script=[
            scenario.user("I want a refund for my order"),
            scenario.agent(),
            verify_refund_initiated,
            scenario.user("The product arrived damaged"),
            scenario.agent(),
            scenario.judge(),
        ],
    )
    
    assert result.success, result.reasoning
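The test above references CustomerSupportAgent and verify_refund_initiated, which live in your own project code. Here is a hedged sketch of both, assuming the Scenario library's AgentAdapter and ScenarioState interfaces and the hypothetical answer_customer helper from the app/ sketch above; adapt the names and the assertion to your agent:
import scenario

from app.agent import answer_customer  # hypothetical import path; adjust to your project

class CustomerSupportAgent(scenario.AgentAdapter):
    """Adapts your agent so the scenario runner can call it turn by turn."""

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Hand the latest user message to your agent and return its reply.
        return answer_customer(input.last_new_user_message_str())

def verify_refund_initiated(state: scenario.ScenarioState) -> None:
    # Custom script step: assert on the conversation so far. The last_message()
    # helper and the keyword check are assumptions; check the Scenario docs and
    # adapt this to your agent's actual output or tool calls.
    assert "refund" in str(state.last_message()).lower(), "expected the agent to mention a refund"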

tests/evaluations/

Structured benchmarking for components like RAG correctness, retrieval F1 score, classification accuracy, and routing accuracy. LangWatch provides an extensive library of evaluators including answer correctness, LLM-as-judge, RAG quality metrics, safety checks, and more.
See the complete list of available evaluators in Evaluators List.
Evaluation notebooks allow teams to quantitatively test individual components:
tests/evaluations/rag_correctness.ipynb
import langwatch

# Evaluate RAG retrieval accuracy
results = langwatch.evaluate(
    dataset="customer-support-dataset",
    evaluator="rag_correctness",
    metric="f1_score"
)

print(f"RAG F1 Score: {results.f1_score}")

prompts/

Versioned prompt files in YAML format for team collaboration. Prompts are tracked, shared, and collaboratively improved—like real software. Example prompt structure:
prompts/customer-support.yaml
handle: customer-support-bot
scope: PROJECT
model: openai/gpt-4o-mini
prompt: |
  You are a helpful customer support agent.
  
  User inquiry: {{user_message}}
  
  Context:
  {{context}}
  
  Instructions:
  - Be polite and professional
  - Resolve the issue efficiently
  - Escalate if necessary
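At runtime you fetch the versioned prompt by its handle instead of hardcoding it. The sketch below assumes the LangWatch Python SDK's prompt API (langwatch.prompts.get plus compile); check the Prompt Management docs for the exact calls and authentication setup:
import langwatch

# Fetch the prompt by its handle and fill in the template variables.
prompt = langwatch.prompts.get("customer-support-bot")
compiled = prompt.compile(
    user_message="My order arrived damaged, can I get a refund?",
    context="Order #1234, delivered yesterday.",
)
# Pass the compiled prompt (resolved messages and model) to your LLM call.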

prompts.json

Prompt registry that controls which prompts are active and versioned. This file is versioned along with your codebase while also syncing to the LangWatch platform playground for collaboration.

.mcp.json

MCP server configuration, preconfigured with the MCP servers that make your coding assistant an expert in your framework of choice and in writing Scenario tests for your agent. Through these servers the assistant automatically discovers available MCP tools and knows where to find new capabilities.

AGENTS.md

Development guidelines that ensure every new feature is properly tested, evaluated, and that prompts are versioned. This file guides your coding assistant to follow Better Agents best practices.

Core Concepts

Scenarios

Scenarios are end-to-end conversational tests that validate agent behavior in realistic, multi-turn conversations. Unlike static input-output tests, scenarios simulate how real users interact with your agent. Why scenarios matter:
  • Test agent behavior as a complete system
  • Catch regressions before they reach production
  • Validate complex workflows and edge cases
  • Ensure consistency across model updates
Scenarios complement evaluations by testing the agent as a whole system rather than isolated parts.
For detailed scenario testing documentation, see Agent Simulations.

Evaluations

Evaluations provide structured benchmarking for specific components of your agent pipeline. Examples include:
  • RAG correctness - Measure retrieval and generation accuracy
  • Retrieval F1 score - Evaluate search quality
  • Classification accuracy - Test routing and categorization
  • Routing accuracy - Validate decision-making logic
LangWatch offers an extensive library of evaluators covering answer correctness, LLM-as-judge metrics, RAG quality, safety checks, format validation, and more. See the complete evaluators list for all available options. Evaluations make AI development feel less like experimentation and more like engineering.
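To make a metric like retrieval F1 concrete, here is a small self-contained sketch of how it is computed from retrieved versus relevant document IDs (LangWatch's evaluator library computes this and many other metrics for you):
def retrieval_f1(retrieved: set[str], relevant: set[str]) -> float:
    """Harmonic mean of retrieval precision and recall over document IDs."""
    if not retrieved or not relevant:
        return 0.0
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 2 of 3 retrieved chunks are relevant; 2 of 4 relevant chunks were found.
print(retrieval_f1({"doc1", "doc2", "doc3"}, {"doc1", "doc2", "doc4", "doc5"}))  # ~0.57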
Learn more about evaluations in LLM Evaluation.

Prompt Versioning

Prompts are no longer ad-hoc artifacts. With Better Agents, they become:
  • Tracked - Full version history with easy rollback
  • Reviewable - Team collaboration on prompt improvements
  • Documented - Clear structure and purpose
  • Synced - Controlled by prompts-lock.json, versioned with codebase, synced to platform
This enables prompt management workflows similar to dependency management in traditional software.
For comprehensive prompt management features, see Prompt Management.

MCP Integration

The .mcp.json configuration enables your coding assistant to understand your agent framework and Better Agents standards.
Learn more about MCP integration in LangWatch MCP.

Next Steps