Extensive Unit Testing
Welcome to the Extensive Unit Testing tutorial. This guide explains how to build a comprehensive test suite for your LLM application using LangEvals. Our first example use case focuses on the Entity Extraction task: imagine you have a list of addresses in unstructured text format, and you want to use an LLM to transform it into a spreadsheet. Along the way, many questions come up, such as which model to choose, how to determine which one performs best, and how often the model fails to produce the expected results.
Prepare the Data
The first step is to model our data using a Pydantic schema. This helps validate and structure the data, making it easier to serialize entries into JSON strings later.
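A minimal schema for this task could look like the sketch below; the exact field names are an illustrative assumption, not something LangEvals prescribes, so shape them to whatever you want to extract:

```python
from pydantic import BaseModel


# One structured address entry that the LLM should extract from free-form text
class Address(BaseModel):
    number: int
    street_name: str
    city: str
    country: str
```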
Once we have modeled our data format, we can create a small dataset with three examples.
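For example, building on the Address model above (the sample texts are made up for illustration, and this assumes Pydantic v2, where model_dump_json() serializes each expected address into a JSON string):

```python
import pandas as pd

entries = pd.DataFrame(
    {
        "input": [
            "Please ship the order to 123 Main Street, Springfield, USA.",
            "Our office recently moved to 45 Baker Street, London, United Kingdom.",
            "Send the invoice to 900 Avenida Paulista, São Paulo, Brazil.",
        ],
        "expected_output": [
            Address(number=123, street_name="Main Street", city="Springfield", country="USA").model_dump_json(),
            Address(number=45, street_name="Baker Street", city="London", country="United Kingdom").model_dump_json(),
            Address(number=900, street_name="Avenida Paulista", city="São Paulo", country="Brazil").model_dump_json(),
        ],
    }
)
```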
In this example, `entries` is a Pandas DataFrame object with two columns: `input` and `expected_output`. The `expected_output` column contains the expected results, which we will compare against the model's responses during evaluation.
Evaluate Different Models
Now we can start our tests. Let's compare different models. We define a list of the models we're interested in and use litellm to perform the API calls to these models. Next, we create a test function and decorate it with `@pytest.mark.parametrize`. The test function calls the LLM with `entry.input` and compares the response with `entry.expected_output`.
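A sketch of such a test is shown below; the model names, system prompt, and temperature are placeholders of my own rather than recommendations, so adjust them to the providers you have access to:

```python
import itertools

import litellm
import pytest

models = ["gpt-3.5-turbo", "gpt-4o-mini", "claude-3-haiku-20240307"]  # placeholder model list


@pytest.mark.parametrize("entry, model", itertools.product(entries.itertuples(), models))
def test_extracts_the_right_address(entry, model):
    # Ask the model to extract the address from the unstructured text
    response = litellm.completion(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Extract the address from the user's text and reply only with a JSON object "
                "with the keys: number, street_name, city, country.",
            },
            {"role": "user", "content": entry.input},
        ],
        temperature=0.0,
    )
    raw_output = response.choices[0].message.content

    # Parse both sides through the Pydantic schema and compare the structured values
    assert Address.model_validate_json(raw_output) == Address.model_validate_json(entry.expected_output)
```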
In this test we leverage `@pytest.mark.parametrize` to run the same test function with different parameters. Using `itertools.product`, we pair each model with each entry, resulting in 9 different test cases.
Wow, right? Now you can see how each model performs on a larger scale.
Evaluate with a Pass Rate
LLMs are probabilistic by nature, meaning the results of the same test with the same input can vary. However, you can set a `pass_rate` threshold to make the test suite pass even if some tests fail.
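Assuming the LangEvals pytest plugin is installed (which is where I take the pass_rate marker from), it can be stacked on top of the parametrization roughly like this, with 0.6 matching the 60% threshold discussed below:

```python
@pytest.mark.parametrize("entry, model", itertools.product(entries.itertuples(), models))
@pytest.mark.pass_rate(0.6)
def test_extracts_the_right_address_with_pass_rate(entry, model):
    # Same body as the previous test: call the model and compare the parsed addresses
    ...
```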
In this example we added a second `@pytest` decorator, which allows the test result to be a PASS even if only 60% of the test runs are successful. For instance, if the LLM sometimes returns “United States” instead of “USA”, we can still consider the suite passing as long as it stays within our acceptable level of uncertainty.
Evaluate with Flaky
Flaky is a pytest extension designed for testing software systems that depend on non-deterministic components, such as network communication or AI/ML algorithms.
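With the flaky plugin installed, the retry behaviour described below can be enabled with a single marker; the test name and body are placeholders reused from the earlier example:

```python
@pytest.mark.parametrize("entry, model", itertools.product(entries.itertuples(), models))
@pytest.mark.flaky(max_runs=3)
def test_extracts_the_right_address_with_retries(entry, model):
    # Same body as before; a failing combination is re-run up to 2 more times
    ...
```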
In this case, each combination of entry and model that fails during its test will be retried up to 2 more times before being marked as a failure. You can also specify the minimum number of passes required before marking the test as a PASS, using `@pytest.mark.flaky(max_runs=3, min_passes=2)`.
LLM-as-a-Judge and expect
Let's take another use case: recipe generation. As the task becomes more nuanced, it also becomes harder to properly evaluate the quality of the LLM's response. The LLM-as-a-Judge approach comes in handy in such situations. For example, you can use `CustomLLMBooleanEvaluator` to check if the generated recipes are all vegetarian.
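A sketch of such a test is shown below; the import paths, the settings fields, the judge prompt, and the gpt-4o-mini model name are my assumptions about the current LangEvals packages, so verify them against the version you have installed:

```python
import litellm
import pytest

from langevals import expect
from langevals_langevals.llm_boolean import (
    CustomLLMBooleanEvaluator,
    CustomLLMBooleanSettings,
)

recipe_questions = [
    "Can you give me a quick dinner recipe with chickpeas?",
    "What should I cook with the leftover rice in my fridge?",
    "Suggest a hearty meal for a cold winter evening.",
]


@pytest.mark.parametrize("question", recipe_questions)
def test_recipes_are_vegetarian(question):
    # Generate a recipe for the user's question
    recipe = (
        litellm.completion(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": "You are a helpful cook. Suggest only vegetarian recipes."},
                {"role": "user", "content": question},
            ],
        )
        .choices[0]
        .message.content
    )

    # Ask an LLM judge whether the generated recipe is vegetarian
    vegetarian_checker = CustomLLMBooleanEvaluator(
        settings=CustomLLMBooleanSettings(
            prompt="Is this recipe vegetarian? It should not contain any meat or fish.",
        )
    )

    expect(input=question, output=recipe).to_pass(vegetarian_checker)
```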
Pay attention to how we use `expect` at the end of our test. This is a special assertion utility function that simplifies the evaluation run and prints a nice output with a detailed explanation in case of failures. The `expect` utility interface is modeled after Jest assertions, so you can expect a somewhat similar API if you are experienced with Jest.
Other Evaluators
Just like `CustomLLMBooleanEvaluator`, you can use any other evaluator available in LangEvals to prevent regressions on a variety of cases. For example, here we check that the LLM answers are always in English, regardless of the language used in the question, and we also measure how relevant the answers are to the question:
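One possible version of such a test is sketched below; the evaluator class names, their settings fields, the sample questions, and the model name are assumptions on my side (the 0.8 pass rate and score threshold mirror the numbers explained next), so check the LangEvals evaluator list for the exact interfaces:

```python
import litellm
import pytest

from langevals import expect
from langevals_lingua.language_detection import (
    LinguaLanguageDetectionEvaluator,
    LinguaLanguageDetectionSettings,
)
from langevals_ragas.answer_relevancy import RagasAnswerRelevancyEvaluator

questions = [
    "What is a good substitute for eggs when baking?",
    "Quel est le meilleur fromage pour une raclette ?",
    "Qual é o segredo de um bom risoto?",
]


@pytest.mark.parametrize("question", questions)
@pytest.mark.pass_rate(0.8)
def test_answers_are_in_english_and_relevant(question):
    # Generate an answer; questions may be in any language, answers should stay in English
    answer = (
        litellm.completion(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": "Always answer in English, regardless of the question's language."},
                {"role": "user", "content": question},
            ],
        )
        .choices[0]
        .message.content
    )

    language_checker = LinguaLanguageDetectionEvaluator(
        settings=LinguaLanguageDetectionSettings(
            check_for="output_matches_language",
            expected_language="EN",
        )
    )
    answer_relevancy_checker = RagasAnswerRelevancyEvaluator()

    # The answer must be detected as English...
    expect(input=question, output=answer).to_pass(language_checker)
    # ...and stay relevant to the question according to Ragas
    expect(input=question, output=answer).score(answer_relevancy_checker).to_be_greater_than(0.8)
```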
In this example we are not only validating a boolean assertion, but also making sure that 80% of our samples keep an answer relevancy score above 0.8, as measured by the Ragas Answer Relevancy evaluator.
Open in Notebook
You can access and run the code yourself in a Jupyter Notebook.