Extensive Unit Testing
Welcome to the Extensive Unit Testing tutorial. This guide explains how to build a comprehensive test suite for your LLM application using LangEvals. Our first example use case focuses on the Entity Extraction task: imagine you have a list of addresses in unstructured text format, and you want to use an LLM to transform it into a spreadsheet. Along the way, many questions come up, such as which model to choose, how to determine which one performs best, and how often the model fails to produce the expected results.
Prepare the Data
The first step is to model our data using a Pydantic schema. This helps validate and structure the data, making it easier to serialize entries into JSON strings later.
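A minimal schema for this task could look like the sketch below; the exact field names are an illustrative assumption, not something LangEvals prescribes, so shape them to whatever you want to extract:

```python
from pydantic import BaseModel


# One structured address entry that the LLM should extract from free-form text
class Address(BaseModel):
    number: int
    street_name: str
    city: str
    country: str
```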
Once we have modeled our data format, we can create a small dataset with three examples.
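For example, building on the Address model above (the sample texts are made up for illustration, and this assumes Pydantic v2, where model_dump_json() serializes each expected address into a JSON string):

```python
import pandas as pd

entries = pd.DataFrame(
    {
        "input": [
            "Please ship the order to 123 Main Street, Springfield, USA.",
            "Our office recently moved to 45 Baker Street, London, United Kingdom.",
            "Send the invoice to 900 Avenida Paulista, São Paulo, Brazil.",
        ],
        "expected_output": [
            Address(number=123, street_name="Main Street", city="Springfield", country="USA").model_dump_json(),
            Address(number=45, street_name="Baker Street", city="London", country="United Kingdom").model_dump_json(),
            Address(number=900, street_name="Avenida Paulista", city="São Paulo", country="Brazil").model_dump_json(),
        ],
    }
)
```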
In this example, `entries` is a Pandas DataFrame object with two columns: `input` and `expected_output`. The `expected_output` column contains the expected results, which we will compare against the model's responses during evaluation.
Evaluate Different Models
Now we can start our tests. Let's compare different models. We define a list of the models we're interested in and use litellm to perform the API calls to these models. Next, we create a test function and decorate it with `@pytest.mark.parametrize`. The test function calls the LLM with `entry.input` and compares the response with `entry.expected_output`.
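A sketch of such a test is shown below; the model names, system prompt, and temperature are placeholders of my own rather than recommendations, so adjust them to the providers you have access to:

```python
import itertools

import litellm
import pytest

models = ["gpt-3.5-turbo", "gpt-4o-mini", "claude-3-haiku-20240307"]  # placeholder model list


@pytest.mark.parametrize("entry, model", itertools.product(entries.itertuples(), models))
def test_extracts_the_right_address(entry, model):
    # Ask the model to extract the address from the unstructured text
    response = litellm.completion(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Extract the address from the user's text and reply only with a JSON object "
                "with the keys: number, street_name, city, country.",
            },
            {"role": "user", "content": entry.input},
        ],
        temperature=0.0,
    )
    raw_output = response.choices[0].message.content

    # Parse both sides through the Pydantic schema and compare the structured values
    assert Address.model_validate_json(raw_output) == Address.model_validate_json(entry.expected_output)
```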
In this test we leverage `@pytest.mark.parametrize` to run the same test function with different parameters. Using `itertools.product`, we pair each model with each entry, resulting in 9 different test cases.
Wow, right? Now you can see how each model performs on a larger scale.
Evaluate with a Pass Rate
LLMs are probabilistic by nature, meaning the results of the same test with the same input can vary. However, you can set a `pass_rate` threshold to make the test suite pass even if some tests fail.
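Assuming the LangEvals pytest plugin is installed (which is where I take the pass_rate marker from), it can be stacked on top of the parametrization roughly like this, with 0.6 matching the 60% threshold discussed below:

```python
@pytest.mark.parametrize("entry, model", itertools.product(entries.itertuples(), models))
@pytest.mark.pass_rate(0.6)
def test_extracts_the_right_address_with_pass_rate(entry, model):
    # Same body as the previous test: call the model and compare the parsed addresses
    ...
```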
In this example we added a second `@pytest` decorator, which allows the test result to be a PASS even if only 60% of the test runs are successful. For instance, if the LLM sometimes returns “United States” instead of “USA”, we can still consider the suite passing as long as it stays within our acceptable level of uncertainty.
Evaluate with Flaky
Flaky is a pytest extension designed for testing software systems that depend on non-deterministic components, such as network communication or AI/ML algorithms.
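With the flaky plugin installed, the retry behaviour described below can be enabled with a single marker; the test name and body are placeholders reused from the earlier example:

```python
@pytest.mark.parametrize("entry, model", itertools.product(entries.itertuples(), models))
@pytest.mark.flaky(max_runs=3)
def test_extracts_the_right_address_with_retries(entry, model):
    # Same body as before; a failing combination is re-run up to 2 more times
    ...
```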
In this case, each combination of entry and model that fails during its test will be retried up to 2 more times before being marked as a failure. You can also specify the minimum number of passes required before marking the test as a PASS, using `@pytest.mark.flaky(max_runs=3, min_passes=2)`.
LLM-as-a-Judge and expect
Let's take another use case: recipe generation. As the task becomes more nuanced, it also becomes harder to properly evaluate the quality of the LLM's response. The LLM-as-a-Judge approach comes in handy in such situations. For example, you can use `CustomLLMBooleanEvaluator` to check if the generated recipes are all vegetarian.
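A sketch of such a test is shown below; the import paths, the settings fields, the judge prompt, and the gpt-4o-mini model name are my assumptions about the current LangEvals packages, so verify them against the version you have installed:

```python
import litellm
import pytest

from langevals import expect
from langevals_langevals.llm_boolean import (
    CustomLLMBooleanEvaluator,
    CustomLLMBooleanSettings,
)

recipe_questions = [
    "Can you give me a quick dinner recipe with chickpeas?",
    "What should I cook with the leftover rice in my fridge?",
    "Suggest a hearty meal for a cold winter evening.",
]


@pytest.mark.parametrize("question", recipe_questions)
def test_recipes_are_vegetarian(question):
    # Generate a recipe for the user's question
    recipe = (
        litellm.completion(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": "You are a helpful cook. Suggest only vegetarian recipes."},
                {"role": "user", "content": question},
            ],
        )
        .choices[0]
        .message.content
    )

    # Ask an LLM judge whether the generated recipe is vegetarian
    vegetarian_checker = CustomLLMBooleanEvaluator(
        settings=CustomLLMBooleanSettings(
            prompt="Is this recipe vegetarian? It should not contain any meat or fish.",
        )
    )

    expect(input=question, output=recipe).to_pass(vegetarian_checker)
```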
Pay attention to how we use `expect` at the end of our test. This is a special assertion utility function that simplifies the evaluation run and prints a nice output with a detailed explanation in case of failures. The `expect` utility interface is modeled after Jest assertions, so you can expect a somewhat similar API if you are experienced with Jest.
Other Evaluators
Just like `CustomLLMBooleanEvaluator`, you can use any other evaluator available in LangEvals to prevent regressions on a variety of cases. For example, here we check that the LLM answers are always in English, regardless of the language used in the question, and we also measure how relevant the answers are to the question:
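One possible version of such a test is sketched below; the evaluator class names, their settings fields, the sample questions, and the model name are assumptions on my side (the 0.8 pass rate and score threshold mirror the numbers explained next), so check the LangEvals evaluator list for the exact interfaces:

```python
import litellm
import pytest

from langevals import expect
from langevals_lingua.language_detection import (
    LinguaLanguageDetectionEvaluator,
    LinguaLanguageDetectionSettings,
)
from langevals_ragas.answer_relevancy import RagasAnswerRelevancyEvaluator

questions = [
    "What is a good substitute for eggs when baking?",
    "Quel est le meilleur fromage pour une raclette ?",
    "Qual é o segredo de um bom risoto?",
]


@pytest.mark.parametrize("question", questions)
@pytest.mark.pass_rate(0.8)
def test_answers_are_in_english_and_relevant(question):
    # Generate an answer; questions may be in any language, answers should stay in English
    answer = (
        litellm.completion(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": "Always answer in English, regardless of the question's language."},
                {"role": "user", "content": question},
            ],
        )
        .choices[0]
        .message.content
    )

    language_checker = LinguaLanguageDetectionEvaluator(
        settings=LinguaLanguageDetectionSettings(
            check_for="output_matches_language",
            expected_language="EN",
        )
    )
    answer_relevancy_checker = RagasAnswerRelevancyEvaluator()

    # The answer must be detected as English...
    expect(input=question, output=answer).to_pass(language_checker)
    # ...and stay relevant to the question according to Ragas
    expect(input=question, output=answer).score(answer_relevancy_checker).to_be_greater_than(0.8)
```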
In this example we are not only validating a boolean assertion, but also making sure that 80% of our samples keep an answer relevancy score above 0.8, as measured by the Ragas Answer Relevancy evaluator.
Open in Notebook
You can access and run the code yourself in a Jupyter Notebook.