Batch Evaluation
When exploring, it is common to generate multiple outputs from your LLM and then evaluate their performance scores, for example, using a Jupyter Notebook.
LangEvals provides the evaluate()
function to score these results in batch using diverse evaluators.
This section will guide you through batch evaluation with multiple evaluators and demonstrate how to conveniently access the results.
Importing the Library
First, import langevals
along with the evaluators that you will use.
import langevals
from langevals_ragas.answer_relevancy import RagasAnswerRelevancyEvaluator
from langevals_langevals.competitor_blocklist import (
CompetitorBlocklistEvaluator,
CompetitorBlocklistSettings,
)
Creating Evaluation Dataset
Next, create a pandas DataFrame with input
and output
columns.
Each column represents the input and output of the LLM.
It is important to name these columns exactly as shown to ensure compatibility with the evaluators.
Some evaluators may require additional fields such as contexts
and expected_output
.
import pandas as pd
entries = pd.DataFrame(
{
"input": ["hello", "how are you?", "what is your name?"],
"output": ["hi", "I am a chatbot, no feelings", "My name is Bob"],
}
)
Run Evaluations
With a single call to the evaluate method, you can evaluate all data entries using the specified evaluators. Note that certain evaluators require a settings parameter. In this case, it is used to define the competitor’s name to be blocklisted. For further documentation, refer to Evaluators.
results = langevals.evaluate(
entries,
[
RagasAnswerRelevancyEvaluator(),
CompetitorBlocklistEvaluator(
settings=CompetitorBlocklistSettings(competitors=["Bob"])
),
],
)
Access the Results
Finally, the results can be accessed as a pandas dataframe.
results.to_pandas()
Results:
input | output | answer_relevancy | competitor_blocklist | competitor_blocklist_details |
---|---|---|---|---|
hello | hi | 0.800714 | True | None |
how are you? | I am a chatbot, no feelings | 0.813168 | True | None |
what is your name? | My name is Bob | 0.971663 | False | Competitors mentioned: Bob |