LLMs are non-deterministic by nature, able to produce answers to open-ended questions. This is their very power, but it is also what makes quality hard to guarantee, as the same input won't always produce the same output. This leads to a lack of confidence when going to production, and to unscalable testing practices that rely solely on the "vibe" feeling of the development team.
What we need, then, is a systematic approach for measuring quality and improving LLM applications: a dataset of good test examples to run through your LLM application, and a way to define quality, either with strict definitions (e.g. golden answers, retrieval metrics) or open ones (e.g. the task was achieved, openly defined criteria and style).
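To make that concrete, here is a minimal sketch of what a test dataset and a strict, golden-answer quality check could look like in Python. The column names and example rows are purely illustrative:

```python
# Hypothetical test examples: each one pairs an input with a golden answer.
test_examples = [
    {
        "question": "How do I reset my password?",
        "expected_output": "Go to Settings > Security and click 'Reset password'.",
    },
    {
        "question": "Can I change my plan mid-cycle?",
        "expected_output": "Yes, plan changes apply immediately and are prorated.",
    },
]

def golden_answer_match(output: str, expected_output: str) -> bool:
    """A strict quality definition: the output must match the golden answer."""
    return output.strip().lower() == expected_output.strip().lower()

# Open definitions (task achieved, tone, custom criteria) can't be reduced to a
# string comparison; they are typically scored by an LLM judge against a rubric.
```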
In this guide, we are going to explore a few in-depth use cases for offline evaluations, and how you can easily get started with LLM evals by using LangWatch.
Evaluating whether the LLM is generating the right answers
Let's consider the case where there actually is a correct answer that you expect your LLM to generate for a set of example questions (if this is not your case, jump to the other sections for use cases where there is no clear-cut definition of a correct answer). Maybe you have some internal documents you are using to generate answers, or maybe you have a customer support agent that should answer questions correctly. We will use the latter as our example: let's evaluate a customer support agent.

- First, go to the evaluations page and click on New Evaluation:

- Choose Offline evaluation:

- Now it's time to choose a sample dataset. You could generate a new one with AI here, but you can also use the one we already provide for the Customer Support Agent example; just click the link to download it: Download Dataset
- Now choose Upload CSV and select the dataset file:

- Save the dataset, and you should see the full 200 examples in there:

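If you want to peek at the file outside of LangWatch, a quick sketch with pandas could look like the following. The filename and column names are assumptions here; check the actual CSV headers in the download:

```python
import pandas as pd

# Load the downloaded sample dataset (the filename below is illustrative).
df = pd.read_csv("customer-support-agent-dataset.csv")

print(len(df))           # expect 200 rows, one test case each
print(list(df.columns))  # e.g. an input question column and an expected_output column
print(df.head())
```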
- Now press "Next". This is where we choose our executor, which will run our examples. This could be, for example, an API that your application has that we can run the examples through, a Python code block, or a simple prompt we can create right now.
So let's choose Create a prompt to get started:

- Then paste our sample Customer Service prompt in there:
- Also choose the LLM you want to execute this with; I'm going to run it with gemini-2.0-flash-lite:

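To make it clearer what the executor does, here is a rough Python equivalent of this prompt-based setup: for each dataset row, the prompt plus the question is sent to gemini-2.0-flash-lite and the reply becomes the output. The prompt text and the LiteLLM call are illustrative assumptions, not the exact code LangWatch runs:

```python
import litellm  # requires GEMINI_API_KEY (or your provider's key) in the environment

# Stand-in for the sample Customer Service prompt pasted in the UI.
SYSTEM_PROMPT = (
    "You are a helpful customer support agent. "
    "Answer the customer's question accurately and concisely."
)

def run_executor(question: str) -> str:
    """Generate an answer for a single dataset row with the chosen model."""
    response = litellm.completion(
        model="gemini/gemini-2.0-flash-lite",  # LiteLLM-style provider prefix
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(run_executor("How do I reset my password?"))
```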
- Next, it's time to choose our evaluator. This is where you have all the options, depending on your use case. For our use case we do have the expected answers beforehand, so we can choose "Expected Answer Evaluation":

- Next, select "LLM Answer Match". This evaluator uses an LLM to compare the generated answer (output) with the gold-standard one (expected_output) and verify that they are equivalent, that is, that they both answer the question the same way, even if written very differently:

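Conceptually, LLM Answer Match is itself a small LLM call: the judge sees the question, the generated output, and the expected_output, and decides whether they are equivalent. Here is a hedged sketch of how such a judge could work; the prompt wording and the LiteLLM call are illustrative, not LangWatch's internal implementation:

```python
import litellm

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Expected answer: {expected_output}
Generated answer: {output}

Do both answers address the question in the same way, even if worded
differently? Reply with exactly one word: PASS or FAIL."""

def llm_answer_match(question: str, output: str, expected_output: str) -> bool:
    """Return True when the judge LLM considers the two answers equivalent."""
    response = litellm.completion(
        model="gemini/gemini-2.0-flash-lite",  # any capable judge model works
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, output=output, expected_output=expected_output
            ),
        }],
    )
    return "PASS" in response.choices[0].message.content.upper()
```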
- Press "Next". Then, finally, it's time to run our evaluation: give it a name and press "Run Evaluation":

- You can see the results now, and they are great! 95% pass, with just a couple of answers wrong that we might want to iron out.
