Offline Evaluation
How to evaluate an LLM when you don't have defined answers
Measuring your LLM's performance using an LLM-as-a-judge
For some AI applications, it is not possible to define a golden answer. This happens, for example, in creative tasks, where there is no single correct answer to compare against.
The video below shows how to use the LangWatch Evaluation Wizard to evaluate a Business Coaching Agent. There are no defined answers for this agent, so an LLM-as-a-judge is used to evaluate the quality of its answers.
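To make the LLM-as-a-judge idea concrete outside the wizard, here is a minimal Python sketch. It is not the LangWatch API: `build_judge_prompt`, `judge_answer`, the 1 to 5 scale, and the JSON reply format are all assumptions made for illustration, and `call_llm` stands in for whatever function sends a prompt to your model and returns its text reply.

```python
import json
from typing import Callable, Dict, List


def build_judge_prompt(question: str, answer: str, criteria: List[str]) -> str:
    """Build the grading prompt sent to the judge model (hypothetical format)."""
    criteria_text = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are grading an answer from a business coaching agent.\n"
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        f"Grade the answer against these criteria:\n{criteria_text}\n\n"
        'Reply with JSON only: {"score": <integer 1-5>, "reasoning": "<short text>"}'
    )


def judge_answer(
    question: str,
    answer: str,
    criteria: List[str],
    call_llm: Callable[[str], str],
) -> Dict[str, object]:
    """Ask the judge model for a score and validate its reply.

    `call_llm` is an assumed callable: prompt text in, raw reply text out.
    """
    raw_reply = call_llm(build_judge_prompt(question, answer, criteria))
    verdict = json.loads(raw_reply)  # raises ValueError if the reply is not JSON
    score = int(verdict["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reasoning": str(verdict.get("reasoning", ""))}


def fake_llm(prompt: str) -> str:
    """Stub judge used so the sketch runs without any model or network access."""
    return json.dumps({"score": 4, "reasoning": "Actionable but generic."})


if __name__ == "__main__":
    result = judge_answer(
        question="How do I retain my first customers?",
        answer="Schedule a short weekly check-in call with each of them.",
        criteria=["actionable", "specific to the question"],
        call_llm=fake_llm,
    )
    print(result)
```

Replacing `fake_llm` with a function that calls a real model is the only change needed to try this on actual answers. Validating the score range keeps a malformed judge reply from silently skewing aggregated results.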