Measuring RAG Performance
Discover how to measure the performance of Retrieval-Augmented Generation (RAG) systems using retrieval metrics like precision, recall, and Mean Reciprocal Rank (MRR).
In this cookbook, we demonstrate how to build a RAG application and apply a systematic evaluation framework using LangWatch. We’ll focus on data-driven approaches to measure and improve retrieval performance.
Traditionally, RAG evaluation emphasizes the quality of the generated answers. However, this approach has major drawbacks: it’s slow (you must wait for the LLM to generate responses), expensive (LLM usage costs add up quickly), and subjective (evaluating answer quality can be inconsistent). Instead, we focus on evaluating retrieval, which is fast, cheap, and objective.
Requirements
Before starting, ensure you have the following packages installed:
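The exact dependency list comes from the accompanying notebook; the packages below are an assumed minimal set (the LangWatch SDK, an OpenAI client for embeddings and generation, a Qdrant client as the vector store, plus pandas and matplotlib for analysis), so adjust it to match your own stack.

```bash
pip install langwatch openai qdrant-client pandas matplotlib
```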
Setup
Start by setting up LangWatch to monitor your RAG application:
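A minimal initialization sketch. It assumes the LangWatch SDK reads LANGWATCH_API_KEY from the environment and exposes a setup() entry point; check the LangWatch docs for the exact call your SDK version expects.

```python
import os

import langwatch

# Assumption: the SDK reads the API key from the environment.
os.environ["LANGWATCH_API_KEY"] = "your-langwatch-api-key"

# Assumption: newer SDK versions expose an explicit setup call;
# consult the LangWatch documentation if yours differs.
langwatch.setup()
```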
Retrieval Metrics
Before building our RAG system, let’s understand the key metrics we’ll use to evaluate retrieval performance:
Precision measures how many of our retrieved items are actually relevant. If your system retrieves 10 documents but only 5 are relevant, that’s 50% precision.
Recall measures how many of the total relevant items we managed to find. If there are 20 relevant documents in your database but you only retrieve 10 of them, that’s 50% recall.
Mean Reciprocal Rank (MRR) measures how high up in the results the first relevant document appears. If the first relevant document is at position 3, the reciprocal rank for that query is 1/3; MRR is the average of these reciprocal ranks across all queries.
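To make these definitions concrete, here is a small self-contained sketch (not tied to any particular library) that computes the three quantities for a single query from its retrieved and relevant document IDs; MRR is simply the reciprocal rank averaged over all queries.

```python
def precision(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of retrieved documents that are relevant.
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

def recall(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of relevant documents that were retrieved.
    return sum(1 for doc in retrieved if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1 / position of the first relevant document (0 if none was retrieved).
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

# Toy example: the first relevant document appears at position 3.
retrieved = ["doc_7", "doc_2", "doc_4", "doc_9"]
relevant = {"doc_4", "doc_5"}
print(precision(retrieved, relevant))        # 0.25
print(recall(retrieved, relevant))           # 0.5
print(reciprocal_rank(retrieved, relevant))  # 0.333...
```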
If you retrieve a large number of documents (e.g., 100) and only a few are relevant, you have high recall but low precision — forcing the LLM to sift through noise. If you retrieve very few documents and miss many relevant ones, you have high precision but low recall — limiting the LLM’s ability to generate good answers. Assuming LLMs improve at selecting relevant information, recall becomes more and more important. That’s why most practitioners focus on optimizing recall. MRR is helpful when displaying citations to users. If citation quality isn’t critical for your app, focusing on precision and recall is often enough.
Generating Synthetic Data
In many domains - enterprise tools, legal, finance, internal docs - you don’t start with an evaluation dataset. You don’t have thousands of labeled questions or relevance scores. You barely have users. But you do have access to your own corpus, and with a bit of prompting you can start generating useful data from it. If you already have a dataset, you can use it directly. If not, you can generate a synthetic dataset using LangWatch’s data_simulator library. For retrieval evaluation, your dataset should contain queries and the expected document IDs that should be retrieved. In this example, we downloaded four research papers (GPT-1, GPT-2, GPT-3, and GPT-4) and will use data_simulator to generate queries based on them.
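The data_simulator calls themselves aren’t reproduced in this excerpt, so purely as an illustration of the idea, here is a generic sketch that uses the OpenAI client directly: prompt an LLM with each paper’s text plus a few example queries, and record which document each generated query should retrieve. The prompt, model name, and data layout are assumptions for illustration, not the library’s API.

```python
from openai import OpenAI

client = OpenAI()

EXAMPLE_QUERIES = [
    "how does gpt-3 handle few-shot prompting?",
    "what data was gpt-2 trained on?",
]

def generate_queries(doc_id: str, text: str, n: int = 3) -> list[dict]:
    # Ask the model for realistic, user-style questions answerable from this document.
    prompt = (
        "Here are examples of how users phrase questions:\n"
        + "\n".join(EXAMPLE_QUERIES)
        + f"\n\nWrite {n} similar questions that can be answered using the text below. "
        "Return one question per line.\n\n"
        + text[:4000]
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    questions = [
        line.strip()
        for line in response.choices[0].message.content.splitlines()
        if line.strip()
    ]
    # Each row pairs a synthetic query with the document it should retrieve.
    return [{"query": q, "expected_doc_id": doc_id} for q in questions]

# papers = {"gpt-1": "...", "gpt-2": "...", ...}  # doc_id -> extracted paper text
# dataset = [row for doc_id, text in papers.items() for row in generate_queries(doc_id, text)]
# print(dataset[:3])
```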
This library allows us to provide a context and example queries, and it will generate a dataset of queries and expected document IDs. Let’s take a look at some of the queries it generated:
Notice how the questions even look like they could come from a real user! This is because we provided example queries that resembled real user behavior. This is a quick way to get started with evaluating your RAG application. As you start collecting real-world data, you can provide those queries as example_queries and generate even more useful data.
Setting up a Vector Database
Let’s use a vector database to store our documents and retrieve them based on user queries. We’ll initialize two collections, one with small embeddings and one with large embeddings. This will help us test the performance of our RAG system with different embedding models.
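Here is one way such a setup could look, assuming an in-memory Qdrant instance and OpenAI’s text-embedding-3-small (1536 dimensions) and text-embedding-3-large (3072 dimensions) models; swap in whatever vector store and embedding models your stack actually uses.

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()
qdrant = QdrantClient(":memory:")  # assumed in-memory instance for experimentation

# One collection per embedding model, sized to that model's output dimension.
EMBEDDING_CONFIGS = {
    "papers_small": ("text-embedding-3-small", 1536),
    "papers_large": ("text-embedding-3-large", 3072),
}

for collection, (_, dim) in EMBEDDING_CONFIGS.items():
    qdrant.create_collection(
        collection_name=collection,
        vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
    )

def embed(texts: list[str], model: str) -> list[list[float]]:
    response = openai_client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

def index_documents(collection: str, docs: dict[str, str]) -> None:
    # docs maps document IDs to text; the payload keeps the ID for evaluation later.
    model, _ = EMBEDDING_CONFIGS[collection]
    vectors = embed(list(docs.values()), model)
    points = [
        PointStruct(id=i, vector=vec, payload={"doc_id": doc_id})
        for i, (doc_id, vec) in enumerate(zip(docs.keys(), vectors))
    ]
    qdrant.upsert(collection_name=collection, points=points)
```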
Parametrizing our Retrieval Pipeline
The key to running quick experiments is to parametrize the retrieval pipeline. This makes it easy to swap different retrieval methods as your RAG system evolves. In this example, we’ll compare a small and large embedding model based on recall and MRR. We’ll also vary the number of retrieved documents (k) to see how performance changes.
First, we’ll define two functions: one for retrieving documents and one for evaluating retrieval performance.
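A sketch of both functions, under the same assumptions as the Qdrant setup above (collections keyed by embedding model, payloads carrying a doc_id): retrieve embeds the query and returns the top-k document IDs, and evaluate scores them against the expected IDs with recall and reciprocal rank.

```python
def retrieve(query: str, collection: str, k: int) -> list[str]:
    # Embed the query with the collection's embedding model and return top-k doc IDs.
    model, _ = EMBEDDING_CONFIGS[collection]
    query_vector = embed([query], model)[0]
    hits = qdrant.search(collection_name=collection, query_vector=query_vector, limit=k)
    return [hit.payload["doc_id"] for hit in hits]

def evaluate(retrieved: list[str], expected: set[str]) -> dict[str, float]:
    # Recall: how many of the expected documents were retrieved.
    found = {doc for doc in retrieved if doc in expected}
    recall = len(found) / len(expected) if expected else 0.0
    # Reciprocal rank: 1 / position of the first expected document.
    rr = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in expected:
            rr = 1 / rank
            break
    return {"recall": recall, "reciprocal_rank": rr}
```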
Now we can set up our parametrized retrieval pipeline.
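And a sketch of the experiment loop, sweeping both collections and a few values of k over the synthetic dataset (assumed here to be a list of rows with a query and an expected_doc_id) and averaging the metrics per configuration.

```python
import pandas as pd

K_VALUES = [1, 3, 5, 10]
results = []

for collection in EMBEDDING_CONFIGS:
    for k in K_VALUES:
        per_query = [
            evaluate(retrieve(row["query"], collection, k), {row["expected_doc_id"]})
            for row in dataset  # assumed synthetic dataset from the previous step
        ]
        results.append({
            "collection": collection,
            "k": k,
            "recall": sum(m["recall"] for m in per_query) / len(per_query),
            "mrr": sum(m["reciprocal_rank"] for m in per_query) / len(per_query),
        })

results_df = pd.DataFrame(results)
print(results_df.sort_values(["collection", "k"]))
```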
Visualizing the Results
Let’s visualize the results:
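A simple pandas/matplotlib sketch, assuming the results_df built in the loop above: one line per collection, with k on the x-axis and recall on the y-axis (the same pattern works for MRR).

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for collection, group in results_df.groupby("collection"):
    group = group.sort_values("k")
    ax.plot(group["k"], group["recall"], marker="o", label=collection)

ax.set_xlabel("k (number of retrieved documents)")
ax.set_ylabel("average recall")
ax.set_title("Recall by embedding model and k")
ax.legend()
plt.show()
```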
We can see that the best configuration for recall is the small embedding model with k=10. This is surprising, as we would expect the large embedding model to perform better. That said, if we cared much more about citations (and hence MRR), the large embedding model might be preferred.
Conclusion
Based on our evaluation results, we can now make data-driven decisions about the RAG system. In this case, the smaller embedding model outperformed the larger one for our use case, which brings both performance and cost benefits. Since many factors influence RAG performance, it’s important to run more experiments — varying parameters like:
- Document chunking strategies: Try different chunk sizes and overlap percentages
- Adding a reranker: Test if a separate reranking step improves precision
- Hybrid retrieval: Combine vector search with BM25 or other keyword-based methods
- Query expansion: Test if expanding queries with an LLM improves recall
Keep in mind: these results are specific to our test dataset. Your evaluations may reveal different trade-offs based on your domain and data characteristics.
In the next notebook, we’ll explore how fine-tuning embedding models can impact retrieval — and why you (almost) always should.
For the full notebook, check it out on GitHub.