Finetuning Agents with GRPO
Learn how to enhance the performance of agentic systems by fine-tuning them with Group Relative Policy Optimization (GRPO).
In this cookbook, we’ll explore how to enhance the performance of agentic systems by fine-tuning them with Group Relative Policy Optimization (GRPO). Specifically, we’ll focus on query rewriting - a critical component in retrieval systems that transforms vague user questions into more effective search queries.
What makes this approach particularly exciting is that we’ll be using a smaller model - Qwen3 1.7B - rather than relying on massive models like GPT-4. This demonstrates how GRPO can unlock impressive capabilities from more efficient, cost-effective models that can run locally or on modest hardware.
GRPO, as implemented in DSPy, is a powerful technique that generalizes popular online reinforcement learning algorithms, enabling more effective learning from interactions. By applying GRPO to query rewriting with smaller models, we can systematically improve retrieval performance without the computational and financial costs of larger models.
In this notebook, we’ll walk through:
- Setting up a DSPy environment with the Qwen3 1.7B model
- Creating a simple query rewriting agent for retrieval
- Defining a reward function based on retrieval success
- Fine-tuning the query rewriter with GRPO
- Evaluating the performance improvements
By the end, you’ll understand how to apply GRPO to optimize query rewriting using smaller models, achieving better performance without relying on massive models or extensive manual prompt engineering.
Requirements
Before we begin, ensure you have the necessary packages. If you’re running this in an environment where dspy and its dependencies are not yet installed, you might need to install them. For this notebook, the key library is dspy, along with whatever you need for data handling or specific model interactions.
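A minimal install cell might look like the sketch below. The exact package list depends on your environment; the arbor-ai package is assumed here because it provides the local RL server used in the next section.

```python
# Run these in a notebook cell; on the command line, drop the leading "%".
%pip install -U dspy
%pip install arbor-ai  # assumed dependency: local inference/RL server used below
```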
Set up
First, let’s configure our environment. This involves connecting to an AI model provider. In this example, we’ll set up a connection to a local Arbor server, which will act as our Reinforcement Learning (RL) server. This server handles inference and RL requests over HTTP. We’ll also specify and load the Qwen3-1.7B model.
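A sketch of this configuration is shown below. It assumes an Arbor server is already running locally (the port number is an arbitrary choice here) and that your DSPy version ships the Arbor client; adjust the model identifier and endpoint to match your setup.

```python
import dspy
from dspy.clients.lm_local_arbor import ArborProvider

# Assumes an Arbor RL server is already running locally, e.g.:
#   python -m arbor.cli serve --arbor-config arbor.yaml
port = 7453  # illustrative port; match whatever your Arbor config uses

local_lm = dspy.LM(
    model="openai/arbor:Qwen/Qwen3-1.7B",      # the Qwen3-1.7B model served by Arbor
    provider=ArborProvider(),
    temperature=0.7,
    api_base=f"http://localhost:{port}/v1/",
    api_key="arbor",
)
dspy.configure(lm=local_lm)
```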
Load Dataset
With our environment configured, the next step is to load a dataset. For this example, we’ll use a dataset containing questions about GPT research papers (GPT-1, GPT-2, GPT-3, GPT-4). Each example contains a query and its expected answer.
DSPy works with examples in a specific format, so we’ll convert our raw data into dspy.Example
objects. Each example will have a question as input and the expected answer for evaluation. We’ll split our dataset into training, validation, and test sets to properly evaluate our approach.
The training set will be used to optimize our agent, the validation set to tune parameters and monitor progress, and the test set for final evaluation.
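Concretely, the conversion and split might look like the following sketch, where raw_data is assumed to be a list of dicts with question and answer fields (the field names and split sizes here are illustrative):

```python
import random
import dspy

# raw_data is assumed to be a list of {"question": ..., "answer": ...} dicts
# loaded from the GPT-papers QA dataset used in this notebook.
examples = [
    dspy.Example(question=row["question"], answer=row["answer"]).with_inputs("question")
    for row in raw_data
]

# Shuffle deterministically, then split into train / validation / test.
random.Random(0).shuffle(examples)
trainset, valset, testset = examples[:200], examples[200:300], examples[300:]
```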
Implement Search Functionality
Before building our agent, we need to implement the search functionality that will retrieve relevant documents based on a query. In a real-world application, this might connect to a vector database or search engine.
For this example, we’ll create a simple search function that simulates document retrieval from our corpus of GPT research papers. The function will:
- Take a query string and number of results (k) as input
- Tokenize and embed the query
- Retrieve the k most relevant documents based on embedding similarity
- Return the list of retrieved documents
This search function will be used by our agent to find information relevant to user questions.
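As a stand-in for the notebook’s retriever, here is a self-contained sketch that embeds the corpus with TF-IDF and returns the top-k chunks by cosine similarity. The real notebook may use a different embedding model; corpus is assumed to be a list of text chunks from the GPT research papers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# corpus is assumed to be a list of text chunks from the GPT research papers.
vectorizer = TfidfVectorizer().fit(corpus)
corpus_vectors = vectorizer.transform(corpus)

def search(query: str, k: int = 5) -> list[str]:
    """Return the k corpus chunks most similar to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, corpus_vectors)[0]
    top_k = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top_k]
```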
Building the Agent
Now we’ll create our agent using DSPy’s module system. Our agent will be a simple query rewriter that takes a user question, rewrites it to be more specific and search-friendly, and then retrieves relevant documents.
The agent consists of two main components:
- A query rewriting module that uses Chain-of-Thought reasoning to improve the original question
- A document retrieval step that uses our search function to find relevant information
This simple agent will serve as our baseline before optimization with GRPO.
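A minimal version of the agent might look like the following sketch. The signature field names are illustrative, but the structure (a dspy.ChainOfThought rewriter wrapped in a dspy.Module that calls the search function) mirrors the description above.

```python
import dspy

class RewriteQuery(dspy.Signature):
    """Rewrite a user question into a specific, search-friendly query."""
    question: str = dspy.InputField()
    rewritten_query: str = dspy.OutputField()

class QueryRewriterAgent(dspy.Module):
    def __init__(self, k: int = 5):
        super().__init__()
        self.k = k
        self.rewrite = dspy.ChainOfThought(RewriteQuery)

    def forward(self, question: str):
        # Rewrite the question, then retrieve documents for the rewritten query.
        rewritten = self.rewrite(question=question).rewritten_query
        docs = search(rewritten, k=self.k)
        return dspy.Prediction(rewritten_query=rewritten, retrieved_docs=docs)

agent = QueryRewriterAgent()
```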
Defining the Reward Function
For GRPO to work effectively, we need to define a reward function that evaluates the performance of our agent. This function will determine how well the agent is doing and guide the optimization process.
In our case, we’ll use a simple reward function that checks if any of the retrieved documents contain the expected answer. This binary reward (0 or 1) will indicate whether the agent successfully found the information needed to answer the user’s question.
For this example, we’ll keep it simple with a binary reward based on exact substring matching.
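A sketch of such a reward function is shown below. It follows DSPy’s usual metric signature (example, prediction, optional trace) and assumes case-insensitive substring matching as the check for “contains the expected answer”.

```python
def retrieval_reward(example, prediction, trace=None) -> float:
    """Return 1.0 if any retrieved document contains the expected answer, else 0.0."""
    answer = example.answer.lower()
    return float(any(answer in doc.lower() for doc in prediction.retrieved_docs))
```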
Evaluating the Baseline Agent
Before optimizing our agent, we need to establish a baseline performance. This will help us measure the improvement achieved through GRPO.
We’ll use DSPy’s evaluation framework to test our agent on the validation set. The evaluation will:
- Run the agent on each example in the validation set
- Apply our reward function to measure performance
- Calculate the average reward across all examples
This baseline score will serve as our reference point for improvement.
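With DSPy’s Evaluate helper, the baseline run might look like this sketch (the thread count and display options are arbitrary choices):

```python
evaluate = dspy.Evaluate(
    devset=valset,
    metric=retrieval_reward,
    num_threads=8,
    display_progress=True,
    display_table=5,
)

baseline_score = evaluate(agent)
print(f"Baseline retrieval reward: {baseline_score}")
```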
Optimizing with GRPO
Now that we have our baseline agent and evaluation metric, we can apply GRPO to optimize the agent’s performance. GRPO works by:
- Sampling multiple outputs from the agent for each input
- Evaluating each output using our reward function
- Using the rewards to update the model’s parameters through reinforcement learning
The key parameters for GRPO include:
- `update_interval`: How often to update the model
- `num_samples_per_input`: How many different outputs to generate for each input
- `num_train_steps`: Total number of training steps
- `beta`: Controls the trade-off between optimizing for rewards and staying close to the original model
We’ll configure these parameters and run the optimization process.
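The optimization step might look roughly like the following sketch. The values are illustrative, and the exact constructor arguments can differ across dspy versions; in particular, some settings such as update_interval and beta may be passed through a train_kwargs dict rather than directly.

```python
from dspy.teleprompt.grpo import GRPO

# Training-server settings (illustrative values; names follow the list above).
train_kwargs = {
    "update_interval": 25,   # how often to push weight updates to the model
    "beta": 0.04,            # KL penalty toward the original model
}

compiler = GRPO(
    metric=retrieval_reward,     # reward function defined above
    num_samples_per_input=8,     # rollouts sampled per training example
    num_train_steps=500,         # total optimization steps
    train_kwargs=train_kwargs,
)

optimized_agent = compiler.compile(
    student=agent,
    trainset=trainset,
    valset=valset,
)
```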
Evaluating the Optimized Agent
After optimizing our agent with GRPO, we need to evaluate its performance to see how much it has improved. We’ll use the same evaluation framework as before, but now with our optimized agent.
We’ll also compare the baseline and optimized agents on a specific example to see the differences in their behavior. This will help us understand how GRPO has changed the agent’s query rewriting strategy.
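Reusing the evaluator from the baseline run and inspecting a single validation question side by side might look like:

```python
optimized_score = evaluate(optimized_agent)
print(f"Baseline:  {baseline_score}")
print(f"Optimized: {optimized_score}")

# Compare query-rewriting behavior on one example.
sample = valset[0]
print("Original question: ", sample.question)
print("Baseline rewrite:  ", agent(question=sample.question).rewritten_query)
print("Optimized rewrite: ", optimized_agent(question=sample.question).rewritten_query)
```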
Conclusion
In this cookbook, we explored how to apply GRPO to optimize an LLM-based agent for query rewriting using a compact model like Qwen3 1.7B. While the baseline performance was modest (28%), the GRPO-optimized agent did not improve on it in this short run (26%).
This result highlights an important consideration: meaningful improvements with reinforcement learning methods like GRPO often require longer training durations and possibly more diverse training data. In our experiment, training was conducted on 8×A100 GPUs for approximately 2 hours, which likely wasn’t sufficient time for the model to fully benefit from the GRPO optimization process.
That said, the infrastructure and methodology are solid. GRPO offers a systematic approach to improving agent behavior through preference-based feedback, and with extended training time or further reward shaping, it’s reasonable to expect more substantial performance gains.
For the full notebook, check it out on GitHub.