- Setting up a DSPy environment with the Qwen3-1.7B model
- Creating a simple query rewriting agent for retrieval
- Defining a reward function based on retrieval success
- Fine-tuning the query rewriter with GRPO
- Evaluating the performance improvements
Requirements
Before we begin, ensure you have the necessary packages. If you’re running this in an environment where dspy and its dependencies are not yet installed, you will need to install them. For this notebook, the key library is dspy; depending on your setup, you may also need packages for data handling or for interacting with specific models.
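If dspy is missing, it can be installed directly from a notebook cell. The snippet below is a minimal sketch; version pins and any extra data-handling packages are left to your environment.

```python
# Install DSPy if it is not already available in this environment.
# (Run in a notebook cell; versions are intentionally left unpinned here.)
%pip install -U dspy
```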
Set up
First, let’s configure our environment. This involves connecting to an AI model provider. In this example, we’ll set up a connection to a local Arbor server, which will act as our Reinforcement Learning (RL) server. This server handles inference and RL requests over HTTP. We’ll also specify and load the Qwen3-1.7B model.
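A minimal configuration sketch is shown below. The ArborProvider import path, the port number, and the api_base/api_key values are assumptions based on recent DSPy versions and a locally running Arbor server; adjust them to match your own setup.

```python
import dspy

# Assumption: recent DSPy releases ship a local Arbor client under this path.
from dspy.clients.lm_local_arbor import ArborProvider

port = 7453  # hypothetical port where your Arbor server is listening

local_lm = dspy.LM(
    model="openai/arbor:Qwen/Qwen3-1.7B",     # model identifier served by Arbor
    provider=ArborProvider(),
    api_base=f"http://localhost:{port}/v1/",  # Arbor exposes an OpenAI-compatible API
    api_key="arbor",                          # placeholder key for the local server
    temperature=0.7,
)
dspy.configure(lm=local_lm)
```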
Load Dataset
With our environment configured, the next step is to load a dataset. For this example, we’ll use a dataset containing questions about the GPT research papers (GPT-1, GPT-2, GPT-3, GPT-4). Each example contains a query and its expected answer. DSPy works with examples in a specific format, so we’ll convert our raw data into dspy.Example objects. Each example will have a question as input and the expected answer for evaluation. We’ll split our dataset into training, validation, and test sets to properly evaluate our approach.
The training set will be used to optimize our agent, the validation set to tune parameters and monitor progress, and the test set for final evaluation.
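A sketch of the conversion and split is shown below. The raw_data variable and its question/answer fields are assumptions about how the source data is stored, and the 60/20/20 split ratio is an arbitrary illustrative choice.

```python
import random

# Assumption: raw_data is a list of dicts like {"question": ..., "answer": ...}
# loaded from the GPT-papers QA dataset.
examples = [
    dspy.Example(question=row["question"], answer=row["answer"]).with_inputs("question")
    for row in raw_data
]

# Shuffle deterministically, then split roughly 60/20/20.
random.Random(0).shuffle(examples)
n = len(examples)
trainset = examples[: int(0.6 * n)]
valset = examples[int(0.6 * n) : int(0.8 * n)]
testset = examples[int(0.8 * n) :]
```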
Implement Search Functionality
Before building our agent, we need to implement the search functionality that will retrieve relevant documents based on a query. In a real-world application, this might connect to a vector database or search engine. For this example, we’ll create a simple search function that simulates document retrieval from our corpus of GPT research papers (a sketch follows the list). The function will:
- Take a query string and a number of results (k) as input
- Tokenize and embed the query
- Retrieve the k most relevant documents based on embedding similarity
- Return the list of retrieved documents
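Below is a minimal, self-contained sketch. It uses scikit-learn’s TfidfVectorizer as a stand-in for the embedding step and assumes a corpus list of passage strings built from the GPT papers; a production setup would typically swap in a neural embedding model or a vector database.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: `corpus` is a list of passage strings extracted from the GPT papers.
corpus = ["...passage 1...", "...passage 2...", "...passage 3..."]

# Fit a TF-IDF "embedder" over the corpus once, up front.
vectorizer = TfidfVectorizer()
corpus_embeddings = vectorizer.fit_transform(corpus)  # (num_docs x vocab) sparse matrix

def search(query: str, k: int = 5) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    query_embedding = vectorizer.transform([query])
    # TF-IDF rows are L2-normalized, so a dot product gives cosine similarity.
    scores = (corpus_embeddings @ query_embedding.T).toarray().ravel()
    top_k = np.argsort(-scores)[:k]
    return [corpus[i] for i in top_k]
```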
Building the Agent
Now we’ll create our agent using DSPy’s module system. Our agent will be a simple query rewriter that takes a user question, rewrites it to be more specific and search-friendly, and then retrieves relevant documents. The agent consists of two main components (a sketch follows the list):
- A query rewriting module that uses Chain-of-Thought reasoning to improve the original question
- A document retrieval step that uses our search function to find relevant information
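A minimal sketch of such a module is shown below. The signature string and the field names rewritten_query and retrieved_docs are illustrative choices rather than taken from the original notebook; the module relies on the search function defined above.

```python
class QueryRewriter(dspy.Module):
    """Rewrite the user's question, then retrieve documents for the rewritten query."""

    def __init__(self, k: int = 5):
        super().__init__()
        self.k = k
        # Chain-of-Thought: the model reasons about the question before rewriting it.
        self.rewrite = dspy.ChainOfThought("question -> rewritten_query")

    def forward(self, question: str) -> dspy.Prediction:
        rewritten = self.rewrite(question=question).rewritten_query
        docs = search(rewritten, k=self.k)
        return dspy.Prediction(rewritten_query=rewritten, retrieved_docs=docs)

agent = QueryRewriter()
```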
Defining the Reward Function
For GRPO to work effectively, we need to define a reward function that evaluates the performance of our agent. This function determines how well the agent is doing and guides the optimization process. In our case, we’ll keep it simple: a binary reward (0 or 1) based on exact substring matching, which checks whether any of the retrieved documents contain the expected answer and therefore whether the agent found the information needed to answer the user’s question.
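A sketch of such a metric, reusing the field names from the agent sketch above, might look like the following; the trace argument is included because DSPy metrics are typically called with one.

```python
def retrieval_reward(example, prediction, trace=None):
    """Binary reward: 1.0 if any retrieved document contains the expected answer
    as an exact substring, else 0.0."""
    docs = prediction.retrieved_docs
    return 1.0 if any(example.answer in doc for doc in docs) else 0.0
```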
Evaluating the Baseline Agent
Before optimizing our agent, we need to establish a baseline performance. This will help us measure the improvement achieved through GRPO. We’ll use DSPy’s evaluation framework to test our agent on the validation set. The evaluation will (see the sketch after this list):
- Run the agent on each example in the validation set
- Apply our reward function to measure performance
- Calculate the average reward across all examples
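A minimal sketch using dspy.Evaluate is shown below; num_threads is an arbitrary choice and can be tuned to your hardware.

```python
# Evaluate the unoptimized agent on the validation set.
evaluate = dspy.Evaluate(
    devset=valset,
    metric=retrieval_reward,
    num_threads=8,
    display_progress=True,
)
baseline_score = evaluate(agent)
print(f"Baseline retrieval reward: {baseline_score}")
```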
Optimizing with GRPO
Now that we have our baseline agent and evaluation metric, we can apply GRPO to optimize the agent’s performance. GRPO works by:
- Sampling multiple outputs from the agent for each input
- Evaluating each output using our reward function
- Using the rewards to update the model’s parameters through reinforcement learning
Key parameters to configure include:
- update_interval: How often to update the model
- num_samples_per_input: How many different outputs to generate for each input
- num_train_steps: Total number of training steps
- beta: Controls the trade-off between optimizing for rewards and staying close to the original model
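A sketch of how the optimizer might be invoked is shown below. DSPy’s GRPO support is experimental, so the import path, constructor signature, and numeric values here are assumptions; the keyword names simply mirror the parameters listed above and should be checked against the DSPy version you have installed.

```python
# Assumption: recent DSPy releases expose an experimental GRPO optimizer under
# dspy.teleprompt; the exact module path and keyword names may differ by version.
from dspy.teleprompt import GRPO

compiler = GRPO(
    metric=retrieval_reward,
    update_interval=25,        # how often updated weights are pushed to the RL server
    num_samples_per_input=8,   # group size: outputs sampled per input
    num_train_steps=500,       # total number of training steps
    beta=0.04,                 # KL penalty keeping the policy close to the base model
)

optimized_agent = compiler.compile(
    student=agent,
    trainset=trainset,
    valset=valset,
)
```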