In this cookbook, we demonstrate how to enhance retrieval performance by implementing hybrid search in your RAG applications. We’ll explore how structured metadata can dramatically improve search relevance and precision beyond what vector similarity alone can achieve.

When users search for products, documents, or other content, they often have specific attributes in mind. For example, a shopper might want “red dresses for summer occasions” or a researcher might need “papers on climate change published after 2020.” Pure semantic search might miss these nuances, but metadata filtering allows you to combine the power of vector search with explicit attribute filtering.

As always, we’ll focus on data-driven approaches to measure and improve retrieval performance.

Requirements

Before starting, ensure you have the following packages installed:

pip install langwatch datasets pydantic openai instructor chromadb pandas numpy tqdm rich tenacity

Setup

Start by setting up LangWatch to monitor your RAG application:

import chromadb
import openai
import getpass
import langwatch

# Collect API keys and initialize a persistent Chroma client
openai.api_key = getpass.getpass('Enter your OpenAI API key: ')
langwatch.api_key = getpass.getpass('Enter your LangWatch API key: ')
huggingface_api_key = getpass.getpass("Enter your Hugging Face API key: ")
chroma_client = chromadb.PersistentClient()

The Dataset

In this cookbook, we’ll work with a product catalog dataset containing fashion items with structured metadata. The dataset includes:

  • Basic product information: titles, descriptions, brands, and prices
  • Categorization: categories, subcategories, and product types
  • Attributes: structured characteristics like sleeve length, neckline, and fit
  • Materials and patterns: fabric types and design patterns

Here’s what our taxonomy structure looks like:

{
  "taxonomy_map": {
    "Women": {
      "Tops": {
        "product_type": [
          "T-Shirts",
          "Blouses",
          "Sweaters",
          "Cardigans",
          "Tank Tops",
          "Hoodies",
          "Sweatshirts"
        ],
        "attributes": {
          "Sleeve Length": [
            "Sleeveless",
            "Short Sleeve",
            "3/4 Sleeve",
            "Long Sleeve"
          ],
          "Neckline": [
            "Crew Neck",
            "V-Neck",
            "Turtleneck",
            "Scoop Neck",
            "Cowl Neck"
          ],
          "Fit": ["Regular", "Slim", "Oversized", "Cropped"]
        }
      },
      "Bottoms": {
        "product_type": ["Pants", "Jeans", "Shorts", "Skirts", "Leggings"],
        "attributes": {
          // Additional attributes...
        }
      }
    }
  }
}

Having well-structured metadata enables more precise filtering and can significantly improve search relevance, especially for domain-specific applications where users have particular attributes in mind. This data might come from manual tagging by product managers or automated processes with LLMs.
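If you need to generate this metadata automatically, an LLM can do the tagging for you. Below is a minimal sketch using instructor (one of the packages in our requirements); the ProductTags model, the prompt, and the tag_product helper are illustrative assumptions, not part of the dataset pipeline in this cookbook:

import json

import instructor
from openai import OpenAI
from pydantic import BaseModel

class ProductTags(BaseModel):
    category: str
    subcategory: str
    product_type: str
    material: str

# Same taxonomy file we use later in this cookbook
with open("../data/taxonomy.json") as f:
    taxonomy = json.load(f)

# instructor patches the OpenAI client so responses are validated against a Pydantic model
tagging_client = instructor.from_openai(OpenAI(api_key=openai.api_key))

def tag_product(description: str) -> ProductTags:
    """Map a raw product description onto our taxonomy with an LLM."""
    return tagging_client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=ProductTags,
        messages=[
            {"role": "system", "content": "You tag products using ONLY values from the provided taxonomy."},
            {"role": "user", "content": f"Taxonomy: {taxonomy}\n\nDescription: {description}"},
        ],
    )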

Let’s first load the dataset from Hugging Face:

from datasets import load_dataset

labelled_dataset = load_dataset("ivanleomk/labelled-ecommerce-taxonomy")["train"]

Now we can load it into our Chroma vector database.

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Initialize Chroma
client = chromadb.PersistentClient()

# Initialize embeddings
embedding_function = OpenAIEmbeddingFunction(model_name="text-embedding-3-large", api_key=openai.api_key)

# Create the collection
collection = client.get_or_create_collection(name="collection", embedding_function=embedding_function)

# Add each product to the collection
for row in labelled_dataset:
    collection.add(
        documents=[row['description']],
        ids=[str(row['id'])],
        metadatas=[{
            'category': row['category'],
            'subcategory': row['subcategory'],
            'product_type': row['product_type'],
            'occasions': row['occasions'],
            'brand': row['brand'],
            'price': float(row['price']),
            'attributes': row['attributes'],
            'material': row['material'],
            'pattern': row['pattern'],
            'title': row['title'],
            'id': row['id']
        }]
    )

print(f"Created collection with {collection.count()} documents.")

Understanding Our Vector Database

We’ve now loaded our product catalog into a Chroma vector database with the following components:

  1. Document Text: The product descriptions that will be embedded and used for semantic search
  2. Metadata: Structured attributes like category, price, material, etc., that can be used for filtering

This setup allows us to perform both:

  • Pure semantic search: Finding products based on the meaning of their descriptions
  • Hybrid search: Combining semantic similarity with explicit metadata filters

The embeddings are generated using OpenAI’s embedding model, which creates high-dimensional vectors that represent the semantic content of each product description. Similar products will have vectors that are close together in this high-dimensional space.
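Before generating any evaluation data, it’s worth sanity-checking the index with a quick nearest-neighbour lookup (the query string below is just an illustration):

# Illustrative query to verify the collection returns sensible neighbours
preview = collection.query(
    query_texts=["a sleeveless lace top for a dinner date"],
    n_results=3,
)

# Chroma returns parallel lists per query: ids, distances, documents, metadatas
for doc_id, distance in zip(preview["ids"][0], preview["distances"][0]):
    print(doc_id, round(distance, 4))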

Generating Synthetic Data

When you don’t have production data to start with, you can generate synthetic data to simulate a real-world scenario. We already have the ‘output’, which is the clothing item we just embedded. We now want to generate synthetic queries that would be relevant to the clothing item.

In this case, we’ll use gpt-4o-mini to generate realistic user queries that would naturally lead to each product in our catalog. This gives us query-product pairs where we know the ground truth relevance.

import random
from openai import OpenAI
from tqdm import tqdm

# Initialize a dedicated OpenAI client (a distinct name so it doesn't clobber the Chroma client above)
openai_client = OpenAI(api_key=openai.api_key)

# Define query types to generate variety
query_types = [
    "Basic search for specific item",
    "Search with price constraint",
    "Search for specific occasion",
    "Search with material preference",
    "Search with style/attribute preference"
]

def generate_synthetic_query(item):
    """Generate a realistic search query for a clothing item"""

    # Select a random query type
    query_type = random.choice(query_types)

    # Create prompt for the LLM
    prompt = f"""
    Generate a realistic search query that would lead someone to find this specific clothing item:

    Item Details:
    - Title: {item["title"]}
    - Description: {item["description"]}
    - Category: {item["category"]}
    - Subcategory: {item["subcategory"]}
    - Product Type: {item["product_type"]}
    - Price: ${item["price"]}
    - Material: {item["material"]}
    - Attributes: {item["attributes"]}
    - Occasions: {item["occasions"]}

    The query should be in a conversational tone, about 10-20 words, and focus on a {query_type.lower()}.
    Don't mention the exact product name, but include specific details that would make this item a perfect match.

    Example: For a $120 silk blouse with long sleeves, a query might be:
    "Looking for an elegant silk top with long sleeves for work, under $150"
    """

    # Generate query using OpenAI
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that generates realistic shopping queries."},
            {"role": "user", "content": prompt}
        ]
    )

    # Extract the generated query
    query = response.choices[0].message.content.strip().strip('"')

    return {"query": query, **item}

# Generate queries
synthetic_queries = []
for item in tqdm(labelled_dataset, desc="Generating queries"):
    query_data = generate_synthetic_query(item)
    synthetic_queries.append(query_data)
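Since this loop makes one API call per item, transient rate-limit errors become likely at scale. One way to harden it is a retry wrapper with tenacity (also in our requirements); this is a sketch, not part of the original pipeline:

from tenacity import retry, stop_after_attempt, wait_exponential

# Retry up to 3 times with exponential backoff between attempts
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def generate_with_retry(item):
    return generate_synthetic_query(item)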

Let’s visualize what this looks like:

from rich import print

print(synthetic_queries[0])
{
    'query': 'Searching for a sleeveless top with lace detailing at the neckline for casual outings and dinner
dates.',
    'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=768x1024 at 0x13E0BB230>,
    'title': 'Lace Detail Sleeveless Top',
    'brand': 'H&M',
    'description': "Elevate your casual wardrobe with this elegant sleeveless top featuring intricate lace
detailing at the neckline. Perfect for both day and night, it's crafted from a soft, breathable fabric for all-day
comfort.",
    'category': 'Women',
    'subcategory': 'Tops',
    'product_type': 'Tank Tops',
    'attributes': '[{"name": "Sleeve Length", "value": "Sleeveless"}, {"name": "Neckline", "value": "Crew Neck"}]',
    'material': 'Cotton',
    'pattern': 'Solid',
    'id': 1,
    'price': 181.04,
    'occasions': '["Everyday Wear", "Casual Outings", "Smart Casual", "Dinner Dates", "Partywear"]'
}

Query Filtering

To implement metadata filtering, we first need to extract structured filters from natural language queries. This process involves:

  1. Understanding user intent: Identifying what specific attributes the user is looking for
  2. Mapping to our taxonomy: Converting natural language descriptions to our structured metadata schema
  3. Handling ambiguity: Resolving cases where the user’s language doesn’t precisely match our metadata values

This is where LLMs excel - they can understand the nuances of natural language and extract structured information that aligns with our predefined taxonomy. We’ll use a Pydantic model to ensure the extracted filters conform to our expected schema:

from typing import Optional
from pydantic import BaseModel

class Attribute(BaseModel):
    name: str
    values: list[str]

class QueryFilters(BaseModel):
    attributes: list[Attribute]
    material: Optional[list[str]] = None
    min_price: Optional[float] = None
    max_price: Optional[float] = None
    subcategory: str
    category: str
    product_type: list[str]
    occasions: list[str]

With these models in place, we can start extracting query filters from all queries. We need to let the LLM know what the possible taxonomies are. We’ll use the taxonomy.json file for this.

import json
from openai import OpenAI
from tqdm import tqdm

# Load the taxonomy so the LLM knows the allowed values
with open("../data/taxonomy.json") as f:
    taxonomy = json.load(f)

# Initialize a dedicated OpenAI client
openai_client = OpenAI(api_key=openai.api_key)

def extract_filters(item):
    """Extract filters from item metadata"""

    # Create prompt for the LLM
    prompt = f"""
    Extract shopping filters from this query: "{item['query']}"

    Return ONLY a JSON object with these possible keys:
    - category: The clothing category (e.g., "Women")
    - subcategory: The subcategory (e.g., "Tops", "Bottoms")
    - product_type: The specific product types (e.g., "T-Shirts", "Jeans")
    - attributes: Attribute name/value pairs (e.g., "Sleeve Length": "Sleeveless")
    - max_price: Maximum price as a number (no $ symbol)
    - min_price: Minimum price as a number (no $ symbol)
    - material: The materials (e.g., "Cotton", "Polyester")
    - occasions: The occasions (e.g., "Casual Outings", "Partywear")

    Only include keys that are explicitly mentioned in the query.
    Use ONLY values from the taxonomy I'll provide.
    """

    # Get completion from OpenAI
    response = openai_client.responses.parse(
        model="gpt-4o",
        input=[
            {"role": "system", "content": "You extract structured shopping filters from text queries."},
            {"role": "user", "content": prompt},
            {"role": "user", "content": f"Taxonomy data: {taxonomy}"}
        ],
        text_format=QueryFilters
    )

    # Extract the parsed filters
    filters = response.output_parsed

    return filters

# Extract Filters
filters = []
for item in tqdm(synthetic_queries, desc="Extracting filters"):
    filters.append(extract_filters(item))
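To make the expected shape concrete, here is a hand-constructed QueryFilters instance for a query like the one we printed earlier. These values are illustrative, not an actual model response:

# Hand-written illustration of a parsed filter object (not real model output)
example_filters = QueryFilters(
    attributes=[Attribute(name="Sleeve Length", values=["Sleeveless"])],
    material=None,
    min_price=None,
    max_price=None,
    subcategory="Tops",
    category="Women",
    product_type=["Tank Tops"],
    occasions=["Casual Outings", "Dinner Dates"],
)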

Retrieval Evaluation: Semantic Search vs. Metadata Filtering

Now comes the critical part - evaluating how well each retrieval method performs. We’ll compare pure semantic search against metadata-filtered search using two key metrics:

  1. Recall: The proportion of relevant items successfully retrieved
  2. Mean Reciprocal Rank (MRR): How high relevant items appear in our results

These metrics help us understand different aspects of retrieval quality:

  • High recall means we’re finding most of the relevant items
  • High MRR means we’re ranking relevant items near the top of the results

By comparing these metrics across different retrieval methods, we can make data-driven decisions about which approach works best for our specific use case.

def calculate_recall(predictions: list[str], ground_truth: list[str]):
    """Calculate the proportion of relevant items that were retrieved"""
    return len([label for label in ground_truth if label in predictions]) / len(ground_truth)

def calculate_mrr(predictions: list[str], ground_truth: list[str]):
    """Calculate Mean Reciprocal Rank - how high the relevant items appear in results"""
    mrr = 0
    for label in ground_truth:
        if label in predictions:
            # Find the position of the first relevant item
            mrr = max(mrr, 1 / (predictions.index(label) + 1))
    return mrr

# Evaluation function
def evaluate_retrieval(retrieved_ids, expected_ids):
    """Evaluate retrieval performance using recall and MRR"""
    recall = calculate_recall(retrieved_ids, expected_ids)
    mrr = calculate_mrr(retrieved_ids, expected_ids)

    return {"recall": recall, "mrr": mrr}

For this evaluation, we’ll compare two distinct retrieval approaches:

  • Pure Semantic Search: Using only vector embeddings to find similar items
  • Semantic Search with Metadata Filtering: Combining vector similarity with structured metadata filters

This comparison will demonstrate how metadata filtering can significantly improve retrieval precision and relevance, especially for queries with specific attributes or constraints.

import numpy as np
import pandas as pd

# Define a function for pure semantic search
def pure_semantic_search(query, collection, k=5):
    """Perform pure semantic search without metadata filtering"""
    results = collection.query(
        query_texts=[query],
        n_results=k
    )

    retrieved_ids = results['ids'][0]

    return retrieved_ids

# Define a function for semantic search with metadata filtering
def semantic_search_with_metadata(query, collection, filters, k=5):
    """Perform semantic search with metadata filtering"""
    # Only proceed with filtering if filters are provided
    where_clause = None

    if filters:
        where_conditions = []

        # Add a condition for each filter that was extracted
        if filters.category:
            where_conditions.append({"category": filters.category})
        if filters.subcategory:
            where_conditions.append({"subcategory": filters.subcategory})
        if filters.product_type:
            where_conditions.append({"product_type": {"$in": filters.product_type}})
        if filters.material:
            where_conditions.append({"material": {"$in": filters.material}})
        if filters.min_price is not None:
            where_conditions.append({"price": {"$gte": filters.min_price}})
        if filters.max_price is not None:
            where_conditions.append({"price": {"$lte": filters.max_price}})

        # Combine all conditions with $and operator if we have multiple conditions
        if len(where_conditions) > 1:
            where_clause = {
                "$and": where_conditions
            }
        elif len(where_conditions) == 1:
            where_clause = where_conditions[0]

    # Perform the query with filters
    results = collection.query(
        query_texts=[query],
        n_results=k,
        where=where_clause
    )

    retrieved_ids = results['ids'][0]

    return retrieved_ids
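For intuition, a query whose extracted filters specify a category, a subcategory, a material, and a price cap would yield a where clause like the following (hand-written for illustration, e.g. for “cotton women’s tops under $50”):

# Illustrative where clause for a query like "cotton women's tops under $50"
example_where = {
    "$and": [
        {"category": "Women"},
        {"subcategory": "Tops"},
        {"material": {"$in": ["Cotton"]}},
        {"price": {"$lte": 50.0}},
    ]
}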

Now we can run the evals:

# Create a function to run the evaluation
def run_evaluation(queries, expected_ids, collection, k_values=[3, 5, 10]):
    """Run evaluation for both retrieval methods across different k values"""
    results = []

    for k in tqdm(k_values, desc="Evaluating k values"):
        pure_semantic_metrics = []
        metadata_filtering_metrics = []

        for i, (query, expected) in enumerate(tqdm(zip(queries, expected_ids), desc=f"Evaluating queries for k={k}", total=len(queries))):
            # Get the filters for this query
            query_filters = filters[i] if i < len(filters) else None

            # Run pure semantic search
            pure_semantic_results = pure_semantic_search(query, collection, k=k)
            pure_semantic_eval = evaluate_retrieval(pure_semantic_results, expected)
            pure_semantic_metrics.append(pure_semantic_eval)

            # Run semantic search with metadata filtering
            metadata_results = semantic_search_with_metadata(query, collection, query_filters, k=k)
            metadata_eval = evaluate_retrieval(metadata_results, expected)
            metadata_filtering_metrics.append(metadata_eval)

        # Calculate average metrics
        avg_pure_recall = np.mean([m["recall"] for m in pure_semantic_metrics])
        avg_pure_mrr = np.mean([m["mrr"] for m in pure_semantic_metrics])

        avg_metadata_recall = np.mean([m["recall"] for m in metadata_filtering_metrics])
        avg_metadata_mrr = np.mean([m["mrr"] for m in metadata_filtering_metrics])

        # Store results
        results.append({
            "k": k,
            "method": "pure_semantic",
            "avg_recall": avg_pure_recall,
            "avg_mrr": avg_pure_mrr
        })

        results.append({
            "k": k,
            "method": "metadata_filtering",
            "avg_recall": avg_metadata_recall,
            "avg_mrr": avg_metadata_mrr
        })

    return pd.DataFrame(results)

# Prepare the evaluation data
queries = [item["query"] for item in synthetic_queries]
expected_ids = [[str(item["id"])] for item in synthetic_queries]  # Each expected ID as a list

# Run the evaluation
k_values = [3, 5, 10]
results_df = run_evaluation(queries, expected_ids, collection, k_values)

print(results_df)
 k              method  avg_recall   avg_mrr
 3       pure_semantic    0.921466  0.846422
 3  metadata_filtering    0.816754  0.779232
 5       pure_semantic    0.926702  0.847731
 5  metadata_filtering    0.837696  0.784206
10       pure_semantic    0.942408  0.849913
10  metadata_filtering    0.858639  0.787354

Conclusion

While writing this cookbook, I had secretly hoped that hybrid search would outperform pure semantic search. Most people default to vector embeddings, but in production I have found that structured metadata extraction consistently delivers better results.

However, this analysis shows that every application is different. There is no ‘universal’ best method - the right choice depends on the specific use case and the data at hand. In our particular experiment:

  • Pure semantic search achieved higher recall and MRR across all k values
  • This suggests that for this specific dataset and query set, the semantic meaning captured by embeddings was sufficient
  • The additional complexity of metadata filtering didn’t provide an advantage in this case

This highlights the importance of empirical evaluation rather than assuming one approach is always superior. Some possible reasons for these results:

  1. Our synthetic queries might be particularly well-aligned with the semantic content
  2. The metadata extraction might need refinement to better capture query intent
  3. The dataset might not have enough attribute diversity to showcase the benefits of filtering

I hope this analysis helps you make informed decisions about the best approach for your own use case.

For the full notebook, check it out on GitHub.