Datasets
Understanding the Role of Datasets
Datasets are at the core of the Optimization Studio’s functionality. When working with non-deterministic systems like LLMs, running your tests across multiple examples is crucial for confidence in your results. While you might get lucky with a single successful test, running your LLM against hundreds of examples provides much more reliable validation of your solution.
The good news is that you don’t need an enormous dataset to get started. As few as 20 examples can already provide meaningful results with the DSPy optimizers, thanks to how effectively they use the LLM itself during optimization.
Creating and Managing Datasets (0:50)
If you already use LangWatch for monitoring, you can import the production data generated by your LLMs as a dataset; otherwise, you can create or import a new dataset directly in the Optimization Studio.
Creating and Editing Datasets
Access the dataset editor by double-clicking the dataset in the node or in the sidebar. This opens a spreadsheet-like interface where you can:
- Add new records manually or modify existing entries
- Add or remove columns
- Make real-time changes to experiment on your workflow
- Collaborate with team members and domain experts
Importing Existing Data (1:45)
If you already have data in CSV format, you can easily import it:
- Use the upload CSV option
- Configure column types and formats
- Add additional columns as needed
- Save and immediately use in your workflows
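For reference, the importer expects an ordinary CSV: a header row naming the columns, followed by one example per row. Here is a minimal sketch of producing one in Python; the column names (“question”, “expected_output”) are illustrative, not required by the Studio:

```python
import csv

# Illustrative rows; use whatever input/output fields your workflow expects.
rows = [
    {"question": "What is the capital of France?", "expected_output": "Paris"},
    {"question": "Who wrote Hamlet?", "expected_output": "William Shakespeare"},
]

with open("my_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "expected_output"])
    writer.writeheader()   # the header row becomes the column names in the editor
    writer.writerows(rows)
```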
Dataset Configuration (2:23)
Manual Test Entry Settings
The “manual test entry” setting controls which data point is used during manual execution:
- “Random” (default): Picks a different entry each time you run
- “First Entry”: Always uses the same entry for consistent testing
- This setting only affects manual testing, not full evaluations
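Conceptually, the two modes behave like the sketch below. This is an illustration of the behavior, not the Studio’s internal code:

```python
import random

dataset = ["entry 1", "entry 2", "entry 3"]

def pick_manual_entry(dataset, mode="random"):
    if mode == "first_entry":
        return dataset[0]          # same entry every run: reproducible debugging
    return random.choice(dataset)  # a different entry each run: broader coverage

print(pick_manual_entry(dataset))                      # "Random" mode
print(pick_manual_entry(dataset, mode="first_entry"))  # "First Entry" mode
```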
Dataset Splitting (2:52)
One of the most important aspects of working with datasets is how they’re split for optimization and testing:
Default 80-20 Split
- Optimization Set (80%): Used for training and improving your LLM pipeline
- Test Set (20%): Reserved for validation to ensure your optimizations generalize well
You can adjust this split based on your needs:
- Use fixed numbers instead of percentages
- Modify the split ratio for different use cases
- Balance between optimization data and test data
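As a rough mental model, the split works like the sketch below. The function name and parameters are illustrative, not the Studio’s actual implementation; the rows are assumed to have been shuffled already (see the shuffle seed section next):

```python
# Illustrative sketch of splitting a dataset into optimization and test sets.
def split_dataset(rows, optimization_ratio=0.8, test_size=None):
    # test_size fixes the test set to an absolute number of entries
    # instead of a percentage, mirroring the "fixed numbers" option.
    if test_size is not None:
        cutoff = len(rows) - test_size
    else:
        cutoff = int(len(rows) * optimization_ratio)
    return rows[:cutoff], rows[cutoff:]  # (optimization set, test set)

rows = list(range(100))
optimization_set, test_set = split_dataset(rows)                # 80 / 20
optimization_set, test_set = split_dataset(rows, test_size=25)  # 75 / 25
```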
Shuffle Seed Configuration (3:51)
The shuffle seed is crucial for maintaining consistent, unbiased testing:
- Prevents dataset ordering bias
- Ensures consistent splitting across runs
- Can be modified to test resilience to different data arrangements
- The default seed of 42 can be changed to any number to produce a different random arrangement
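The underlying idea is ordinary seeded shuffling: the same seed always produces the same order, so the optimization/test split stays stable across runs. A minimal Python illustration, using the Studio’s default seed of 42:

```python
import random

rows = list(range(100))

shuffled_a = rows[:]
random.Random(42).shuffle(shuffled_a)

shuffled_b = rows[:]
random.Random(42).shuffle(shuffled_b)

assert shuffled_a == shuffled_b        # same seed -> identical order every run

shuffled_c = rows[:]
random.Random(7).shuffle(shuffled_c)   # a different seed -> a different arrangement
assert shuffled_c != shuffled_a
```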
Evaluation Basics (4:56)
While detailed evaluation is covered in later tutorials, the basic workflow involves:
- Clicking the Evaluate button
- Documenting changes made to your pipeline
- Selecting which dataset partition to evaluate against
- Adding necessary LLM API keys
The evaluation panel provides:
- Total entries processed
- Average cost per entry
- Total runtime
- Overall experiment costs
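These summary numbers are simple aggregates of the per-entry results. A small sketch of how they relate; the field names here are hypothetical, not LangWatch’s API:

```python
# Hypothetical per-entry results from an evaluation run.
entries = [
    {"cost": 0.0021, "duration_s": 1.4},
    {"cost": 0.0018, "duration_s": 1.1},
    {"cost": 0.0025, "duration_s": 1.9},
]

total_entries = len(entries)
total_cost = sum(e["cost"] for e in entries)       # overall experiment cost
average_cost = total_cost / total_entries          # average cost per entry
total_runtime = sum(e["duration_s"] for e in entries)

print(f"{total_entries} entries, avg ${average_cost:.4f}/entry, "
      f"${total_cost:.4f} total, {total_runtime:.1f}s runtime")
```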
This foundation in dataset management sets you up for evaluating output quality and running automated optimizations, both covered in subsequent tutorials.