A good AI application deserves good software, and good software is built on a reliable CI/CD pipeline. Here we will explain how to test your LLM outputs at the level of the deployment pipeline.

Prepare the Test Suite

The first step is to create the test suite. You can add as many tests as you want and import the required parts of your application. In this example we will create a simple test case with handcrafted examples; alternatively, you can load a prepared dataset and test your application against it (see the sketch after the test script below).

test_script.py
import pandas as pd
import pytest
from dotenv import load_dotenv
from langevals import expect
from langevals_langevals.off_topic import (
    AllowedTopic,
    OffTopicEvaluator,
    OffTopicSettings,
)

# Load OPENAI_API_KEY from a local .env file so the evaluator can use it;
# in CI the key is provided as an environment variable instead.
load_dotenv()

# Handcrafted test entries; in a real project these could come from a prepared dataset.
entries = pd.DataFrame(
    {
        "input": [
            "What is unique about your product?",
            "Write a python code that prints fibonacci numbers.",
            "What are the core features of your platform?",
        ],
    }
)


@pytest.mark.parametrize("entry", entries.itertuples())
def test_user_inputs_stay_on_topic(entry):
    # The off-topic evaluator checks whether the user input belongs to one of the allowed topics.
    settings = OffTopicSettings(
        allowed_topics=[
            AllowedTopic(topic="our_product", description="Questions about our company and product"),
            AllowedTopic(topic="small_talk", description="Small talk questions"),
            AllowedTopic(topic="book_a_call", description="Requests to book a call"),
        ],
        model="openai/gpt-3.5-turbo-1106",
    )
    off_topic_evaluator = OffTopicEvaluator(settings=settings)
    expect(input=entry.input).to_pass(off_topic_evaluator)
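
If your test cases live in a prepared dataset rather than being hardcoded, you can load them with pandas and keep the rest of the test unchanged. Below is a minimal sketch; the file path tests/dataset.csv and its "input" column are assumptions for illustration, not part of the example above.

# Sketch: load test entries from a CSV dataset instead of hardcoding them.
# "tests/dataset.csv" with an "input" column is a hypothetical file for this example.
import pandas as pd

entries = pd.read_csv("tests/dataset.csv")

# The same parametrized test then iterates over the loaded rows:
# @pytest.mark.parametrize("entry", entries.itertuples())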

Run with GitHub Actions

The second step is to create a GitHub Actions workflow that specifies how your tests run. You can do that by creating a .github/workflows/ folder and adding a .yaml file to it. Here is an example workflow that runs automatically on every push to the main branch:

Actions
name: Run LangEvals

on:
  push:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pandas pytest "langevals[all]"
        
    - name: Run tests
      env:
        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      run: |
        pytest ./test_script.py

Pay attention: the workflow needs to install all of the libraries used in your test case, and the pytest command must point to the path of your test file.

After writing the workflow, go to your GitHub repository, navigate to Settings > Secrets and variables (left menu) > Actions, and press the New repository secret button. If you want to use evaluators that rely on the LLM-as-a-judge approach, you need to add the API key of your LLM provider as a repository secret. Now any change to your prompts or switch of LLM provider is evaluated automatically, so you can catch regressions in your application before they are deployed.
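
Since LLM-as-a-judge evaluators need the provider API key both locally (from your .env file) and in CI (from the repository secret), it can be useful to skip these tests gracefully when the key is not available, for example on forks that cannot read repository secrets. Here is a minimal sketch using plain pytest, assuming the key is exposed as OPENAI_API_KEY; the test name mirrors the one from test_script.py above.

# Sketch: skip LLM-judged tests when no API key is present in the environment.
import os

import pytest

requires_openai = pytest.mark.skipif(
    not os.getenv("OPENAI_API_KEY"),
    reason="OPENAI_API_KEY is not set; skipping LLM-as-a-judge evaluation",
)


@requires_openai
@pytest.mark.parametrize("entry", entries.itertuples())
def test_user_inputs_stay_on_topic(entry):
    ...  # same body as in test_script.py above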