CI integration

Run pytest-wardenbot on every push so chatbot regressions get caught before customers see them.

GitHub Actions

Drop this into .github/workflows/wardenbot.yml:

name: wardenbot

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch: {}

jobs:
  wardenbot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install
        run: |
          python -m pip install --upgrade pip
          pip install pytest-wardenbot
      - name: Run tests
        env:
          CHATBOT_URL: ${{ secrets.CHATBOT_URL }}
          CHATBOT_TOKEN: ${{ secrets.CHATBOT_TOKEN }}
        run: pytest -v --tb=short

Add CHATBOT_URL (and CHATBOT_TOKEN if your bot needs auth) to repo secrets under Settings → Secrets and variables → Actions.
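A missing secret tends to surface as a confusing connection error deep inside a test run. GitHub Actions substitutes an empty string for an undefined secret, and secrets are not passed to workflows triggered by pull requests from forks. The sketch below is one way to fail early with a clear message. `ChatbotConfig` and `load_chatbot_config` are hypothetical helpers you might place in your own conftest.py; they are not part of pytest-wardenbot, which may already perform its own validation.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ChatbotConfig:
    """Hypothetical container for the two variables the workflow exports."""
    url: str
    token: Optional[str]  # None when the bot needs no auth


def load_chatbot_config(env: Mapping[str, str] = os.environ) -> ChatbotConfig:
    """Read CHATBOT_URL / CHATBOT_TOKEN, failing loudly if the URL is missing."""
    url = env.get("CHATBOT_URL", "").strip()
    if not url:
        # An undefined secret arrives as an empty string, so an empty value
        # is treated the same as a missing one.
        raise RuntimeError("CHATBOT_URL is not set; add it to the repo secrets")
    # The token is optional: an empty or missing value becomes None.
    token = env.get("CHATBOT_TOKEN", "").strip() or None
    return ChatbotConfig(url=url, token=token)
```

Calling `load_chatbot_config()` once at the start of the session (for example from a session-scoped fixture) would turn a misconfigured pipeline into a single, readable error.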

Full example with optional Slack notification on failure: examples/github_actions/wardenbot.yml

GitLab CI

wardenbot:
  image: python:3.12
  script:
    - pip install pytest-wardenbot
    - pytest -v --tb=short
  # CHATBOT_URL and CHATBOT_TOKEN are injected into the job automatically once
  # they are defined under Settings → CI/CD → Variables; no variables: block is needed.
  rules:
    - if: $CI_PIPELINE_SOURCE == "push"
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

CircleCI

version: 2.1

jobs:
  wardenbot:
    docker:
      - image: cimg/python:3.12
    steps:
      - checkout
      - run:
          name: Install
          command: pip install pytest-wardenbot
      - run:
          name: Run tests
          command: pytest -v --tb=short

workflows:
  test:
    jobs:
      - wardenbot:
          context:
            - chatbot-secrets   # exposes CHATBOT_URL + CHATBOT_TOKEN

Cost-aware scheduling

Deterministic tests cost nothing to run because they make no LLM API calls. If you add LLM-judge tests via the [judge] extra, each suite run costs roughly $0.02 in API spend. If the suite runs often, there are three ways to keep that under control:

  • Tag judge tests and gate them. Mark them with @pytest.mark.slow and run them only on main, not on every PR.
  • Run deterministic tests on every push and judge tests on a nightly cron schedule.
  • Skip judge tests entirely if your deterministic tests already cover what you need.
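To put the per-run figure in context, the arithmetic can be sketched as below. The $0.02 is the rough estimate quoted above; the 40 pushes per day is an illustrative assumption, not a measurement.

```python
COST_PER_JUDGE_RUN_USD = 0.02  # rough per-run estimate quoted above


def monthly_judge_cost(runs_per_day: float, days: int = 30,
                       cost_per_run: float = COST_PER_JUDGE_RUN_USD) -> float:
    """Estimated monthly LLM-judge spend in USD, rounded to cents."""
    return round(runs_per_day * days * cost_per_run, 2)


# Hypothetical team pushing 40 times a day vs. a single nightly run.
every_push = monthly_judge_cost(runs_per_day=40)
nightly = monthly_judge_cost(runs_per_day=1)
print(f"every push: ${every_push:.2f}/month, nightly: ${nightly:.2f}/month")
```

Under these assumptions the gap is a few tens of dollars per month per repository, so gating matters mostly for busy repositories or for suites much larger than the one the estimate is based on.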

Example: a split workflow that runs deterministic tests on pull requests and adds judge tests on a nightly schedule:

on:
  pull_request: {}        # deterministic only
  schedule:
    - cron: "0 6 * * *"   # nightly at 06:00 UTC, with judge

jobs:
  wardenbot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install (with judge on schedule only)
        run: |
          if [ "${{ github.event_name }}" = "schedule" ]; then
            pip install 'pytest-wardenbot[judge]'
          else
            pip install pytest-wardenbot
          fi
      - name: Run tests
        env:
          CHATBOT_URL: ${{ secrets.CHATBOT_URL }}
          CHATBOT_TOKEN: ${{ secrets.CHATBOT_TOKEN }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: pytest -v --tb=short
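The workflow above runs the same pytest command for both events and differs only in which package is installed. If you would rather deselect judge tests explicitly by marker, one option is a small helper that builds the pytest arguments from the event name. This is a sketch: `pytest_args` is a hypothetical function, and it assumes judge tests carry @pytest.mark.slow as suggested earlier. `GITHUB_EVENT_NAME` is a default environment variable set by GitHub Actions, and `-m` is pytest's standard marker-expression flag.

```python
import os
import shlex
from typing import List, Mapping


def pytest_args(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build pytest arguments for the current CI event."""
    args = ["-v", "--tb=short"]
    # GITHUB_EVENT_NAME is "pull_request", "push", "schedule", and so on.
    if env.get("GITHUB_EVENT_NAME") != "schedule":
        # Assumption: judge tests are marked with @pytest.mark.slow.
        args += ["-m", "not slow"]
    return args


# Show the command each event would produce.
for event in ("pull_request", "schedule"):
    command = ["pytest", *pytest_args({"GITHUB_EVENT_NAME": event})]
    print(f"{event}: {shlex.join(command)}")
```

If you use a custom marker such as slow, registering it in your pytest configuration (the `markers` option) avoids unknown-marker warnings.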

Handling intermittent failures

LLM responses are stochastic, so a test that passes 99 times out of 100 will still fail in CI now and then. Two mitigations:

  • Set temperature=0 in your chatbot (or as low as your model allows) to reduce run-to-run variance.
  • Use pytest-rerunfailures to retry flaky tests:

    pip install pytest-rerunfailures
    pytest --reruns 2 --reruns-delay 5
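To see why retries help, here is a back-of-the-envelope calculation. It assumes every attempt is independent and that each test has the same 99% pass rate used as the example above. Both are simplifications, and the suite size of 50 is an illustrative figure.

```python
def pass_probability(p_pass: float, reruns: int) -> float:
    """Chance a test passes at least once in 1 + reruns independent attempts."""
    return 1.0 - (1.0 - p_pass) ** (reruns + 1)


def suite_failure_probability(p_pass: float, n_tests: int, reruns: int = 0) -> float:
    """Chance that at least one of n_tests independent tests fails the run."""
    return 1.0 - pass_probability(p_pass, reruns) ** n_tests


# 50 tests that each pass 99% of the time.
print(f"no reruns:  {suite_failure_probability(0.99, 50):.1%}")
print(f"--reruns 2: {suite_failure_probability(0.99, 50, reruns=2):.4%}")
```

Under these assumptions, a suite of 50 such tests fails somewhere in roughly four runs out of ten without retries, and almost never with two retries. Real failures are often correlated (for example, during an outage of the model provider), so treat the numbers as an upper bound on what retries can achieve.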
    

If a deterministic test fails intermittently, the cause is more likely real model drift than random flakiness. Investigate before reaching for --reruns.
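One way to start that investigation is to measure how often the check actually fails, rather than guessing from a single CI run. The sketch below repeats a check and reports the failure fraction. `failure_rate` is a hypothetical helper, and the `check` callable stands in for whatever sends one prompt to your bot and validates the reply.

```python
from typing import Callable


def failure_rate(check: Callable[[], bool], runs: int = 20) -> float:
    """Call check() `runs` times and return the fraction of failing calls."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(1 for _ in range(runs) if not check())
    return failures / runs


# Stand-in for a real check: a fixed sequence of outcomes with one failure.
outcomes = iter([True, False, True, True])
rate = failure_rate(lambda: next(outcomes), runs=4)
print(rate)  # 0.25
```

A failure rate that jumps after a model or prompt change points to drift; a low, stable rate is more consistent with ordinary sampling noise.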