What Are Experiments?

Experiments in Lemma are offline evaluations for your LLM agents. Unlike traditional unit tests, which assume deterministic behavior, LLM agents produce different outputs for the same input. You can’t simply assert expect(output).toBe("expected"). Instead, you need a structured way to run your agent against a fixed set of inputs, record the results, and compare how different approaches perform. Experiments give you that structure: define test cases once, run multiple strategies against them, and analyze the results in the dashboard.

The Problem

LLM agents are non-deterministic. The same prompt can yield different responses depending on temperature, model version, or randomness in sampling. That makes them hard to evaluate:
  • Unit tests don’t fit — You can’t reliably assert exact output strings
  • Manual testing doesn’t scale — You need to run many inputs to spot regressions or improvements
  • A/B testing in production is risky — You want to compare strategies before shipping changes
Offline experiments solve this by running your agent against a curated set of test cases in a controlled environment. You get reproducible, comparable results without affecting real users.

How Experiments Work

  1. Define test cases — Curate a set of inputs (user messages, context, etc.) that represent real scenarios you care about
  2. Run your agent — Execute each strategy (prompt, model, config) against every test case
  3. Record results — Link each trace to the experiment with a strategy name
  4. Analyze — Compare strategies side-by-side in the Lemma dashboard
Each strategy produces traces for the same inputs. That lets you see how different approaches perform on identical scenarios.
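The four steps above can be sketched in plain TypeScript. Everything here — the type shapes, the `runAgent` stub, the run-ID format — is illustrative only; in a real project, fetching test cases and recording traces would go through the Lemma SDK rather than in-memory data.

```typescript
interface TestCase { id: string; input: string }
interface Result { runId: string; testCaseId: string; strategy: string; output: string }

// Stand-in for your agent; a real implementation would call an LLM.
function runAgent(input: string, strategy: string): string {
  return `[${strategy}] response to: ${input}`;
}

// Run every strategy against every test case and collect results.
function runExperiment(testCases: TestCase[], strategies: string[]): Result[] {
  const results: Result[] = [];
  for (const strategy of strategies) {
    for (const tc of testCases) {
      results.push({
        runId: `${strategy}-${tc.id}`, // in practice, the trace ID recorded by Lemma
        testCaseId: tc.id,
        strategy,
        output: runAgent(tc.input, strategy),
      });
    }
  }
  return results;
}

const cases: TestCase[] = [
  { id: "tc-1", input: "Cancel my subscription" },
  { id: "tc-2", input: "What is my order status?" },
];
const results = runExperiment(cases, ["baseline-prompt", "concise-prompt"]);
// 2 strategies × 2 test cases = 4 results, one trace per (strategy, test case) pair.
```

Because the loops are nested over the same fixed test set, every strategy is evaluated on identical inputs.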

Key Concepts

Test Cases

Test cases are the inputs used to evaluate your agent. Each test case has:
  • Input data — The parameters passed to your agent (e.g., user message, context)
  • Test case ID — A unique identifier that links results across strategies
You define test cases in your Lemma project, then run any strategy against them. The same test case is used for every strategy, so comparisons are apples-to-apples.
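To see why stable test case IDs matter, here is a hypothetical sketch of joining two strategies' outputs on `testCaseId` so each row compares the same input. The field names and placeholder outputs are illustrative, not the actual Lemma schema.

```typescript
interface TestCase { id: string; input: { message: string } }

const testCases: TestCase[] = [
  { id: "refund-request", input: { message: "I want a refund" } },
  { id: "order-status", input: { message: "Where is my order?" } },
];

// Outputs recorded per strategy, keyed by test case ID.
const outputs: Record<string, Record<string, string>> = {
  "prompt-v1": { "refund-request": "Sure, refund issued.", "order-status": "It ships Monday." },
  "prompt-v2": { "refund-request": "Refund processed.", "order-status": "Arriving Monday." },
};

// The shared ID lets you line up outputs row by row: apples-to-apples.
const comparison = testCases.map((tc) => ({
  testCaseId: tc.id,
  "prompt-v1": outputs["prompt-v1"][tc.id],
  "prompt-v2": outputs["prompt-v2"][tc.id],
}));
```

This per-row join is essentially what the dashboard's side-by-side view is doing for you.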

Strategies

A strategy is a specific configuration or approach you’re testing. Examples:
  • Different system prompts
  • Different models (GPT-4 vs Claude)
  • Different temperature or sampling settings
  • Different agent architectures (e.g., with or without tools)
When you run an experiment, you tag each run with a strategy name. Lemma groups results by strategy so you can compare performance side-by-side.
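One convenient pattern is to define each strategy as a named configuration object, so the name you tag runs with maps directly to the settings under test. The `Strategy` shape, model names, and prompts below are assumptions for illustration, not a Lemma API.

```typescript
interface Strategy {
  name: string;         // the tag attached to each run; Lemma groups results by it
  model: string;
  systemPrompt: string;
  temperature: number;
}

const strategies: Strategy[] = [
  { name: "baseline", model: "gpt-4", systemPrompt: "You are a helpful assistant.", temperature: 0.7 },
  { name: "concise", model: "gpt-4", systemPrompt: "Answer in one sentence.", temperature: 0.7 },
  { name: "claude-swap", model: "claude-3-opus", systemPrompt: "You are a helpful assistant.", temperature: 0.7 },
];
```

Keeping each strategy to a single changed variable (prompt, model, or temperature) makes the dashboard comparison easy to interpret.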

Results

Results link your agent’s traces to the experiment. Each result contains:
  • Run ID — The ID of the trace for this execution
  • Test case ID — Which input was used
  • Strategy name — Which approach was tested
Once recorded, you can analyze them in the dashboard to see how each strategy performed on specific test cases and identify patterns in failures or edge cases.
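The grouping the dashboard performs can be sketched in a few lines. Result fields follow the three listed above; the `pass` flag is an illustrative evaluation added for the example, not part of Lemma's schema.

```typescript
interface ExperimentResult {
  runId: string;
  testCaseId: string;
  strategy: string;
  pass: boolean; // illustrative: whatever evaluation you apply to each trace
}

// Group results by strategy name and compute a per-strategy pass rate.
function passRateByStrategy(results: ExperimentResult[]): Record<string, number> {
  const grouped: Record<string, ExperimentResult[]> = {};
  for (const r of results) (grouped[r.strategy] ??= []).push(r);
  return Object.fromEntries(
    Object.entries(grouped).map(([name, rs]) => [
      name,
      rs.filter((r) => r.pass).length / rs.length,
    ]),
  );
}

const recorded: ExperimentResult[] = [
  { runId: "r1", testCaseId: "tc-1", strategy: "baseline", pass: true },
  { runId: "r2", testCaseId: "tc-2", strategy: "baseline", pass: false },
  { runId: "r3", testCaseId: "tc-1", strategy: "concise", pass: true },
  { runId: "r4", testCaseId: "tc-2", strategy: "concise", pass: true },
];
// passRateByStrategy(recorded) → { baseline: 0.5, concise: 1 }
```

Because every result also carries a `testCaseId`, you can drill past aggregates to the exact inputs where one strategy failed and another succeeded.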

When to Use Experiments

Experiments are useful whenever you want to compare agent behavior before shipping changes:
  • Prompt changes — Does a new system prompt improve accuracy or tone?
  • Model swaps — How does GPT-4 compare to Claude on your use case?
  • Architecture changes — Does adding tool use help or hurt?
  • Regression checks — Did a refactor or dependency update break anything?
  • Hyperparameter tuning — What temperature works best for your task?
Experiments run offline against a fixed test set. For live A/B testing with real users, use Lemma’s tracing and metrics to compare strategies in production.

Next Steps

  • Run your first experiment — Follow Running Experiments to fetch test cases, run your agent, and record results in one call
  • Learn the workflow — See a complete example in Running Experiments
  • Review core concepts — Concepts covers traces, metrics, and projects in more detail