What Are Experiments?

Experiments in Lemma are offline evaluations for your LLM agents. Unlike traditional unit tests, which assume deterministic behavior, LLM agents produce different outputs for the same input. You can’t simply assert expect(output).toBe("expected"). Instead, you need a structured way to run your agent against a fixed set of inputs, record the results, and compare how different approaches perform. Experiments give you that structure: define test cases once, run multiple strategies against them, and analyze the results in the dashboard.

The Problem

LLM agents are non-deterministic. The same prompt can yield different responses depending on temperature, model version, or randomness in sampling. That makes them hard to evaluate:
  • Unit tests don’t fit — You can’t reliably assert exact output strings
  • Manual testing doesn’t scale — You need to run many inputs to spot regressions or improvements
  • A/B testing in production is risky — You want to compare strategies before shipping changes
Offline experiments solve this by running your agent against a curated set of test cases in a controlled environment. You get reproducible, comparable results without affecting real users.

How Experiments Work

  1. Define test cases — Curate a set of inputs (user messages, context, etc.) that represent real scenarios you care about
  2. Run your agent — Execute each strategy (prompt, model, config) against every test case
  3. Record results — Link each trace to the experiment with a strategy name
  4. Analyze — Compare strategies side-by-side in the Lemma dashboard
Each strategy produces traces for the same inputs. That lets you see how different approaches perform on identical scenarios.
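The four steps above can be sketched in plain TypeScript. Everything here — the type shapes, the `runAgent` stub, the run-ID format — is illustrative only; in a real project, fetching test cases and recording traces would go through the Lemma SDK rather than in-memory data.

```typescript
interface TestCase { id: string; input: string }
interface Result { runId: string; testCaseId: string; strategy: string; output: string }

// Stand-in for your agent; a real implementation would call an LLM.
function runAgent(input: string, strategy: string): string {
  return `[${strategy}] response to: ${input}`;
}

// Run every strategy against every test case and collect results.
function runExperiment(testCases: TestCase[], strategies: string[]): Result[] {
  const results: Result[] = [];
  for (const strategy of strategies) {
    for (const tc of testCases) {
      results.push({
        runId: `${strategy}-${tc.id}`, // in practice, the trace ID recorded by Lemma
        testCaseId: tc.id,
        strategy,
        output: runAgent(tc.input, strategy),
      });
    }
  }
  return results;
}

const cases: TestCase[] = [
  { id: "tc-1", input: "Cancel my subscription" },
  { id: "tc-2", input: "What is my order status?" },
];
const results = runExperiment(cases, ["baseline-prompt", "concise-prompt"]);
// 2 strategies × 2 test cases = 4 results, one trace per (strategy, test case) pair.
```

Because the loops are nested over the same fixed test set, every strategy is evaluated on identical inputs.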

Key Concepts

Test Cases

Test cases are the inputs used to evaluate your agent. Each test case has:
  • Input data — The parameters passed to your agent (e.g., user message, context)
  • Test case ID — A unique identifier that links results across strategies
You define test cases in your Lemma project, then run any strategy against them. The same test case is used for every strategy, so comparisons are apples-to-apples.
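To see why stable test case IDs matter, here is a hypothetical sketch of joining two strategies' outputs on `testCaseId` so each row compares the same input. The field names and placeholder outputs are illustrative, not the actual Lemma schema.

```typescript
interface TestCase { id: string; input: { message: string } }

const testCases: TestCase[] = [
  { id: "refund-request", input: { message: "I want a refund" } },
  { id: "order-status", input: { message: "Where is my order?" } },
];

// Outputs recorded per strategy, keyed by test case ID.
const outputs: Record<string, Record<string, string>> = {
  "prompt-v1": { "refund-request": "Sure, refund issued.", "order-status": "It ships Monday." },
  "prompt-v2": { "refund-request": "Refund processed.", "order-status": "Arriving Monday." },
};

// The shared ID lets you line up outputs row by row: apples-to-apples.
const comparison = testCases.map((tc) => ({
  testCaseId: tc.id,
  "prompt-v1": outputs["prompt-v1"][tc.id],
  "prompt-v2": outputs["prompt-v2"][tc.id],
}));
```

This per-row join is essentially what the dashboard's side-by-side view is doing for you.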

Strategies

A strategy is a specific configuration or approach you’re testing. Examples:
  • Different system prompts
  • Different models (GPT-4 vs Claude)
  • Different temperature or sampling settings
  • Different agent architectures (e.g., with or without tools)
When you run an experiment, you tag each run with a strategy name. Lemma groups results by strategy so you can compare performance side-by-side.
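One convenient pattern is to define each strategy as a named configuration object, so the name you tag runs with maps directly to the settings under test. The `Strategy` shape, model names, and prompts below are assumptions for illustration, not a Lemma API.

```typescript
interface Strategy {
  name: string;         // the tag attached to each run; Lemma groups results by it
  model: string;
  systemPrompt: string;
  temperature: number;
}

const strategies: Strategy[] = [
  { name: "baseline", model: "gpt-4", systemPrompt: "You are a helpful assistant.", temperature: 0.7 },
  { name: "concise", model: "gpt-4", systemPrompt: "Answer in one sentence.", temperature: 0.7 },
  { name: "claude-swap", model: "claude-3-opus", systemPrompt: "You are a helpful assistant.", temperature: 0.7 },
];
```

Keeping each strategy to a single changed variable (prompt, model, or temperature) makes the dashboard comparison easy to interpret.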

Results

Results link your agent’s traces to the experiment. Each result contains:
  • Run ID — The ID of the trace for this execution
  • Test case ID — Which input was used
  • Strategy name — Which approach was tested
Once recorded, you can analyze them in the dashboard to see how each strategy performed on specific test cases and identify patterns in failures or edge cases.
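The grouping the dashboard performs can be sketched in a few lines. Result fields follow the three listed above; the `pass` flag is an illustrative evaluation added for the example, not part of Lemma's schema.

```typescript
interface ExperimentResult {
  runId: string;
  testCaseId: string;
  strategy: string;
  pass: boolean; // illustrative: whatever evaluation you apply to each trace
}

// Group results by strategy name and compute a per-strategy pass rate.
function passRateByStrategy(results: ExperimentResult[]): Record<string, number> {
  const grouped: Record<string, ExperimentResult[]> = {};
  for (const r of results) (grouped[r.strategy] ??= []).push(r);
  return Object.fromEntries(
    Object.entries(grouped).map(([name, rs]) => [
      name,
      rs.filter((r) => r.pass).length / rs.length,
    ]),
  );
}

const recorded: ExperimentResult[] = [
  { runId: "r1", testCaseId: "tc-1", strategy: "baseline", pass: true },
  { runId: "r2", testCaseId: "tc-2", strategy: "baseline", pass: false },
  { runId: "r3", testCaseId: "tc-1", strategy: "concise", pass: true },
  { runId: "r4", testCaseId: "tc-2", strategy: "concise", pass: true },
];
// passRateByStrategy(recorded) → { baseline: 0.5, concise: 1 }
```

Because every result also carries a `testCaseId`, you can drill past aggregates to the exact inputs where one strategy failed and another succeeded.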

When to Use Experiments

Experiments are useful whenever you want to compare agent behavior before shipping changes:
  • Prompt changes — Does a new system prompt improve accuracy or tone?
  • Model swaps — How does GPT-4 compare to Claude on your use case?
  • Architecture changes — Does adding tool use help or hurt?
  • Regression checks — Did a refactor or dependency update break anything?
  • Hyperparameter tuning — What temperature works best for your task?
Experiments run offline against a fixed test set. For live A/B testing with real users, use Lemma’s tracing and metrics to compare strategies in production.

Next Steps

  • Run your first experiment — Follow Running Experiments to fetch test cases, run your agent, and record results in one call
  • Learn the workflow — See a complete example in Running Experiments
  • Review core concepts — Concepts covers traces, metrics, and projects in more detail