What Are Experiments?
Experiments in Lemma are offline evaluations for your LLM agents. Unlike traditional unit tests, which assume deterministic behavior, LLM agents produce different outputs for the same input. You can’t simply assert expect(output).toBe("expected"). Instead, you need a structured way to run your agent against a fixed set of inputs, record the results, and compare how different approaches perform.
Experiments give you that structure: define test cases once, run multiple strategies against them, and analyze the results in the dashboard.
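To make the non-determinism problem concrete, here is a minimal sketch. The `flakyAgent` function is a stand-in for a real model call with sampling randomness; it is not part of the Lemma SDK.

```typescript
// Stand-in for a real LLM call: the same prompt yields different outputs.
function flakyAgent(prompt: string): string {
  const variants = ["Sure, done!", "Done.", "All set, done!"];
  return variants[Math.floor(Math.random() * variants.length)];
}

// An exact-match assertion like expect(output).toBe("Done.") would pass or
// fail at random. A structural check over a fixed input set is stable:
const outputs = Array.from({ length: 10 }, () => flakyAgent("reset my password"));
const allAcknowledge = outputs.every((o) => o.toLowerCase().includes("done"));
```

Experiments generalize this idea: instead of one ad-hoc structural check, you run every strategy over the same fixed inputs and compare the recorded results.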
The Problem
LLM agents are non-deterministic. The same prompt can yield different responses depending on temperature, model version, or randomness in sampling. That makes them hard to evaluate:
- Unit tests don’t fit — You can’t reliably assert exact output strings
- Manual testing doesn’t scale — You need to run many inputs to spot regressions or improvements
- A/B testing in production is risky — You want to compare strategies before shipping changes
How Experiments Work
- Define test cases — Curate a set of inputs (user messages, context, etc.) that represent real scenarios you care about
- Run your agent — Execute each strategy (prompt, model, config) against every test case
- Record results — Link each trace to the experiment with a strategy name
- Analyze — Compare strategies side-by-side in the Lemma dashboard
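The four steps above can be sketched as a loop. This is an illustrative outline, not the real Lemma SDK: the `runAgent` stub and the inline `recordResult`-style bookkeeping stand in for your agent code and the SDK calls that link each trace to the experiment.

```typescript
type TestCase = { id: string; input: { message: string } };
type ExperimentResult = { runId: string; testCaseId: string; strategy: string; output: string };

// Step 1: a fixed set of inputs representing real scenarios.
const testCases: TestCase[] = [
  { id: "tc-1", input: { message: "Reset my password" } },
  { id: "tc-2", input: { message: "Cancel my subscription" } },
];

// Stand-in for your agent; each strategy maps to a different configuration.
function runAgent(strategy: string, input: { message: string }): string {
  return `[${strategy}] reply to: ${input.message}`;
}

const strategies = ["baseline-prompt", "concise-prompt"];
const results: ExperimentResult[] = [];

// Steps 2–3: run every strategy against every test case and record each result.
for (const strategy of strategies) {
  for (const tc of testCases) {
    const output = runAgent(strategy, tc.input);
    // In a real run, the run ID would be the trace produced by this execution.
    results.push({ runId: `${strategy}-${tc.id}`, testCaseId: tc.id, strategy, output });
  }
}
// Every (strategy, test case) pair yields exactly one result: 2 × 2 = 4.
```

Step 4, analysis, happens in the Lemma dashboard, which can line results up side by side because each one carries both a test case ID and a strategy name.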
Key Concepts
Test Cases
Test cases are the inputs used to evaluate your agent. Each test case has:
- Input data — The parameters passed to your agent (e.g., user message, context)
- Test case ID — A unique identifier that links results across strategies
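In code, a test case might look like the following. The field names here are illustrative assumptions, not the exact shape the Lemma SDK returns.

```typescript
// Hypothetical shape of a test case; field names are illustrative.
type TestCase = {
  id: string; // unique identifier that links results across strategies
  input: { message: string; context?: string }; // parameters passed to your agent
};

const testCase: TestCase = {
  id: "tc-refund-1",
  input: { message: "I want a refund for order #123" },
};
```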
Strategies
A strategy is a specific configuration or approach you’re testing. Examples:
- Different system prompts
- Different models (GPT-4 vs Claude)
- Different temperature or sampling settings
- Different agent architectures (e.g., with or without tools)
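One convenient way to express these variations is as plain config objects, so the experiment loop can iterate over them. This shape is a sketch, not a Lemma API.

```typescript
// Strategies as plain config objects (illustrative, not the Lemma SDK).
type Strategy = {
  name: string; // the strategy name recorded with each result
  model: string;
  temperature: number;
  systemPrompt: string;
  tools?: string[]; // present only for tool-using architectures
};

const strategies: Strategy[] = [
  {
    name: "gpt4-baseline",
    model: "gpt-4",
    temperature: 0.7,
    systemPrompt: "You are a helpful support agent.",
  },
  {
    name: "claude-low-temp-tools",
    model: "claude-3-opus",
    temperature: 0.2,
    systemPrompt: "You are a concise support agent.",
    tools: ["search"],
  },
];
```

Keeping every varied parameter inside one object per strategy makes runs reproducible: the strategy name recorded with each result maps back to exactly one configuration.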
Results
Results link your agent’s traces to the experiment. Each result contains:
- Run ID — The trace for this execution
- Test case ID — Which input was used
- Strategy name — Which approach was tested
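The shared test case ID is what enables side-by-side comparison: grouping results by it yields one row per strategy for each input. A minimal sketch, with made-up run IDs:

```typescript
type ExperimentResult = { runId: string; testCaseId: string; strategy: string };

const results: ExperimentResult[] = [
  { runId: "run-1", testCaseId: "tc-1", strategy: "baseline" },
  { runId: "run-2", testCaseId: "tc-1", strategy: "variant" },
  { runId: "run-3", testCaseId: "tc-2", strategy: "baseline" },
  { runId: "run-4", testCaseId: "tc-2", strategy: "variant" },
];

// Group by test case ID: this is the join the dashboard performs to line
// strategies up side by side for the same input.
const byTestCase = new Map<string, ExperimentResult[]>();
for (const r of results) {
  const group = byTestCase.get(r.testCaseId) ?? [];
  group.push(r);
  byTestCase.set(r.testCaseId, group);
}
// byTestCase.get("tc-1") now holds one result per strategy for that input.
```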
When to Use Experiments
Experiments are useful whenever you want to compare agent behavior before shipping changes:
- Prompt changes — Does a new system prompt improve accuracy or tone?
- Model swaps — How does GPT-4 compare to Claude on your use case?
- Architecture changes — Does adding tool use help or hurt?
- Regression checks — Did a refactor or dependency update break anything?
- Hyperparameter tuning — What temperature works best for your task?
Experiments run offline against a fixed test set. For live A/B testing with real users, use Lemma’s tracing and metrics to compare strategies in production.
Next Steps
- Run your first experiment — Follow Running Experiments to fetch test cases, run your agent, and record results in one call
- Learn the workflow — See a complete example in Running Experiments
- Review core concepts — Concepts covers traces, metrics, and projects in more detail

