Experiments let you systematically evaluate your agent by running multiple strategies against a fixed set of test cases. Whether you're comparing prompt variations, model choices, or architectural changes, experiments provide a structured framework for measuring what actually works.
Each experiment contains test cases (the inputs to evaluate) and results (traces linked to strategies). This structure makes it easy to compare how different approaches perform on identical inputs.
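For orientation, the sketch below shows roughly how those pieces relate. The field names (`id`, `inputData`, `runId`, `testCaseId`, `strategyName`) are taken from the examples later in this guide; the actual API responses may carry additional metadata.

// Illustrative shapes only; field names mirror the snippets in this guide,
// and real responses may include extra fields.
interface TestCase {
  id: string;
  inputData: Record<string, any>; // the input your agent receives
}

interface ExperimentResult {
  runId: string;        // the traced agent run
  testCaseId?: string;  // which test case produced it
}

interface RecordResultsRequest {
  strategyName: string; // e.g. "concise" or "detailed"
  results: ExperimentResult[];
}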
Prerequisites
Before running experiments, you need:
- An experiment created in your Lemma project (find the experiment ID in your dashboard)
- Tracing set up in your agent (see Tracing Your Agent)
- Your API key from your project settings (the snippets in this guide read it from the LEMMA_API_KEY environment variable)
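A small guard at the top of your experiment script surfaces a missing key early:

// Fail fast if the API key is missing; the examples below read it from process.env.
if (!process.env.LEMMA_API_KEY) {
  throw new Error("LEMMA_API_KEY is not set");
}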
Workflow Overview
A typical experiment workflow looks like this:
- Get test cases — Fetch the inputs defined for your experiment
- Run your agent — Execute each strategy against the test cases
- Record results — Link each trace to the experiment with its strategy name
- Analyze — Compare performance across strategies in the dashboard
Get Test Cases
Retrieve all test cases for an experiment to iterate over them:
async function getTestCases(experimentId: string, projectId: string) {
  const response = await fetch(
    `https://api.uselemma.ai/experiments/${experimentId}/test-cases?project_id=${projectId}`,
    {
      headers: {
        Authorization: `Bearer ${process.env.LEMMA_API_KEY}`,
      },
    }
  );

  if (!response.ok) {
    throw new Error(`Failed to get test cases: ${response.statusText}`);
  }

  return response.json();
}
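As a quick sanity check, you can call the helper directly. The snippet below assumes the response is a JSON array where each entry exposes `id` and `inputData`, which is how the later examples consume it:

// Quick check (run inside an async context); the IDs are placeholders for your own.
const testCases = await getTestCases("your-experiment-id", "your-project-id");
console.log(`Fetched ${testCases.length} test cases`);
// Each entry is expected to expose `id` and `inputData`, as used below.
console.log(testCases[0]?.id, testCases[0]?.inputData);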
Run Your Agent and Record Results
For each strategy you want to test, run your agent against all test cases and record the results:
import { wrapAgent } from "@uselemma/tracing";
import { tracerProvider } from "./tracer"; // Your tracer setup

async function runExperiment(
  experimentId: string,
  projectId: string,
  strategyName: string,
  runAgent: (input: Record<string, any>) => Promise<{ result: any; runId: string }>
) {
  // 1. Get test cases
  const testCases = await getTestCases(experimentId, projectId);

  // 2. Run agent on each test case and collect results
  const results = [];
  for (const testCase of testCases) {
    const { result, runId } = await runAgent(testCase.inputData);
    results.push({
      runId,
      testCaseId: testCase.id,
    });
  }

  await tracerProvider.forceFlush(); // ensure all spans are sent to Lemma

  // 3. Record all results for this strategy
  await recordResults(experimentId, projectId, strategyName, results);
}
Calling tracerProvider.forceFlush() ensures all spans are sent to Lemma before recording results. This is important because the RunBatchSpanProcessor batches all spans for an agent run and exports them together when the run ends.
Record Results
Link traces to your experiment with a strategy name:
async function recordResults(
  experimentId: string,
  projectId: string,
  strategyName: string,
  results: Array<{ runId: string; testCaseId?: string }>
) {
  const response = await fetch(
    `https://api.uselemma.ai/experiments/${experimentId}/results?project_id=${projectId}`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.LEMMA_API_KEY}`,
      },
      body: JSON.stringify({
        strategyName,
        results,
      }),
    }
  );

  if (!response.ok) {
    throw new Error(`Failed to record results: ${response.statusText}`);
  }

  return response.json();
}
Including testCaseId lets you compare how different strategies performed on the exact same input in the dashboard.
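For example, once the "concise" strategy has finished, its results might be recorded like this (the IDs shown are placeholders; use the run IDs returned by your traced agent runs):

// Placeholder IDs for illustration only.
await recordResults("your-experiment-id", "your-project-id", "concise", [
  { runId: "run-abc123", testCaseId: "tc-001" },
  { runId: "run-def456", testCaseId: "tc-002" },
]);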
Example: Comparing Prompt Strategies
Here’s a complete example comparing two prompt strategies:
import { wrapAgent } from "@uselemma/tracing";
import { tracerProvider } from "./tracer";

const EXPERIMENT_ID = "your-experiment-id";
const PROJECT_ID = "your-project-id";

// Define your strategies
const strategies = {
  concise: {
    systemPrompt: "You are a helpful assistant. Be brief and direct.",
  },
  detailed: {
    systemPrompt:
      "You are a helpful assistant. Provide thorough explanations with examples when relevant.",
  },
};

// Agent runner for a specific strategy
function createAgentRunner(strategyConfig: { systemPrompt: string }) {
  return async (input: Record<string, any>) => {
    const wrappedFn = wrapAgent(
      "support-agent",
      async ({ onComplete }, agentInput) => {
        // callLLM is a placeholder for your own model call
        const result = await callLLM(strategyConfig.systemPrompt, agentInput.query);
        onComplete(result);
        return result;
      },
      { isExperiment: true }
    );

    const { result, runId } = await wrappedFn(input);
    return { result, runId };
  };
}

// Run experiment for each strategy
async function main() {
  for (const [strategyName, config] of Object.entries(strategies)) {
    console.log(`Running strategy: ${strategyName}`);
    const agentRunner = createAgentRunner(config);
    await runExperiment(EXPERIMENT_ID, PROJECT_ID, strategyName, agentRunner);
  }
}

main().catch(console.error);
Experiment Mode
Instead of passing isExperiment: true to each wrapAgent call, you can enable experiment mode globally. When enabled, all wrapAgent calls are automatically tagged as experiment runs:
import { enableExperimentMode, disableExperimentMode } from "@uselemma/tracing";

enableExperimentMode();

// All agent runs in this block are tagged as experiments
for (const testCase of testCases) {
  await runAgent(testCase.inputData);
}

disableExperimentMode();
This is useful in experiment scripts where every agent run should be tagged as an experiment.
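If the script can fail partway through, a try/finally block keeps experiment mode from staying enabled after an error:

enableExperimentMode();
try {
  for (const testCase of testCases) {
    await runAgent(testCase.inputData);
  }
} finally {
  // Always restore normal tracing, even if a run throws.
  disableExperimentMode();
}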
Viewing Results
Once you’ve recorded results, head to your experiment in the Lemma dashboard to:
- Compare strategies side-by-side — See how each approach performed on the same inputs
- Analyze traces — Drill into individual executions to understand behavior differences
- Track metrics — If your experiment has an associated metric, view aggregated feedback per strategy
- Identify patterns — Find which inputs cause problems for certain strategies