Skip to main content

Benchmarks API

Cross-pattern benchmark harness. The engine and convenience layer live in @flow-state-dev/testing; the pattern-agnostic contract types live in @flow-state-dev/core. For concepts and worked examples, see Benchmarking patterns.

Engine

runBenchmark(config)

Runs a sweep over explicit subjects and returns a BenchmarkReport.

function runBenchmark(config: RunBenchmarkConfig): Promise<BenchmarkReport>;

RunBenchmarkConfig:

FieldTypeDefaultDescription
subjectsBenchmarkSubject[]Subjects (patterns + optional baseline) to compare
tasksBenchmarkTask[]Tasks every subject is run against
modelstringExecutor model id used by all subjects
judgeModelstringmodelDistinct judge model. Defaults to model with a self-preference warning
scorersScorer<unknown>[]Extra deterministic code scorers applied alongside the rubric judge
runsnumber3Repetitions per (subject, task)
concurrencynumber3Concurrent (subject, task, run) cells
maxCostUsdnumberAbort and return a partial report when accumulated cost exceeds this
modelResolverModelResolverfrom modelExecutor resolver override (tests inject a mock)
judgeResolverModelResolverfrom judgeModelJudge resolver override
signalAbortSignalCancels in-flight scheduling; produces a partial report

Convenience layer

comparePatterns(registry, names, config)

Resolves names against registry, appends the single-generator baseline (unless baseline: false), and runs. Throws a clear error naming the available patterns when a name is missing.

function comparePatterns(
registry: BenchmarkRegistry,
names: string[],
config: ComparePatternsConfig,
): Promise<BenchmarkReport>;

type ComparePatternsConfig = Omit<RunBenchmarkConfig, "subjects"> & {
baseline?: boolean; // default true
};

The registry is the first argument so @flow-state-dev/testing never imports @flow-state-dev/patterns.

baselineSubject(options)

Builds the single-generator control: one generator that answers the task prompt directly, with no coordination. Deltas in the scorecard are measured against this subject.

function baselineSubject(options: {
model: string;
name?: string; // default "single-generator"
}): BenchmarkSubject;

defineBenchmark(def)

Validates and returns a benchmark definition (the CLI/registry discovery shape). Identity at runtime; throws if the task suite is empty.

function defineBenchmark(def: BenchmarkDefinition): BenchmarkDefinition;

BenchmarkDefinition:

FieldTypeDescription
namestringBenchmark name
subjectsBenchmarkSubject[]Explicit subjects, when not resolving via a registry
patternsstring[]Pattern names resolved against registry
registryBenchmarkRegistryRegistry used to resolve patterns into subjects
baselinebooleanAppend a baseline subject (default true at run time)
tasksBenchmarkTask[]Tasks to run. Must be non-empty
modelstringExecutor model id
judgeModelstringDistinct judge model id
scorersScorer<unknown>[]Extra deterministic code scorers
runsnumberRepetitions per (subject, task)

Report

BenchmarkReport

interface BenchmarkReport {
model: string;
judgeModel?: string;
runs: number;
subjects: string[];
categories: BenchmarkCategory[];
stats: SubjectCategoryStat[];
rankings: Record<string, BenchmarkRanking[]>;
totalCostUsd: number;
budgetExceeded: boolean;
warnings: string[];
timing: { totalMs: number };
}

rankings is keyed by category (and "overall"), each value a list of subjects sorted by mean descending:

interface BenchmarkRanking {
subject: string;
mean: number; // mean judge score (0-1)
deltaVsBaseline: number; // mean minus the baseline subject's mean (0 when no baseline)
credible: boolean; // delta clears ~2x the standard error of the difference of means (needs >=2 runs)
}

stats carries per (subject × category) and per (subject × "overall") aggregates:

interface SubjectCategoryStat {
subject: string;
category: BenchmarkCategory | "overall";
mean: number;
stddev: number; // population standard deviation
passRate: number;
runs: number; // cells scheduled
successfulRuns: number; // cells that completed without error
costUsd: number;
meanLatencyMs: number;
}

buildBenchmarkReport(runs, meta)

Folds per-cell results into a BenchmarkReport. Pure (no I/O), so it unit-tests against synthetic results without any LLM.

function buildBenchmarkReport(
runs: BenchmarkRunResult[],
meta: BuildBenchmarkReportMeta,
): BenchmarkReport;

renderScorecard(report, format)

Renders a report as plain aligned text, a markdown table, or pretty-printed JSON. The markdown format marks the baseline row and is meant for pasting into a doc or PR.

function renderScorecard(
report: BenchmarkReport,
format: "table" | "markdown" | "json",
): string;

estimateCostUsd

Best-effort USD cost estimate used internally to enforce maxCostUsd.

Contract types

These live in @flow-state-dev/core so both the engine (@flow-state-dev/testing) and pattern adapters (@flow-state-dev/patterns) can import them.

BenchmarkTask

interface BenchmarkTask {
id: string;
category: BenchmarkCategory;
prompt: string;
rubric: string[]; // locked, published criteria the judge scores against
expected?: unknown; // optional reference passed to code scorers
metadata?: Record<string, unknown>;
}

type BenchmarkCategory =
| "reasoning"
| "multi-step-research"
| "critique-revision"
| "tool-use";

BenchmarkSubject

interface BenchmarkSubject {
name: string;
kind: "pattern" | "baseline";
sequencer: SequencerDefinition<any, any>;
mapTask: (task: BenchmarkTask) => unknown; // map the generic task onto this subject's input
}

BenchmarkAdapter and BenchmarkRegistry

A pattern's adapter names the pattern and builds a subject from shared options. A registry is a lookup of pattern name to adapter, resolved by comparePatterns.

interface BenchmarkAdapterOptions {
model: string; // executor model every generator in the materialized pattern uses
uses?: UsesSlot; // capabilities forwarded into the pattern's internal generators
}

interface BenchmarkAdapter {
patternName: string;
build: (opts: BenchmarkAdapterOptions) => BenchmarkSubject;
}

type BenchmarkRegistry = Record<string, BenchmarkAdapter>;