Know when your agents break before your users do

Automated quality baselines for AI agent pipelines. Run your agents against structured evaluation harnesses, catch regressions across deploys, and ship with confidence.

$ assayline run --suite onboarding --cohort 005

Running 12 evaluation scenarios...

PASS phase_ordering .............. 12/12
PASS tool_call_correctness ....... 11/12
PASS output_completeness ......... 12/12
FAIL state_transition_validity ... 10/12
PASS response_quality ............ 11/12

Score: 93.3% (baseline: 91.7% | delta: +1.6%)

2 regressions detected in state_transition_validity
5 Eval Dimensions
<2min Full Suite Run
CI/CD Native Integration
0 Config Required
How it works

Three layers between your agent and production regressions

01 / HARNESS

Define your baseline

Capture a golden run of your agent pipeline. Every tool call, state transition, and output becomes your quality reference point.
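As a sketch, capturing that golden run could look something like the command below. The baseline subcommand and its output are illustrative assumptions, not the documented interface; only assayline run appears in the demo above.

$ assayline baseline capture --suite onboarding

Capturing golden run across 12 scenarios...
Baseline saved for suite onboarding (score: 91.7%)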

02 / EVALUATE

Run cohort suites

Execute your agent against structured scenarios. Score phase ordering, tool correctness, output completeness, and reasoning quality.
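For illustration, a structured scenario might declare the expected phases, tool calls, and outputs the agent is scored against. The file path, field names, and phase labels below are assumptions, not a documented format; only the dimension names come from the run above.

$ cat suites/onboarding/scenario_01.yaml
# Expected phase ordering for this scenario
phases: [intake, verify, provision, confirm]
# Calls checked for tool_call_correctness
expected_tools: [create_account, send_welcome_email]
# Fields checked for output_completeness
required_outputs: [account_id, welcome_sent]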

03 / GATE

Block regressions

Set thresholds per dimension. If a deploy degrades quality below baseline, it doesn't ship. Automatic, no human review needed.
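In CI, the gate can be a single command whose exit code fails the job. The --gate flag shown here is an assumption for illustration, not a documented option:

$ assayline run --suite onboarding --cohort 005 --gate
# Exits non-zero when any dimension scores below its baseline threshold,
# so the pipeline stops and the regression never reaches production.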

Agent quality is invisible until it isn't

Quality you can measure, regressions you can prevent

AssayLine turns agent evaluation from a manual spot-check into a continuous, automated quality system that runs on every change.