Agent skill
agent-comparison
A/B test agent variants for quality and token cost.
Install this agent skill to your Project
npx add-skill https://github.com/notque/claude-code-toolkit/tree/main/skills/agent-comparison
SKILL.md
Agent Comparison Skill
Compare agent variants through controlled A/B benchmarks. Runs identical tasks on both agents, grades output quality with domain-specific checklists, and reports total session token cost to a working solution. This skill is exclusively for agent variant comparison — use agent-evaluation for single-agent assessment, and skill-eval for skill testing.
Instructions
Phase 1: PREPARE
Goal: Create benchmark environment and validate both agent variants exist.
Read and follow the repository CLAUDE.md before starting any execution.
Step 1: Analyze original agent
# Count original agent size
wc -l agents/{original-agent}.md
# Identify major sections
grep "^## " agents/{original-agent}.md
# Count code examples (candidates for removal in compact version)
grep -c '```' agents/{original-agent}.md
Step 2: Create or validate compact variant
If creating a compact variant, preserve:
- YAML frontmatter (name, description, routing)
- Core patterns and principles
- Error handling philosophy
Remove or condense:
- Lengthy code examples (keep 1-2 representative per pattern)
- Verbose explanations (condense to bullet points)
- Redundant instructions and changelogs
Target 10-15% of original size while keeping essential knowledge. Remove redundancy, not capability — stripping error handling patterns or concurrency guidance creates an unfair comparison because the compact agent is missing essential knowledge rather than expressing it concisely.
Step 3: Validate compact variant structure
# Verify YAML frontmatter
head -20 agents/{compact-agent}.md | grep -E "^(name|description):"
# Compare sizes
echo "Original: $(wc -l < agents/{original-agent}.md) lines"
echo "Compact: $(wc -l < agents/{compact-agent}.md) lines"
Step 4: Create benchmark directory and prepare prompts
mkdir -p benchmark/{task-name}/{full,compact}
Write the task prompt ONCE, then copy it for both agents. Both agents must receive the exact same task description, character-for-character, because different requirements produce different solutions and invalidate all measurements.
Keep benchmark scripts simple — no speculative features or configurable frameworks that were not requested.
Gate: Both agent variants exist with valid YAML frontmatter. Benchmark directories created. Identical task prompts written. Proceed only when gate passes.
Phase 2: BENCHMARK
Goal: Run identical tasks on both agents, capturing all metrics.
Step 1: Run simple task benchmark (2-3 tasks)
Use algorithmic problems with clear specifications (e.g., Advent of Code Day 1-6). Simple tasks establish a baseline — if an agent fails here, it has fundamental issues. Running multiple simple tasks is necessary because a single data point is sensitive to task selection bias and cannot distinguish luck from systematic quality.
Spawn both agents in parallel using Task tool. Each agent runs in a separate session to avoid contamination:
Task(
prompt="[exact task prompt]\nSave to: benchmark/{task}/full/",
subagent_type="{full-agent}"
)
Task(
prompt="[exact task prompt]\nSave to: benchmark/{task}/compact/",
subagent_type="{compact-agent}"
)
Run in parallel to avoid caching effects or system load variance skewing results.
Step 2: Run complex task benchmark (1-2 tasks)
Use production-style problems that require concurrency, error handling, edge case anticipation — these are where quality differences emerge because simple tasks mask differences in edge case handling. See references/benchmark-tasks.md for standard tasks.
Recommended complex tasks:
- Worker Pool: Rate limiting, graceful shutdown, panic recovery
- LRU Cache with TTL: Generics, background goroutines, zero-value semantics
- HTTP Service: Middleware chains, structured errors, health checks
Step 3: Capture metrics for each run
Record immediately after each agent completes — record immediately after each agent completes, because delayed recording loses precision. Track input/output token counts per turn where visible, since total session cost (not just prompt size) is what matters.
| Metric | Full Agent | Compact Agent |
|---|---|---|
| Tests pass | X/X | X/X |
| Race conditions | X | X |
| Code lines (main) | X | X |
| Test lines | X | X |
| Session tokens | X | X |
| Wall-clock time | Xm Xs | Xm Xs |
| Retry cycles | X | X |
Step 4: Run tests with race detector
cd benchmark/{task-name}/full && go test -race -v -count=1
cd benchmark/{task-name}/compact && go test -race -v -count=1
Use -count=1 to disable test caching. All generated code must pass the same test suite with the -race flag because race conditions are automatic quality failures. Record them but record them as findings for the agent being tested.
Gate: Both agents completed all tasks. Metrics captured for every run. Test output saved. Proceed only when gate passes.
Phase 3: GRADE
Goal: Score code quality beyond pass/fail using domain-specific checklists.
Step 1: Create quality checklist BEFORE reviewing code
Define criteria before seeing results to prevent bias — inventing criteria after seeing one agent's output skews the comparison. See references/grading-rubric.md for standard rubrics.
| Criterion | 5/5 | 3/5 | 1/5 |
|---|---|---|---|
| Correctness | All tests pass, no race conditions | Some failures | Broken |
| Error Handling | Comprehensive, production-ready | Adequate | None |
| Idioms | Exemplary for the language | Acceptable | Anti-patterns |
| Documentation | Thorough | Adequate | None |
| Testing | Comprehensive coverage | Basic | Minimal |
Step 2: Score each solution independently
Grade each agent's code on all five criteria. Score one agent completely before starting the other. Report facts and show command output rather than describing it — every claim must be backed by measurable data (tokens, test counts, quality scores).
## {Agent} Solution - {Task}
| Criterion | Score | Notes |
|-----------|-------|-------|
| Correctness | X/5 | |
| Error Handling | X/5 | |
| Idioms | X/5 | |
| Documentation | X/5 | |
| Testing | X/5 | |
| **Total** | **X/25** | |
Step 3: Document specific bugs with production impact
For each bug found, record:
### Bug: {description}
- Agent: {which agent}
- What happened: {behavior}
- Correct behavior: {expected}
- Production impact: {consequence}
- Test coverage: {did tests catch it? why not?}
"Tests pass" is necessary but not sufficient — production bugs often pass tests. Clear() returning nothing passes if no test checks the return value. TTL=0 bugs pass if no test uses zero TTL. Apply the domain-specific quality checklist rather than relying only on test pass rates, because tests can miss goroutine leaks, wrong semantics, and other production issues.
Step 4: Calculate effective cost
effective_cost = total_tokens * (1 + bug_count * 0.25)
An agent using 194k tokens with 0 bugs has better economics than one using 119k tokens with 5 bugs requiring fixes. The metric that matters is total cost to working, production-quality solution — not prompt size, because prompt is a one-time cost while reasoning tokens dominate sessions. Check quality scores before claiming token savings, since savings that come from cutting corners are not real savings.
Gate: Both solutions graded with evidence. Specific bugs documented with production impact. Effective cost calculated. Proceed only when gate passes.
Phase 4: REPORT
Goal: Generate comparison report with evidence-backed verdict.
Step 1: Generate comparison report
Use the report template from references/report-template.md. Include:
- Executive summary with clear winner per metric
- Per-task results with metrics tables
- Token economics analysis (one-time prompt cost vs session cost)
- Specific bugs found and their production impact
- Verdict based on total evidence
Generate a side-by-side comparison report with a clear verdict.
Step 2: Run comparison analysis
# TODO: scripts/compare.py not yet implemented
# Manual alternative: compare benchmark outputs side-by-side
diff benchmark/{task-name}/full/ benchmark/{task-name}/compact/
Step 3: Analyze token economics
The key economic insight: agent prompts are a one-time cost per session. Everything after — reasoning, code generation, debugging, retries — costs tokens on every turn. When a micro agent produces correct code, it uses approximately the same total tokens. The savings appear only when it cuts corners.
| Pattern | Description |
|---|---|
| Large agent, low churn | High initial cost, fewer retries, less debugging |
| Small agent, high churn | Low initial cost, more retries, more debugging |
Our data showed a 57-line agent used 69.5k tokens vs 69.6k for a 3,529-line agent on the same correct solution — prompt size alone does not determine cost.
Step 4: State verdict with evidence
The verdict must be backed by data — back the verdict with quality and cost data. Include:
- Which agent won on simple tasks (expected: equivalent)
- Which agent won on complex tasks (expected: full agent)
- Total session cost comparison
- Effective cost comparison (with bug penalty)
- Clear recommendation for when to use each variant
See references/methodology.md for the complete testing methodology with December 2024 data.
Step 5: Clean up
Remove temporary benchmark files and debug outputs. Keep only the comparison report and generated code.
Gate: Report generated with all metrics. Verdict stated with evidence. Report saved to benchmark directory.
Phase 5: OPTIMIZE (optional — invoked explicitly)
Goal: Run an automated optimization loop that improves a markdown target's frontmatter description using trigger-rate eval tasks, then selects the best measured variants through beam search or single-path search.
This phase is for routing/trigger optimization, not full code-generation benchmarking. Invoke it when the user says "optimize this skill", "optimize the description", or "run autoresearch". The existing manual A/B comparison (Phases 1-4) remains the path for full agent benchmarking.
Step 1: Validate optimization target and goal
Confirm the target file exists, has YAML frontmatter with a description, and the optimization goal is clear:
# Target must be a markdown file with frontmatter description
test -f skills/{target}/SKILL.md
rg -n '^description:' skills/{target}/SKILL.md
# Goal should be specific and measurable
# Good: "improve error handling instructions"
# Bad: "make it better"
Step 2: Prepare trigger-rate eval tasks
python3 skills/agent-comparison/scripts/optimize_loop.py \
--target skills/{target}/SKILL.md \
--goal "{optimization goal}" \
--benchmark-tasks skills/agent-comparison/references/optimization-tasks.example.json \
--train-split 0.6 \
--verbose
Supported task schemas:
- Flat
taskslist with optional"split": "train" | "test"per task - Top-level
trainandtestarrays
Every task must include:
query: the routing prompt to testshould_trigger: whether the target should trigger for that prompt
If no split markers are present, the loop does a reproducible random split with seed 42.
Step 3: Run baseline evaluation
The loop automatically evaluates the unmodified target against the train set before starting iteration. This establishes the score to beat, and records a held-out baseline if test tasks exist.
Step 4: Enter optimization loop
The optimize_loop.py script handles the full loop:
- Calls
generate_variant.pyto propose a new frontmatterdescriptionthroughclaude -p - Evaluates each variant against train tasks
- Runs either:
- single-path hill climbing:
--beam-width 1 --candidates-per-parent 1 - beam search with top-K retention: keep the best
Kimproving candidates each round
- single-path hill climbing:
- Accepts variants that beat their parent by more than
--min-gain(default 0.02) - Rejects variants that don't improve or break hard gates
- Checks held-out test set every
--holdout-check-cadencerounds for Goodhart divergence - Stops on convergence (
--revert-streak-limitrounds without any ACCEPT), Goodhart alarm, or max iterations
python3 skills/agent-comparison/scripts/optimize_loop.py \
--target skills/{target}/SKILL.md \
--goal "{optimization goal}" \
--benchmark-tasks skills/agent-comparison/references/optimization-tasks.example.json \
--max-iterations 20 \
--min-gain 0.02 \
--train-split 0.6 \
--beam-width 3 \
--candidates-per-parent 2 \
--revert-streak-limit 8 \
--holdout-check-cadence 5 \
--report optimization-report.html \
--output-dir evals/iterations \
--verbose
Omit --model to use Claude Code's configured default model, or pass it explicitly if you need a specific override.
The --report flag generates a live HTML dashboard that auto-refreshes every 10 seconds, showing a convergence chart, iteration table, and review/export controls.
Recommended modes:
- Short default optimization: default flags only
- Fast single-path optimization:
--beam-width 1 --candidates-per-parent 1 --max-iterations 3 --revert-streak-limit 3 - True autoresearch sweep:
--max-iterations 20 --beam-width 3 --candidates-per-parent 2 --revert-streak-limit 20 - Conservative search with strict keeps: raise
--min-gainabove0.02 - Exploratory search that accepts small wins: use
--min-gain 0.0
Live eval defaults are intentionally short:
- one optimization round
- three trigger-eval runs per query
- one trigger-eval worker
- no holdout cadence unless explicitly requested
For real repo skills at skills/<name>/SKILL.md, the live evaluator now prefers an isolated git worktree so the candidate content is scored at the real skill path. This is the default --eval-mode auto behavior and avoids scoring the installed skill instead of the candidate.
The registered-skill path also evaluates the current working copy, not just HEAD, so local uncommitted edits are measured correctly.
Step 5: Present results in UI
If you passed --report optimization-report.html, open the generated file in a browser. The report shows:
- Progress dashboard (status, baseline vs best, accepted/rejected counts)
- Convergence chart (train solid line, held-out dashed line, baseline dotted)
- Iteration table with verdict, composite score, delta, and change summary
- Expandable inline diffs per iteration (click any row)
Step 6: Review accepted snapshots
Not all ACCEPT iterations are real improvements — some may be harness artifacts. The user reviews the accepted iterations as candidate snapshots from the original target:
- Inspect each accepted iteration's diff in the report
- Use "Preview Combined" only as a comparison aid in the UI
- Use "Export Selected" to download a review JSON describing the selected snapshot diff
- In beam mode, review the retained frontier candidates first; they are the strongest candidates from the latest round
Step 7: Apply selected improvements to target file
Apply one reviewed improvement to the original target file.
- If you want the best single accepted variant, use
evals/iterations/best_variant.md. - Beam search still writes a single
best_variant.md: the highest-scoring accepted candidate seen anywhere in the run. - Choose scope deliberately:
description-onlyfor routing-trigger workbody-onlyfor behavioral work on the skill instructions themselves
- If you exported selected diffs, treat that JSON as review material only. It is not auto-applied by the current tooling, and the current workflow does not support merging multiple accepted diffs into a generated patch.
# Review the best accepted variant before applying
cat evals/iterations/best_variant.md | head -20
# Replace the target with the best accepted variant
cp evals/iterations/best_variant.md skills/{target}/SKILL.md
Step 8: Run final evaluation on FULL task set (train + test)
After applying improvements, run a final evaluation on ALL tasks (not just train) to verify the improvements generalize. Use evaluation-only mode by rerunning the optimizer with --max-iterations 0, which records the baseline for the current file without generating fresh variants:
python3 skills/agent-comparison/scripts/optimize_loop.py \
--target skills/{target}/SKILL.md \
--goal "{same goal}" \
--benchmark-tasks {full-task-file}.json \
--max-iterations 0 \
--report optimization-report.html \
--output-dir evals/final-check \
--verbose
Compare final scores to the baseline to confirm net improvement. In beam mode, the final report and results.json also include:
beam_widthcandidates_per_parentholdout_check_cadence- per-iteration frontier metadata (
selected_for_frontier,frontier_rank,parent_iteration)
Step 9: Record in learning-db
python3 scripts/learning-db.py learn \
--skill agent-comparison \
"autoresearch: {target} improved {baseline}→{best} over {iterations} iterations. \
Accepted: {accepted}/{total}. Stop: {reason}. Changes: {summaries}"
Gate: Optimization complete. Results reviewed. Cherry-picked improvements applied and verified against full task set. Results recorded.
Current Reality Check
The current optimizer is in a solid state for:
- deterministic proof runs
- isolated live evaluation of existing registered skills
- short live optimization of
read-only-ops, with the accepted description change now applied and validated againstreferences/read-only-ops-short-tasks.json - short live body optimization of
socratic-debugging, with the accepted instruction-body update now applied and validated againstreferences/socratic-debugging-body-short-tasks.json, now producing clean skill-triggered first-turn outputs instead of fallback chatter
One live-harness caveat remains:
- temporary renamed skill copies do not yet show reliable live trigger improvements through the dynamic command alias path
That caveat does not affect deterministic proof runs or live checks against existing registered skills, but it does mean the current system is stronger for optimizing real in-repo skills than arbitrary renamed temp clones.
For body optimization runs, the blind evaluator now rejects responses that:
- never triggered the target skill
- mention blocked skill/tool access
- fall back into generic "I'll guide you directly" behavior
Optional Extensions
These are off by default. Enable explicitly when needed:
- Multiple Runs: Run each benchmark 3x to account for variance
- Blind Evaluation: Hide agent identity during quality grading
- Extended Benchmark Suite: Run additional domain-specific tests
- Historical Tracking: Compare against previous benchmark runs
Error Handling
Error: "Agent Type Not Found"
Cause: Agent not registered or name misspelled Solution: Verify agent file exists in agents/ directory. Restart Claude Code client to pick up new definitions.
Error: "Tests Fail with Race Condition"
Cause: Concurrent code has data races Solution: This is a real quality difference. Record as a finding in the grade. Record as a finding for the agent being tested.
Error: "Different Test Counts Between Agents"
Cause: Agents wrote different test suites Solution: Valid data point. Grade on test coverage and quality, not raw count. More tests is not always better.
Error: "Timeout During Agent Execution"
Cause: Complex task taking too long or agent stuck in retry loop Solution: Note the timeout and number of retries attempted. Record as incomplete with partial metrics. Increase timeout limit if warranted, but excessive retries are a quality signal — an agent that needs many retries is less efficient regardless of final outcome.
References
${CLAUDE_SKILL_DIR}/references/methodology.md: Complete testing methodology with December 2024 data${CLAUDE_SKILL_DIR}/references/grading-rubric.md: Detailed grading criteria and quality checklists${CLAUDE_SKILL_DIR}/references/benchmark-tasks.md: Standard benchmark task descriptions and prompts${CLAUDE_SKILL_DIR}/references/report-template.md: Comparison report template with all required sections
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
voice-writer
Unified voice content generation pipeline with mandatory validation and joy-check. 9-phase pipeline: LOAD, GROUND, GENERATE, VALIDATE, REFINE, JOY-CHECK, OUTPUT, CLEANUP. Use when writing articles, blog posts, or any content that uses a voice profile. Use for "write article", "blog post", "write in voice", "generate content", "draft article", "write about".
image-auditor
Non-destructive image validation for accessibility and health.
video-editing
Video editing pipeline: cut footage, assemble clips via FFmpeg and Remotion.
comment-quality
Review and fix temporal references in code comments.
e2e-testing
Playwright-based end-to-end testing workflow.
anti-ai-editor
Remove AI-sounding patterns from content.
Didn't find tool you were looking for?