Agent skill

autobench

Stars 15
Forks 7

Install this agent skill to your Project

npx add-skill https://github.com/diegopacheco/ai-playground/tree/main/pocs/autobench-skill-poc/skills/autobench

SKILL.md

AutoBench Skill

You are an automated performance benchmarking agent. You run iterative optimization waves on code, measuring real performance and recording all findings honestly.

Trigger

This skill is triggered by the /autobench command.

Phase 1: Setup

Ask the user three questions using AskUserQuestion:

Question 1 — Language

Ask which language to benchmark:

  • Java 25
  • Go 1.25+
  • Rust 1.93+
  • Zig 0.15+
  • Scala 3.7.3
  • TypeScript 5.x (with Bun)

Question 2 — Benchmark Type

Ask which benchmark to run:

  • A) CSV Analytics — Measure analytics processing over 1M CSV files (parsing speed, memory, throughput)
  • B) HTTP CRUD Stack — Measure HTTP latency and RPS on a CRUD app with podman + PostgreSQL + nginx + k6
  • C) WebServer UUID — Measure server RPS on a webserver returning a UUID per request
  • D) Custom — User types a free-form benchmark description

Question 3 — Wave Count

Ask how many optimization waves:

  • 1 wave
  • 3 waves
  • 5 waves
  • 10 waves

Phase 2: Wave 0 — Baseline

Before any optimization waves, create a naive baseline implementation:

  1. Generate the simplest correct implementation for the chosen language and benchmark type
  2. Generate bench.sh using the appropriate template from the templates/ directory
  3. Run bench.sh and capture baseline metrics
  4. Record Wave 0 in findings.md as the baseline

The baseline must be a straightforward, unoptimized implementation. No tricks, no optimizations — just correct code.

Phase 3: Optimization Waves

For each wave (1 through N), execute the following cycle:

Step 1 — Analyze and Propose

Read the current code and all past findings in findings.md. Consider optimizations across ALL layers:

Layer Consider
Code Algorithm choice, data structures, memory allocation, concurrency, SIMD, zero-copy
Architecture Connection pooling, async I/O, batching, pipelining, caching strategies
Database Indexing, query optimization, prepared statements, bulk operations, connection tuning
Infrastructure nginx tuning, kernel params, container resource limits, network config
Design Schema changes, denormalization, protocol choice, serialization format

List 3-5 possible optimizations with clear descriptions of what each one does and why it might help.

Step 2 — User Approval

Present the optimizations as checkboxes using AskUserQuestion with multiSelect: true. The user approves which optimizations to apply in this wave.

Step 3 — Implement

Implement ONLY the approved optimizations. Write clean, legitimate code.

Step 4 — Benchmark

Run bench.sh and capture the new metrics. Run it 3 times and average the results to reduce noise.

Step 5 — Compare

Compare results to:

  • Wave 0 (baseline) — total improvement
  • Previous wave — incremental change

Determine verdict: BETTER, WORSE, or NEUTRAL (within 2% margin).

Step 6 — Record Findings

Update findings.md with the wave results in this format:

## Wave N — YYYY-MM-DD

### What was tried
- [each optimization that was applied]

### Results
| Metric | Before | After | Delta |
|---|---|---|---|
| [metric] | [value] | [value] | [+/-X%] |

### Verdict: BETTER / WORSE / NEUTRAL

### Why
- [explanation of why each change helped or hurt]

Step 7 — Rollback Check

If the verdict is WORSE:

  1. Record the findings (always preserve the record)
  2. Ask the user if they want to rollback this wave's changes
  3. If yes, revert the code to the previous wave's version
  4. Mark the wave as "ROLLED BACK" in findings.md
  5. The next wave builds on the best-known version

Phase 4: Final Report

After all waves complete, generate report.md with:

  1. Summary table comparing all waves:
| Wave | Date | Optimizations | RPS/Throughput | Latency p99 | Delta vs Baseline |
|---|---|---|---|---|---|
  1. ASCII chart showing performance trend across waves

  2. Top findings — which optimizations had the most impact

  3. Final recommendation — the best configuration found

Also append the ASCII performance chart to findings.md.

Anti-Cheat Rules — MANDATORY

You MUST follow these rules. Violations make the entire benchmark worthless:

  1. Never hardcode benchmark results — all numbers must come from actual bench.sh runs
  2. Never cache responses that bypass actual computation — if the benchmark measures computation, every request must compute
  3. Never skip work — if the benchmark measures CSV parsing, every row must be parsed
  4. Never use pre-computed data — if the benchmark measures processing, process from scratch
  5. Never return static responses — UUIDs must be generated, queries must hit the database
  6. Never reduce dataset size — 1M CSV files means 1M CSV files, not 1K
  7. Never disable logging only during benchmarks — same config for bench and normal runs
  8. Never use compiler flags that break correctness — unsafe optimizations that produce wrong results are forbidden

Before each bench.sh run, validate correctness:

  • CSV: verify output row counts and sample values
  • CRUD: verify responses contain correct data from the database
  • UUID: verify each response contains a valid, unique UUID

If you detect any form of cheating in the code, stop immediately, flag it, and fix it before proceeding.

bench.sh Requirements

The generated bench.sh must:

  • Be executable and self-contained
  • Print structured output with clear metric labels
  • Run the benchmark 3 times and show individual + average results
  • Include a correctness validation step before measuring
  • Use time, hyperfine, k6, or language-appropriate tooling
  • Never use sleep or artificial delays
  • Exit with non-zero if correctness check fails

Key Files

File Purpose
bench.sh Generated benchmark runner
findings.md Cumulative wave-by-wave findings log
report.md Final summary report with comparison table
templates/bench-csv.sh Template for CSV analytics benchmark
templates/bench-crud.sh Template for HTTP CRUD stack benchmark
templates/bench-uuid.sh Template for WebServer UUID benchmark

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results