Agent skill

eval

Evaluate agent quality across three modes — without BK, BK grep-only, and BK full

Install this agent skill to your Project

npx add-skill https://github.com/blueraai/bluera-knowledge/tree/main/skills/eval

SKILL.md

Agent Quality Evaluation

Compare how well Claude answers library questions across three access levels:

  • Without BK — web search + training knowledge only
  • BK Grep — Grep/Read/Glob on cloned repos, no vector search
  • BK Full — vector search + get_full_context + Grep/Read

Arguments

Parse $ARGUMENTS:

  • No arguments: Show usage help
  • Quoted string: Run eval for that single question
  • --predefined: Run all predefined queries
  • --predefined N: Run predefined query #N only
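The four argument cases above can be sketched as a small parser. This is only an illustration: it assumes `$ARGUMENTS` arrives as a single raw string, and the `EvalRequest` type, its `mode` labels, and the `parse_arguments` function are hypothetical names that do not come from the skill itself.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRequest:
    """Hypothetical container for the parsed request (not part of the skill)."""
    mode: str  # "help", "single", "predefined_all", or "predefined_one"
    question: Optional[str] = None
    index: Optional[int] = None


def parse_arguments(raw: str) -> EvalRequest:
    """Map a raw argument string onto one of the four documented cases."""
    # shlex.split honours shell-style quoting, so a quoted question
    # comes back as a single token with the quotes removed.
    tokens = shlex.split(raw)

    if not tokens:
        # No arguments: show usage help.
        return EvalRequest(mode="help")

    if tokens[0] == "--predefined":
        if len(tokens) == 1:
            # --predefined: run all predefined queries.
            return EvalRequest(mode="predefined_all")
        # --predefined N: run predefined query #N only.
        # int() raises ValueError if N is not a number.
        return EvalRequest(mode="predefined_one", index=int(tokens[1]))

    # Anything else is treated as a single question.
    return EvalRequest(mode="single", question=" ".join(tokens))
```

For example, `parse_arguments("--predefined 3")` yields a request with `mode="predefined_one"` and `index=3`, while an empty string falls through to the help case.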

Workflow

  1. Prerequisites: Call execute with { command: "stores" } to list stores. Abort if none.
  2. Resolve queries: Load predefined queries from $CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml, or use the quoted question passed in the arguments.
  3. Load templates: Read the agent prompts and the judge rubric from $CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/.
  4. Spawn 3 agents in parallel per query, one per access level (replace {{QUESTION}}, {{STORES}}, {{STORE_PATHS}} in each prompt).
  5. Judge: Score each answer on all 4 criteria (1-5 each): Accuracy, Specificity, Completeness, Source Grounding.
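Steps 4 and 5 involve two mechanical pieces: filling the `{{...}}` placeholders in a prompt template and checking that a judge result covers all four criteria on the 1-5 scale. The sketch below shows one way to do both. The function names, the dictionary shape for scores, and the way store names and paths are joined into strings are all assumptions made for illustration; the skill's actual procedures are described in references/procedures.md.

```python
import re
from typing import Dict, List

# The four criteria named in the workflow.
CRITERIA = ("Accuracy", "Specificity", "Completeness", "Source Grounding")


def fill_template(template: str, values: Dict[str, str]) -> str:
    """Replace {{NAME}} placeholders in one pass; unknown names are left as-is."""
    # A single regex pass avoids re-substituting placeholder-like text
    # that happens to appear inside an already inserted value.
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: values.get(m.group(1), m.group(0)),
        template,
    )


def build_prompt(template: str, question: str,
                 stores: List[str], store_paths: List[str]) -> str:
    """Hypothetical helper: the joining format below is an assumption."""
    return fill_template(template, {
        "QUESTION": question,
        "STORES": ", ".join(stores),
        "STORE_PATHS": "\n".join(store_paths),
    })


def validate_scores(scores: Dict[str, int]) -> Dict[str, int]:
    """Check that every criterion is present and scored from 1 to 5."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(f"{criterion} must be between 1 and 5")
    # Return only the known criteria, in a stable order.
    return {c: scores[c] for c in CRITERIA}
```

Leaving unknown placeholders untouched makes a typo in a template visible in the rendered prompt instead of silently producing an empty string.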

Detailed procedures: references/procedures.md

Output format: references/output-format.md
