Agent skill
start-evals
Start AI evals without overengineering. Create your first 20 test cases in a spreadsheet using the PM-Friendly Evals approach.
Install this agent skill to your Project
npx add-skill https://github.com/breethomas/bette-think/tree/main/plugins/bette-think/skills/start-evals
SKILL.md
Start Evals
Launch your AI evaluation process using the PM-Friendly Evals approach (Aman Khan + Hamel Husain).
Start with 20 test cases in a spreadsheet. Scale when ready. Error analysis > automation.
Entry Point
When this skill is invoked, start with:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
START EVALS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start with 20 test cases. Scale when ready.
What AI feature are you evaluating?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Usage
/start-evals [feature-name]
Examples:
- /start-evals "AI product recommendations" - Generate test cases
- /start-evals --create-project - Create Linear project for tracking
- /start-evals "customer support AI" --count 50 - Generate 50 test cases
What Happens
- Invokes the eval-generator agent
- Asks about your AI feature and quality criteria
- Generates 20 test cases (15 happy path + 5 edge cases)
- Provides spreadsheet template and workflow
- Optionally creates Linear project for tracking
The Philosophy
Good -> Better -> Best progression:
| Stage | Test Cases | Process | Tool |
|---|---|---|---|
| Good (Week 1) | 20 | Manual review | Spreadsheet |
| Better (Month 1-2) | 50-100 | Weekly reviews | LLM-as-judge |
| Best (Month 3+) | 200+ | Automated | CI/CD integration |
Start here. You're at "Good." Don't jump to automation.
What You'll Get
AI Evals Starter Kit: Product Recommendations
HAPPY PATH (15 cases):
1. Input: "Recommend a laptop under $800 for college"
   Expected: Mid-range laptops with student-friendly features, under budget
   Pass criteria: All recommendations < $800, suitable for students
2. Input: "Best phone for photography"
   Expected: High-end phones with excellent cameras
   Pass criteria: Focus on camera quality, not price
...
EDGE CASES (5 cases):
16. Input: "Phone for elderly person"
    Expected: Simple, large screen, easy to use
    Pass criteria: Prioritizes simplicity over features
    Why it's tricky: Must understand implicit needs
...
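To make the starter kit concrete, here is a minimal Python sketch that turns the sample cases above into CSV text you can paste or import into a spreadsheet. The column names (`id`, `category`, `actual_output`, `result`, `notes`, and so on) and the `to_csv` helper are assumptions chosen for illustration, not the template the skill itself produces.

```python
import csv
import io

# Hypothetical column layout; rename to match your own spreadsheet.
COLUMNS = ["id", "category", "input", "expected", "pass_criteria",
           "actual_output", "result", "notes"]

# The three sample cases shown above. The elided cases are not filled in here.
TEST_CASES = [
    {"id": 1, "category": "happy_path",
     "input": "Recommend a laptop under $800 for college",
     "expected": "Mid-range laptops with student-friendly features, under budget",
     "pass_criteria": "All recommendations < $800, suitable for students"},
    {"id": 2, "category": "happy_path",
     "input": "Best phone for photography",
     "expected": "High-end phones with excellent cameras",
     "pass_criteria": "Focus on camera quality, not price"},
    {"id": 16, "category": "edge_case",
     "input": "Phone for elderly person",
     "expected": "Simple, large screen, easy to use",
     "pass_criteria": "Prioritizes simplicity over features",
     "notes": "Why it's tricky: Must understand implicit needs"},
]


def to_csv(cases):
    """Return the cases as CSV text; unfilled columns are left blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(cases)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(TEST_CASES))
```

The `actual_output` and `result` columns start empty on purpose: they are filled in by hand during the manual review.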
Week 1 Workflow (2-3 hours)
- Copy test cases to spreadsheet (10 min)
- Run your AI against each input (1-2 hours)
- Record actual outputs
- Mark pass/fail
- Look for patterns in failures (30 min)
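Once the pass/fail column is filled in, the last two steps can be sketched in a few lines of Python. This is an illustration only: it assumes each spreadsheet row has been loaded as a dict with a `result` field (`"pass"` or `"fail"`) and an optional free-text `failure_tag` that you add while reviewing. Both field names and the `summarize` helper are hypothetical.

```python
from collections import Counter


def summarize(rows):
    """Return (pass_rate, failure_patterns) for manually graded rows.

    Rows without a 'pass' or 'fail' result are treated as not yet graded
    and are ignored.
    """
    graded = [r for r in rows if r.get("result") in ("pass", "fail")]
    failures = [r for r in graded if r["result"] == "fail"]
    pass_rate = (len(graded) - len(failures)) / len(graded) if graded else 0.0
    # Group failures by the reviewer's tag; most frequent pattern first.
    patterns = Counter(r.get("failure_tag") or "untagged" for r in failures)
    return pass_rate, patterns.most_common()


if __name__ == "__main__":
    sample = [
        {"result": "pass"},
        {"result": "pass"},
        {"result": "pass"},
        {"result": "fail", "failure_tag": "over_budget"},
        {"result": "fail", "failure_tag": "over_budget"},
    ]
    rate, patterns = summarize(sample)
    print(f"Pass rate: {rate:.0%}")
    for tag, count in patterns:
        print(f"{count} x {tag}")
```

A spreadsheet pivot table does the same job; the point is that failures get grouped by pattern rather than read one at a time.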
After 1-2 Weeks
| Signal | Action |
|---|---|
| Pass rate 80%+ | Add 10 more test cases |
| Pass rate <80% | Fix issues, rerun |
| 50-100 cases | Graduate to "Better" approach |
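The table above can be read as a small decision rule. The sketch below is one possible reading, not part of the skill: the `next_step` function is hypothetical, and the order of the checks (fix a low pass rate first, then consider graduating once the suite reaches 50 cases) is an assumption, since the table does not say which row wins when more than one applies.

```python
def next_step(pass_rate, total_cases):
    """Suggest the next action from a pass rate (0.0-1.0) and suite size.

    Assumed precedence: a pass rate under 80% always means "fix first".
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if pass_rate < 0.80:
        return "Fix issues, rerun"
    if total_cases >= 50:
        return 'Graduate to "Better" approach'
    return "Add 10 more test cases"


if __name__ == "__main__":
    print(next_step(0.85, 20))
    print(next_step(0.60, 20))
    print(next_step(0.90, 60))
```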
Common Questions
Q: 20 seems like too few. Should I start with 100? A: No. 20 cases covering your core use case > 100 cases you never run.
Q: How long does running 20 tests take? A: First time: 30-60 min. After that: 15-20 min per run.
Q: Do I need special tools? A: No. Spreadsheet works great. Graduate to tools when manual gets painful.
Ready to Scale?
| Signal | Next Step |
|---|---|
| You have 50+ test cases or see production failures | /upgrade-evals — Systematic error analysis on real traces |
| You need more diverse test inputs | /generate-test-data — Dimension-based synthetic data |
| Your AI feature uses retrieval (search, knowledge base) | /eval-rag — Separate retrieval from generation evaluation |
Related Commands
- /upgrade-evals - Error analysis on real traces (next step after this)
- /build-judge - LLM-as-Judge for subjective failure modes
- /generate-test-data - Diverse synthetic test inputs
- /eval-rag - RAG-specific retrieval + generation evaluation
- /calibrate - Ongoing post-launch calibration
- /ai-health-check - Full pre-launch readiness audit
- /ai-cost-check - Economic validation

Framework: PM-Friendly Evals (Aman Khan + Hamel Husain)
Key insight: "Error analysis is the most important activity. Start with 20 cases in a spreadsheet."