Agent skill

start-evals

Start AI evals without overengineering. Create your first 20 test cases in a spreadsheet using the PM-Friendly Evals approach.

Install this agent skill in your project

npx add-skill https://github.com/breethomas/bette-think/tree/main/plugins/bette-think/skills/start-evals

SKILL.md

Start Evals

Launch your AI evaluation process using the PM-Friendly Evals approach (Aman Khan + Hamel Husain).

Start with 20 test cases in a spreadsheet. Scale when ready. Error analysis > automation.

Entry Point

When this skill is invoked, start with:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 START EVALS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Start with 20 test cases. Scale when ready.

What AI feature are you evaluating?

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Usage

/start-evals [feature-name]

Examples:

/start-evals "AI product recommendations" - Generate test cases
/start-evals --create-project - Create Linear project for tracking
/start-evals "customer support AI" --count 50 - Generate 50 test cases

What Happens

Invokes the eval-generator agent
Asks about your AI feature and quality criteria
Generates 20 test cases (15 happy path + 5 edge cases)
Provides spreadsheet template and workflow
Optionally creates Linear project for tracking

The Philosophy

Good -> Better -> Best progression:

Stage	Test Cases	Process	Tool
Good (Week 1)	20	Manual review	Spreadsheet
Better (Month 1-2)	50-100	LLM-as-judge	Weekly reviews
Best (Month 3+)	200+	Automated	CI/CD integration

Start here. You're at "Good." Don't jump to automation.

What You'll Get

AI Evals Starter Kit: Product Recommendations

HAPPY PATH (15 cases):

1. Input: "Recommend a laptop under $800 for college"
   Expected: Mid-range laptops with student-friendly features, under budget
   Pass criteria: All recommendations < $800, suitable for students

2. Input: "Best phone for photography"
   Expected: High-end phones with excellent cameras
   Pass criteria: Focus on camera quality, not price

...

EDGE CASES (5 cases):

16. Input: "Phone for elderly person"
    Expected: Simple, large screen, easy to use
    Pass criteria: Prioritizes simplicity over features
    Why it's tricky: Must understand implicit needs

...
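The test-case format above maps naturally onto spreadsheet columns. The following is a minimal sketch, assuming a column layout that the skill itself does not specify: the `EvalCase` class, its field names, and the `to_csv` helper are all hypothetical, and only the cases shown above are included.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class EvalCase:
    """One spreadsheet row. Column names are illustrative assumptions."""
    case_id: int
    kind: str               # "happy_path" or "edge_case"
    input: str
    expected: str
    pass_criteria: str
    why_tricky: str = ""    # only filled in for edge cases
    actual_output: str = ""  # recorded during the run
    result: str = ""         # "pass" or "fail", marked by the reviewer


# The three cases that appear in the starter kit excerpt above.
CASES = [
    EvalCase(1, "happy_path",
             "Recommend a laptop under $800 for college",
             "Mid-range laptops with student-friendly features, under budget",
             "All recommendations < $800, suitable for students"),
    EvalCase(2, "happy_path",
             "Best phone for photography",
             "High-end phones with excellent cameras",
             "Focus on camera quality, not price"),
    EvalCase(16, "edge_case",
             "Phone for elderly person",
             "Simple, large screen, easy to use",
             "Prioritizes simplicity over features",
             "Must understand implicit needs"),
]


def to_csv(cases) -> str:
    """Serialize the cases as CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(EvalCase)])
    writer.writeheader()
    for case in cases:
        writer.writerow(asdict(case))
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(CASES))
```

Saving the returned text to a `.csv` file and opening it in any spreadsheet tool gives a sheet with empty `actual_output` and `result` columns ready for the first manual run.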

Week 1 Workflow (2-3 hours)

Copy test cases to spreadsheet (10 min)
Run your AI against each input (1-2 hours)
Record actual outputs
Mark pass/fail
Look for patterns in failures (30 min)
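The workflow above can be sketched in a few lines once the sheet is loaded as a list of rows. This is an illustration under stated assumptions, not part of the skill: `run_feature` is a hypothetical stand-in for your AI feature, the `failure_note` column is an assumed free-text note the reviewer adds to each failing row, and `SAMPLE_ROWS` is made-up data with hand-marked results.

```python
from collections import Counter


def run_feature(user_input: str) -> str:
    """Hypothetical stand-in: replace with a call to the AI feature under test."""
    return f"(model output for: {user_input})"


def record_outputs(rows):
    """Fill in actual_output for each row; pass/fail is still marked by a person."""
    for row in rows:
        row["actual_output"] = run_feature(row["input"])
    return rows


def summarize(rows):
    """Return the pass rate and a tally of the reviewer's failure notes."""
    graded = [r for r in rows if r.get("result") in ("pass", "fail")]
    passed = sum(1 for r in graded if r["result"] == "pass")
    pass_rate = passed / len(graded) if graded else 0.0
    failure_notes = Counter(
        r.get("failure_note") or "unlabeled"
        for r in graded
        if r["result"] == "fail"
    )
    return pass_rate, failure_notes


# Made-up rows for illustration only; results and notes are entered by hand.
SAMPLE_ROWS = [
    {"input": "Recommend a laptop under $800 for college", "result": "pass"},
    {"input": "Best phone for photography", "result": "pass"},
    {"input": "Phone for elderly person", "result": "fail",
     "failure_note": "missed implicit need"},
    {"input": "Cheap laptop", "result": "fail", "failure_note": ""},
]

if __name__ == "__main__":
    record_outputs(SAMPLE_ROWS)
    rate, notes = summarize(SAMPLE_ROWS)
    print(f"pass rate: {rate:.0%}")
    for note, count in notes.most_common():
        print(f"{count} x {note}")
```

Tallying short failure notes is one simple way to look for patterns; the notes that recur most often are the first candidates for a fix.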

After 1-2 Weeks

Signal	Action
Pass rate 80%+	Add 10 more test cases
Pass rate <80%	Fix issues, rerun
50-100 cases	Graduate to "Better" approach
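The table above can be read as a small decision rule. The sketch below is one possible reading, with a hypothetical function name `next_action`. The order of the checks is an assumption: the table does not say which row wins when a suite has 50+ cases and a pass rate below 80%, so this version fixes failures first.

```python
def next_action(pass_rate: float, case_count: int) -> str:
    """Map a pass rate (0.0-1.0) and suite size to the action from the table.

    Assumption: a low pass rate takes priority over suite size.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0.0 and 1.0")
    if pass_rate < 0.80:
        return "Fix issues, rerun"
    if case_count >= 50:
        return 'Graduate to "Better" approach'
    return "Add 10 more test cases"


if __name__ == "__main__":
    print(next_action(0.85, 20))
    print(next_action(0.60, 20))
    print(next_action(0.90, 60))
```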

Common Questions

Q: 20 seems like too few. Should I start with 100?
A: No. 20 cases covering your core use case > 100 cases you never run.

Q: How long does running 20 tests take?
A: First time: 30-60 min. After that: 15-20 min per run.

Q: Do I need special tools?
A: No. Spreadsheet works great. Graduate to tools when manual gets painful.

Ready to Scale?

Signal	Next Step
You have 50+ test cases or see production failures	/upgrade-evals - Systematic error analysis on real traces
You need more diverse test inputs	/generate-test-data - Dimension-based synthetic data
Your AI feature uses retrieval (search, knowledge base)	/eval-rag - Separate retrieval from generation evaluation

Related Commands

/upgrade-evals - Error analysis on real traces (next step after this)
/build-judge - LLM-as-Judge for subjective failure modes
/generate-test-data - Diverse synthetic test inputs
/eval-rag - RAG-specific retrieval + generation evaluation
/calibrate - Ongoing post-launch calibration
/ai-health-check - Full pre-launch readiness audit
/ai-cost-check - Economic validation

Framework: PM-Friendly Evals (Aman Khan + Hamel Husain)
Key insight: "Error analysis is the most important activity. Start with 20 cases in a spreadsheet."

Maintainer

breethomas Core maintainer

Source details

Full Name: breethomas/bette-think
Branch: main
Path in repo: plugins/bette-think/skills/start-evals
License: Other
Topics: ai claude-code ai-agents pm-frameworks pm-tools product-management


