generate-test-data
Create diverse synthetic test inputs using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead).
Install this agent skill to your Project
npx add-skill https://github.com/breethomas/bette-think/tree/main/plugins/pm-thought-partner/skills/generate-test-data
SKILL.md
Generate Test Data
Generate diverse, realistic test inputs that cover the failure space of an LLM pipeline. Dimension-based tuples, not random generation.
Entry Point
When this skill is invoked, start with:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GENERATE TEST DATA
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Diverse inputs expose the failure space. Random generation doesn't.
What AI feature are we generating test data for?
What kinds of inputs does it take?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Prerequisites
Before generating synthetic data, identify where the pipeline is likely to fail. Ask the PM about known failure-prone areas, review existing user feedback, or form hypotheses from available traces. Dimensions (Step 1) must target anticipated failures, not arbitrary variation.
Core Process
Step 1: Define Dimensions
Dimensions are axes of variation specific to the application. The PM defines these — they know where failures happen.
Dimension 1: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Dimension 2: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Dimension 3: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Example for a customer support chatbot:
Query Type: what the user is asking about
Values: [billing, technical issue, account access, feature request, cancellation]
User Expertise: how technical the user is
Values: [non-technical, somewhat technical, power user]
Complexity: how many steps to resolve
Values: [single-step, multi-step, requires escalation]
Start with 3 dimensions. Add more only if initial traces reveal failure patterns along new axes.
Ask the PM: "What are the 3 most important ways inputs vary for your feature? Think about what makes some inputs harder than others."
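As a rough illustration, the dimensions could be held in a small data structure so that later steps can enumerate and validate them. This is a minimal sketch, not part of the skill itself: the `Dimension` class and `DIMENSIONS` name are hypothetical, and the values are copied from the customer support chatbot example above.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Dimension:
    name: str                 # e.g. "Query Type"
    captures: str             # what the dimension captures
    values: tuple[str, ...]   # the values the PM agreed on


# Hypothetical container; values come from the chatbot example.
DIMENSIONS = (
    Dimension("Query Type", "what the user is asking about",
              ("billing", "technical issue", "account access",
               "feature request", "cancellation")),
    Dimension("User Expertise", "how technical the user is",
              ("non-technical", "somewhat technical", "power user")),
    Dimension("Complexity", "how many steps to resolve",
              ("single-step", "multi-step", "requires escalation")),
)

# Size of the full combination space that tuples are drawn from.
total_combinations = prod(len(d.values) for d in DIMENSIONS)
print(total_combinations)  # 5 * 3 * 3 = 45
```

With three dimensions the space stays small enough (45 combinations here) for the PM to reason about which combinations are realistic.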
Step 2: Draft 20 Tuples with the PM
A tuple is one combination of dimension values defining a specific test case. Present 20 draft tuples to the PM and iterate until they confirm the tuples reflect realistic scenarios.
(Query Type: Billing, User Expertise: Non-technical, Complexity: Multi-step)
(Query Type: Technical Issue, User Expertise: Power User, Complexity: Single-step)
(Query Type: Cancellation, User Expertise: Non-technical, Complexity: Requires Escalation)
The PM's domain knowledge is essential. They know which combinations actually occur and which are unrealistic.
Claude Code executes: Generate the initial 20 tuples ensuring coverage across dimension values. Present to PM for validation.
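One possible way to draft the initial set is sketched below. It assumes a greedy strategy (shuffle all combinations, keep those that cover a not-yet-seen value, then fill up to the target count), which is an illustrative choice rather than anything the skill prescribes. The `draft_tuples` function and the `DIMENSIONS` dictionary are hypothetical names.

```python
import itertools
import random

# Dimension values copied from the customer support chatbot example.
DIMENSIONS = {
    "Query Type": ["billing", "technical issue", "account access",
                   "feature request", "cancellation"],
    "User Expertise": ["non-technical", "somewhat technical", "power user"],
    "Complexity": ["single-step", "multi-step", "requires escalation"],
}


def draft_tuples(dimensions, n=20, seed=0):
    """Pick n distinct tuples in which every dimension value appears at least once.

    Assumes n is at least the total number of dimension values.
    """
    rng = random.Random(seed)
    names = list(dimensions)
    pool = list(itertools.product(*dimensions.values()))
    rng.shuffle(pool)

    uncovered = {(name, value)
                 for name, values in dimensions.items() for value in values}
    chosen = []
    # Greedy pass: keep a combination only if it covers a value not seen yet.
    for combo in pool:
        newly_covered = set(zip(names, combo)) & uncovered
        if newly_covered:
            chosen.append(combo)
            uncovered -= newly_covered

    # Fill the remaining slots with other combinations from the shuffled pool.
    for combo in pool:
        if len(chosen) >= n:
            break
        if combo not in chosen:
            chosen.append(combo)

    return [dict(zip(names, combo)) for combo in chosen[:n]]


drafts = draft_tuples(DIMENSIONS)
for draft in drafts[:3]:
    print("(" + ", ".join(f"{k}: {v}" for k, v in draft.items()) + ")")
```

The output is only a draft: the PM still needs to strike combinations that never occur in practice and replace them with realistic ones.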
Step 3: Expand Tuples with LLM
Claude Code executes: Generate additional tuples using the PM-validated set as examples.
Generate 10 random combinations of ({dim1}, {dim2}, {dim3})
for a {application description}.
The dimensions are:
{dim1}: {description}. Possible values: {values}
{dim2}: {description}. Possible values: {values}
{dim3}: {description}. Possible values: {values}
Examples of realistic tuples, validated by the PM:
{validated tuples}
Output each tuple in the format: ({dim1}, {dim2}, {dim3})
Avoid duplicates. Vary values across dimensions.
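LLM output for this prompt usually needs checking before it is used. The sketch below assumes the model returns one parenthesised, comma-separated tuple of plain values per line, and drops anything malformed, out of vocabulary, or duplicated. `parse_tuples` and `SAMPLE_OUTPUT` are hypothetical; the sample text is made up for illustration and is not real model output.

```python
import re

# Dimension values copied from the customer support chatbot example.
DIMENSIONS = {
    "Query Type": ["billing", "technical issue", "account access",
                   "feature request", "cancellation"],
    "User Expertise": ["non-technical", "somewhat technical", "power user"],
    "Complexity": ["single-step", "multi-step", "requires escalation"],
}


def parse_tuples(llm_output, dimensions):
    """Extract valid, de-duplicated tuples from text shaped like '(a, b, c)'."""
    names = list(dimensions)
    allowed = [{v.lower() for v in values} for values in dimensions.values()]
    seen = set()
    parsed = []
    for match in re.finditer(r"\(([^()]*)\)", llm_output):
        parts = tuple(p.strip().lower() for p in match.group(1).split(","))
        if len(parts) != len(names):
            continue  # wrong number of fields
        if any(part not in ok for part, ok in zip(parts, allowed)):
            continue  # value outside the agreed dimension values
        if parts in seen:
            continue  # duplicate combination
        seen.add(parts)
        parsed.append(dict(zip(names, parts)))
    return parsed


# Made-up sample output for illustration only.
SAMPLE_OUTPUT = """
(billing, power user, multi-step)
(Billing, Power User, Multi-step)
(refund, power user, multi-step)
(cancellation, non-technical)
(account access, somewhat technical, requires escalation)
"""

tuples = parse_tuples(SAMPLE_OUTPUT, DIMENSIONS)
print(len(tuples))  # 2
```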
Step 4: Convert Tuples to Natural Language Queries
Keep this step separate from tuple generation. Generating tuples and queries together in a single prompt produces repetitive phrasing.
Claude Code executes: Convert each tuple to a realistic user query using a separate prompt per tuple.
We are generating synthetic user queries for a {application}.
{Brief description of what it does.}
Given:
{dim1}: {value}
{dim2}: {value}
{dim3}: {value}
Write a realistic query that a user might enter. The query should
reflect the specified characteristics.
Example: "{one of the PM-written examples}"
Now generate a new query.
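The per-tuple prompt above can be filled in programmatically, one prompt per tuple. The following is a sketch under assumptions: `build_query_prompt` is a hypothetical helper, and the application description and example query are placeholders that a PM would replace with their own.

```python
# Template text mirrors the prompt above; the field names are hypothetical.
QUERY_PROMPT = """We are generating synthetic user queries for a {application}.
{description}

Given:
{characteristics}

Write a realistic query that a user might enter. The query should
reflect the specified characteristics.

Example: "{example}"

Now generate a new query."""


def build_query_prompt(tuple_, application, description, example):
    """Render one prompt for one tuple (a dict of dimension name -> value)."""
    characteristics = "\n".join(f"{name}: {value}" for name, value in tuple_.items())
    return QUERY_PROMPT.format(
        application=application,
        description=description,
        characteristics=characteristics,
        example=example,
    )


tuples = [
    {"Query Type": "billing", "User Expertise": "non-technical", "Complexity": "multi-step"},
    {"Query Type": "cancellation", "User Expertise": "non-technical", "Complexity": "requires escalation"},
]

# Placeholder description and example; a PM-written example belongs here.
prompts = [
    build_query_prompt(
        t,
        application="customer support chatbot",
        description="It answers account and product questions for end users.",
        example="I was charged twice this month and I don't understand why.",
    )
    for t in tuples
]
print(prompts[0])
```

Sending each prompt in its own request, rather than batching them, keeps one query's phrasing from leaking into the next.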
Step 5: Filter for Quality
Review generated queries with the PM. Discard and regenerate when:
- Phrasing is awkward or unrealistic.
- Content doesn't match the tuple's intent.
- Queries are too similar to each other.
Claude Code executes: Rate each query's realism with an LLM, discard queries that score below the agreed threshold, and regenerate replacements from the same tuples.
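A filtering pass might look like the sketch below. It assumes realism scores on a 1 to 5 scale have already been produced by an LLM rater (the scoring call itself is not shown) and uses `difflib.SequenceMatcher` from the Python standard library as a simple stand-in for "too similar". The threshold values, the `filter_queries` name, and the sample queries are all hypothetical.

```python
from difflib import SequenceMatcher


def filter_queries(scored_queries, min_score=4, max_similarity=0.85):
    """Split (query, realism_score) pairs into kept queries and rejections.

    min_score and max_similarity are illustrative defaults, not recommendations.
    """
    kept = []
    rejected = []
    for query, score in scored_queries:
        if score < min_score:
            rejected.append((query, "low realism"))
            continue
        # Compare against everything already kept to catch near-duplicates.
        too_similar = any(
            SequenceMatcher(None, query.lower(), other.lower()).ratio() > max_similarity
            for other in kept
        )
        if too_similar:
            rejected.append((query, "near duplicate"))
            continue
        kept.append(query)
    return kept, rejected


# Made-up queries and scores for illustration only.
scored = [
    ("Why was I charged twice this month?", 5),
    ("Why was I charged twice this month??", 5),
    ("Please articulate the billing discrepancy resolution procedure.", 2),
    ("My API key stopped working after I rotated it.", 4),
]

kept, rejected = filter_queries(scored)
print(kept)
print(rejected)
```

Rejected queries keep their reason so that the tuples behind them can be sent back for regeneration.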
Step 6: Run Through Pipeline
Execute all queries through the full LLM pipeline. Capture complete traces: input, all intermediate steps, tool calls, retrieved docs, final output.
Target: ~100 high-quality, diverse traces. This is a rough heuristic for reaching saturation, the point where additional traces stop revealing new failure modes.
Claude Code executes: Run the queries, capture traces, format for analysis. These traces feed directly into /upgrade-evals for error analysis.
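The trace fields listed above could be captured in a record like the one sketched here and written out as JSON Lines. Everything in this example is an assumption for illustration: the `Trace` fields, the `capture_traces` and `to_jsonl` helpers, and especially `fake_pipeline`, which stands in for the real LLM pipeline.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Optional


@dataclass
class Trace:
    query: str
    intermediate_steps: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    retrieved_docs: list = field(default_factory=list)
    final_output: str = ""
    error: Optional[str] = None


def capture_traces(queries, run_pipeline: Callable[[str, Trace], str]):
    """Run every query and keep a trace even when the pipeline raises."""
    traces = []
    for query in queries:
        trace = Trace(query=query)
        try:
            trace.final_output = run_pipeline(query, trace)
        except Exception as exc:  # a failed run is still a useful trace
            trace.error = repr(exc)
        traces.append(trace)
    return traces


def to_jsonl(traces):
    """Serialise traces as one JSON object per line."""
    return "\n".join(json.dumps(asdict(t)) for t in traces)


def fake_pipeline(query, trace):
    # Hypothetical stand-in for the real pipeline; it records what it did.
    trace.intermediate_steps.append("classified intent")
    trace.retrieved_docs.append("doc-001")
    if "crash" in query:
        raise RuntimeError("tool timeout")
    return "Stub answer for: " + query


traces = capture_traces(
    ["Why was I charged twice?", "The app will crash when I export."],
    fake_pipeline,
)
jsonl = to_jsonl(traces)
print(jsonl)
```

Keeping failed runs in the trace file matters because errors and timeouts are themselves failure modes worth analysing.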
Output
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TEST DATA GENERATED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Dimensions: [dim1], [dim2], [dim3]
Tuples generated: [count]
Queries generated: [count]
Queries after filtering: [count]
DIMENSION COVERAGE:
| Dimension | Values Covered | Gaps |
|-----------|---------------|------|
| [dim1] | [X/Y] | [any missing] |
| [dim2] | [X/Y] | [any missing] |
| [dim3] | [X/Y] | [any missing] |
NEXT STEPS:
- Run /upgrade-evals on these traces for error analysis
- Run /build-judge for failure modes that need automated evaluation
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
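The DIMENSION COVERAGE table in the report can be derived from the final tuples. The sketch below shows one way to compute the "Values Covered" and "Gaps" columns. The `coverage_rows` helper is hypothetical, and the three sample tuples are the ones shown in Step 2.

```python
# Dimension values copied from the customer support chatbot example.
DIMENSIONS = {
    "Query Type": ["billing", "technical issue", "account access",
                   "feature request", "cancellation"],
    "User Expertise": ["non-technical", "somewhat technical", "power user"],
    "Complexity": ["single-step", "multi-step", "requires escalation"],
}


def coverage_rows(dimensions, tuples):
    """Return (dimension, 'X/Y', gaps) for each dimension."""
    rows = []
    for name, values in dimensions.items():
        seen = {t[name] for t in tuples}
        missing = [v for v in values if v not in seen]
        covered = f"{len(values) - len(missing)}/{len(values)}"
        rows.append((name, covered, ", ".join(missing) or "none"))
    return rows


tuples = [
    {"Query Type": "billing", "User Expertise": "non-technical", "Complexity": "multi-step"},
    {"Query Type": "technical issue", "User Expertise": "power user", "Complexity": "single-step"},
    {"Query Type": "cancellation", "User Expertise": "non-technical", "Complexity": "requires escalation"},
]

rows = coverage_rows(DIMENSIONS, tuples)
print("| Dimension | Values Covered | Gaps |")
print("|-----------|---------------|------|")
for name, covered, gaps in rows:
    print(f"| {name} | {covered} | {gaps} |")
```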
When Real Data Exists
When you have real queries available, don't just sample randomly. Use stratified sampling:
- Identify high-variance dimensions — read through queries and find ways they differ (length, topic, complexity, presence of constraints).
- Assign labels — for small sets, with the PM; for large sets, use K-means clustering on query embeddings.
- Sample from each group — ensures coverage across query types, not just the most common ones.
Use synthetic data to fill gaps in underrepresented query types.
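The sampling and gap-filling steps above can be sketched as follows. The example assumes labels have already been assigned (by the PM or by clustering, neither of which is shown) and uses only the standard library. `stratified_sample`, the labels, and the queries are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(labeled_queries, per_group, seed=0):
    """Sample up to per_group queries from each label and report shortfalls."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for query, label in labeled_queries:
        groups[label].append(query)

    sample = {}
    gaps = {}
    for label, queries in groups.items():
        k = min(per_group, len(queries))
        sample[label] = rng.sample(queries, k)
        if k < per_group:
            # Underrepresented group: candidates for synthetic fill-in.
            gaps[label] = per_group - k
    return sample, gaps


# Made-up labeled queries for illustration only.
labeled = [(f"billing question {i}", "billing") for i in range(6)]
labeled.append(("please cancel my plan", "cancellation"))

sample, gaps = stratified_sample(labeled, per_group=3)
print({label: len(qs) for label, qs in sample.items()})
print(gaps)
```

Random sampling from the same data would mostly return billing questions; sampling per group keeps the rare cancellation case in the set and shows how many synthetic queries are needed to balance it.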
Anti-Patterns
- Unstructured generation. "Give me test queries" without dimensions produces generic, repetitive, happy-path examples.
- Single-step generation. Generating tuples and queries in one prompt produces less diverse results.
- Arbitrary dimensions. Dimensions that don't target failure-prone regions waste test budget.
- Skipping PM review of tuples. Without the PM validating tuples, you can't judge realism.
- Synthetic data when no one can judge realism. If no one can tell whether a synthetic trace is realistic, use real data.
- Synthetic data for complex domain-specific content. LLMs miss structural nuance in material such as legal filings and medical records.
Methodology: Adapted from Hamel Husain's generate-synthetic-data skill (evals-skills, MIT license).
PM adaptation: The PM defines dimensions and validates realism; Claude Code handles generation and pipeline execution.