Agent skill
run-test-plan
Execute YAML test plan, stop on first failure, output rich debug prompt
Install this agent skill to your Project
npx add-skill https://github.com/existential-birds/beagle/tree/main/plugins/beagle-testing/skills/run-test-plan
SKILL.md
Run Test Plan
Execute a YAML test plan, run setup commands, health checks, and each test sequentially. Stop on first failure with rich debug output.
Prerequisites
- agent-browser skill: Browser tests require the
agent-browser:agent-browserskill to be available
Arguments
--plan <path>: Path to test plan (default:docs/testing/test-plan.yaml)--skip-setup: Skip setup commands and health checks (for re-running after failure)
Step 1: Parse Test Plan
Read and validate the test plan:
# Check file exists
ls docs/testing/test-plan.yaml || { echo "Error: Test plan not found"; exit 1; }
# Validate YAML
python3 -c "import yaml; yaml.safe_load(open('docs/testing/test-plan.yaml'))" || { echo "Error: Invalid YAML"; exit 1; }
Extract from the YAML:
setup.commands: List of setup commandssetup.health_checks: List of URLs to polltests: Array of test cases
Step 2: Run Setup Commands (unless --skip-setup)
Execute setup commands sequentially:
# For each command in setup.commands
<command> || { echo "Setup failed: <command>"; exit 1; }
For commands that start services (e.g., pnpm run dev, docker-compose up -d):
- Run in background
- Capture PID for cleanup
- Continue to health checks
# Start dev servers in background
nohup <dev_command> > .beagle/dev-server.log 2>&1 &
echo $! > .beagle/dev-server.pid
Step 3: Run Health Checks
Poll each health check URL until healthy or timeout:
# For each health_check
timeout=<health_check.timeout or 30>
url=<health_check.url>
elapsed=0
while [ $elapsed -lt $timeout ]; do
if curl -s -o /dev/null -w "%{http_code}" "$url" | grep -qE "^(200|301|302)"; then
echo "✓ Health check passed: $url"
break
fi
sleep 2
elapsed=$((elapsed + 2))
done
if [ $elapsed -ge $timeout ]; then
echo "✗ Health check timeout: $url"
exit 1
fi
Step 4: Execute Tests Sequentially
For each test in the plan:
4a. Log Test Start
## Running: TC-XX - <test.name>
Context: <test.context>
4b. Execute Steps
For each step in test.steps:
curl actions:
curl -X <method> \
-H "Content-Type: application/json" \
<additional headers> \
-d '<body>' \
"<url>" \
-o response.json \
-w "%{http_code}" > status_code.txt
# Capture response for evaluation
cat response.json
cat status_code.txt
agent-browser CLI actions:
Execute browser steps as agent-browser CLI commands via Bash. Each run: step in the test plan is a CLI command:
# Navigate
agent-browser open <url>
# Snapshot interactive elements (always do before interacting)
agent-browser snapshot -i
# Interact using refs from snapshot output (@e1, @e2, etc.)
agent-browser fill @<ref> "<value>"
agent-browser click @<ref>
# Wait for conditions
agent-browser wait --url "<pattern>"
agent-browser wait --text "<text>"
agent-browser wait --load networkidle
# Capture evidence
agent-browser screenshot docs/testing/evidence/<test.id>.png
Important: Always run agent-browser snapshot -i before interacting with elements to get valid refs, and re-snapshot after navigation or significant DOM changes.
Save screenshots to docs/testing/evidence/<test.id>.png
4c. Evaluate Result
Using agent reasoning, compare actual outcome against test.expected:
- Read the expected behavior description
- Compare with actual response/screenshot
- Determine PASS or FAIL
4d. On PASS
✓ TC-XX PASSED: <test.name>
Continue to next test.
4e. On FAIL
Stop immediately. Go to Step 6.
Step 5: On All Tests Pass
## Test Results: ALL PASSED
| ID | Name | Result |
|----|------|--------|
| TC-01 | <name> | ✓ PASS |
| TC-02 | <name> | ✓ PASS |
| ... | ... | ... |
**Total:** N/N tests passed
### Evidence
Screenshots saved to `docs/testing/evidence/`
### Cleanup
Stopping background services...
Clean up:
# Kill dev servers
if [ -f .beagle/dev-server.pid ]; then
kill $(cat .beagle/dev-server.pid) 2>/dev/null
rm .beagle/dev-server.pid
fi
Step 6: On Failure - Generate Debug Prompt
When a test fails, generate rich debug output:
6a. Gather Context
# Get changed files relevant to the failure
git diff --name-only $(git merge-base HEAD origin/main)..HEAD
# Get recent changes in files mentioned in test.context
git diff $(git merge-base HEAD origin/main)..HEAD -- <relevant_files>
6b. Output Debug Report
## Test Failure: TC-XX - <test.name>
### What Failed
**Test:** <test.name>
**Expected:**
<test.expected>
**Actual:**
<Describe what actually happened - response code, error message, screenshot description>
### Relevant Changes in This PR
<For each file mentioned in test.context or related to the failure:>
- `<file>` (lines X-Y) - <brief description of changes>
### Evidence
<If screenshot exists:>
- Screenshot: `docs/testing/evidence/<test.id>.png`
<If API response:>
- Status code: <code>
- Response body:
```json
<response>
Error Details
Suggested Investigation
<Based on the error, suggest 2-3 specific things to check:>
- <Third thing about environment/setup>
Debug Session Prompt
Copy this to start a new Claude session:
I'm debugging a test failure in branch <branch>.
Test: <test.name> Error:
Relevant files: <List changed files related to this test>
Help me investigate why .
### 6c. Preserve Evidence
```bash
# Ensure evidence directory exists
mkdir -p docs/testing/evidence
# Save failure context
cat > docs/testing/evidence/<test.id>-failure.md << 'EOF'
# Failure Report: <test.id>
<Full debug report content>
EOF
6d. Cleanup and Exit
# Kill dev servers
if [ -f .beagle/dev-server.pid ]; then
kill $(cat .beagle/dev-server.pid) 2>/dev/null
rm .beagle/dev-server.pid
fi
Test Results Summary Table
Always output a summary table showing progress:
## Test Results
| ID | Name | Result |
|----|------|--------|
| TC-01 | <name> | ✓ PASS |
| TC-02 | <name> | ✗ FAIL |
| TC-03 | <name> | - SKIP |
**Passed:** 1/3
**Failed:** TC-02
Tests after a failure are marked as SKIP (not executed).
Verification
Before completing:
# Verify evidence directory exists
ls -la docs/testing/evidence/
# List captured evidence
ls docs/testing/evidence/*.png docs/testing/evidence/*.md 2>/dev/null
Verification Checklist:
- Setup commands executed successfully
- Health checks passed before test execution
- Each executed test has recorded result
- Evidence captured in
docs/testing/evidence/ - On failure: debug prompt includes expected vs actual
- On failure: relevant PR changes listed
- Background processes cleaned up
Rules
- Stop on first test failure (do not continue to other tests)
- Always capture evidence (screenshots, responses)
- Include file:line references in debug prompts when possible
- Use
--skip-setupflag to re-run after fixing issues - Never hardcode secrets - use environment variables
- Clean up background processes even on failure
- Preserve failure evidence for debugging
- Make debug prompts copy-paste ready for new sessions
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
review-python
Comprehensive Python/FastAPI backend code review with optional parallel agents
review-verification-protocol
Mandatory verification steps for all code reviews to reduce false positives. Load this skill before reporting ANY code review findings.
sqlalchemy-code-review
Reviews SQLAlchemy code for session management, relationships, N+1 queries, and migration patterns. Use when reviewing SQLAlchemy 2.0 code, checking session lifecycle, relationship() usage, or Alembic migrations.
fastapi-code-review
Reviews FastAPI code for routing patterns, dependency injection, validation, and async handlers. Use when reviewing FastAPI apps, checking APIRouter setup, Depends() usage, or response models.
pytest-code-review
Reviews pytest test code for async patterns, fixtures, parametrize, and mocking. Use when reviewing test_*.py files, checking async test functions, fixture usage, or mock patterns.
postgres-code-review
Reviews PostgreSQL code for indexing strategies, JSONB operations, connection pooling, and transaction safety. Use when reviewing SQL queries, database schemas, JSONB usage, or connection management.
Didn't find tool you were looking for?