Agent skill

test-driven-development

Use when user explicitly requests test-driven development. Triggers: 'TDD', 'write tests first', 'red green refactor', 'test-first', 'start with the test'. Also invoked by develop and executing-plans for implementation tasks. NOT for: full feature work (use develop, which includes TDD internally).

Stars 5
Forks 2

Install this agent skill to your Project

npx add-skill https://github.com/axiomantic/spellbook/tree/main/skills/test-driven-development

SKILL.md

Test-Driven Development

Invariant Principles

  1. Failure Proves Testing - Test passing immediately proves nothing. Only watching failure proves test detects what it claims.
  2. Order Creates Trust - Tests-first answer "what should this do?" Tests-after answer "what does this do?" Fundamentally different questions.
  3. Minimal Sufficiency - Write exactly enough code to pass. YAGNI violations compound into untested complexity.
  4. Deletion Over Adaptation - Code written before tests is contaminated. Keeping "as reference" means testing after. Delete means delete.

Violating the letter of the rules is violating the spirit of the rules.

Inputs

Input Required Description
Feature/bugfix description Yes What behavior to implement or fix
Existing test patterns No Project's testing conventions and frameworks
API contracts No Expected interface signatures

Outputs

Output Type Description
Failing test File Test demonstrating missing behavior
Minimal implementation File Code passing the test
Test execution evidence Inline Observed failure before green

When to Use

Always: new features, bug fixes, refactoring, behavior changes.

Exceptions (ask human partner; no human available? default: apply TDD):

  • Throwaway prototypes
  • Generated code
  • Configuration files

Thinking "skip TDD just this once"? Stop. That's rationalization.

The Iron Law

NO PRODUCTION CODE WITHOUT FAILING TEST FIRST

Code before test? Delete. Start over. No "reference," no "adapting," no looking at it.

Reasoning Schema

Red-Green-Refactor

RED: Write Failing Test

One behavior. Clear name. Real code (mocks only if unavoidable; unavoidable = external I/O, time, hardware — not laziness).

const result = await retryOperation(operation);

expect(result).toBe('success'); expect(attempts).toBe(3); });

Clear name, tests real behavior, one thing
</Good>

<Bad>
```typescript
test('retry works', async () => {
  const mock = jest.fn()
    .mockRejectedValueOnce(new Error())
    .mockRejectedValueOnce(new Error())
    .mockResolvedValueOnce('success');
  await retryOperation(mock);
  expect(mock).toHaveBeenCalledTimes(3);
});

Vague name, tests mock not code </Bad>

Verify RED: Watch It Fail

MANDATORY. Never skip.

bash
npm test path/to/test.test.ts

Confirm:

  • Fails (not errors)
  • Failure message expected
  • Fails because feature missing (not typos)

Test passes? Testing existing behavior. Fix test. Test errors? Fix error, re-run until it fails correctly.

GREEN: Minimal Code

Simplest code to pass. No features, no refactoring, no "improvements."

Verify GREEN: Watch It Pass

MANDATORY.

bash
npm test path/to/test.test.ts

Confirm:

  • Test passes
  • Other tests still pass
  • Output pristine (no errors, warnings)

Test fails? Fix code, not test. Other tests fail? Fix now.

REFACTOR: Clean Up

After green only. Remove duplication, improve names, extract helpers. Keep tests green. Don't add behavior.

Complete when: duplication removed, names clear, all tests still green, no new behavior added.

Repeat

Next failing test for next feature. The cycle continues until all behavior is implemented.

Good Tests

Quality Good Bad
Minimal One thing. "and" in name? Split it. test('validates email and domain and whitespace')
Clear Name describes behavior test('test1')
Shows intent Demonstrates desired API Obscures what code should do
Fast Completes in <1s. No I/O or heavy fixtures in unit tests. 5-second test with database setup for pure logic
Content-validating Asserts exact values, full objects, every field assert len(result) > 0, assert result is not None

Assertion Quality

Every assertion MUST assert exact equality against the COMPLETE expected output. This applies to ALL output -- static, dynamic, or partially dynamic. There are no categories of output exempt from this rule.

python
# CORRECT: exact equality on complete static output
assert result == expected_complete_output

# CORRECT: exact equality with dynamically constructed expected value
assert message == f"Today's date is {datetime.date.today().isoformat()}"

# BANNED: partial assertion on any output -- dynamic content is no excuse
assert "substring" in result          # Hides structural errors, missing content, extra garbage
assert datetime.date.today().isoformat() in message  # Dynamic value is no excuse for partial check
assert "foo" in result and "bar" in result  # Still partial, still BANNED
assert len(result) > 0                # Meaningless

# BANNED: mock.ANY hides argument values
mock_fn.assert_called_with(mock.ANY, mock.ANY)

ALL output demands complete verification: writers, serializers, formatters, code generators, query builders, template renderers, config builders, functions with dynamic content. For multi-line output, use triple-quoted strings or dedent helpers. Output length is NEVER a justification for partial assertions.

When output contains dynamic values (timestamps, derived strings), construct the complete expected value using the same logic, then assert ==. Do not assert partial membership of the dynamic value.

For mock assertions: assert EVERY call with ALL args, verify call count. mock.ANY is BANNED -- construct the expected argument dynamically if needed.

Normalization is LAST RESORT only -- for truly unknowable values (random UUIDs, OS-assigned PIDs, memory addresses). Never use normalization to avoid constructing a complete expected value. </CRITICAL>

Tests must validate CONTENT, not just EXISTENCE. Every assertion must answer: "If the value was garbage, would this catch it?"

Reference: Read patterns/assertion-quality-standard.md for the full Assertion Strength Ladder, Full Assertion Principle, Bare Substring Problem analysis, and Broken Implementation Test. Every assertion must be Level 4+ on the ladder.

Rules

Rule BANNED Pattern CORRECT Pattern
No existence-only assertions assert len(result) > 0 (BANNED Level 1) assert result == [expected_item_1, expected_item_2]
No count-only assertions assert len(result) == 3 (BANNED Level 2) assert result == [item_1, item_2, item_3]
No none-checks without content assert response is not None (BANNED Level 1) assert response == expected_response
No file existence-only checks assert output_file.exists() (BANNED Level 1) assert output_file.read_text() == "expected content"
No mock.ANY ever assert_called_with(mock.ANY, mock.ANY) (BANNED -- proves nothing) assert_called_with("expected_arg", expected_obj) (construct dynamically if needed)
Assert every mock call, all args mock_fn.assert_called() or mock_fn.assert_called_once() (no argument check) mock_fn.assert_has_calls([call(arg1, arg2), ...]) + verify call_count
No partial checks on any output assert "key" in result (BANNED Level 2) assert result == {"key": "expected_value", ...} (construct dynamically for dynamic output)
Every field of every object assert result.status == "ok" (BANNED Level 3 without justification) assert result == ExpectedObject(status="ok", data=..., meta=...)
No bare substring on string output assert "data" in output (BANNED Level 2) assert output == expected (exact equality)
No multiple partials as substitute assert "foo" in r and "bar" in r (BANNED: still partial) assert r == expected_complete_string
No tautological assertions assert result == func(same_input) (BANNED: tests nothing) Compute expected value independently

Pychoir Matchers

Pychoir matchers (including custom subclasses) are the ONE exception to the mock.ANY ban. Use them ONLY when the value genuinely cannot be known ahead of time (timestamps, UUIDs, auto-incremented IDs). Each use requires a stated justification in a comment.

python
# GOOD: UUID genuinely unknowable, justification stated
from pychoir import IsInstance
assert result == {"id": IsInstance(str), "name": "expected_name"}  # id is server-generated UUID

# BAD: Using matcher to avoid computing expected value
assert result == {"count": IsInstance(int), "items": IsInstance(list)}  # Lazy - compute the expected values

ESCAPE Analysis (Mandatory Per Test Function)

Before marking ANY test function complete, fill in this template:

ESCAPE: [test_function_name]
  CLAIM: What does this test claim to verify?
  PATH:  What code actually executes?
  CHECK: What do the assertions verify?
  MUTATION: For each assertion, name the specific production code mutation it catches.
  ESCAPE: What specific broken implementation would still pass this test?
  IMPACT: What breaks in production if that broken implementation ships?

The MUTATION field is a forcing function: for each assertion in the test, you must name a specific, plausible production code change that would cause that assertion to fail. If you cannot name one, the assertion is too weak and must be strengthened before proceeding. This is the Broken Implementation Test from patterns/assertion-quality-standard.md.

The ESCAPE field must describe a SPECIFIC broken implementation, not a generic statement. "Nothing reasonable" IS valid when justified, but requires explanation of why the assertions are comprehensive enough.

Red flags in ESCAPE analysis:

  • ESCAPE field says "none" or "nothing" without justification
  • ESCAPE field describes something the test DOES catch (means you didn't think hard enough)
  • ESCAPE field is copy-pasted between tests (each test has unique escape paths)

Evidence Requirements

Claim Required Evidence
"Test works" Observed failure output with expected message
"Feature complete" All tests pass, watched each fail first
"Refactor safe" Tests stayed green throughout

Anti-Patterns

  • Code before test
  • Test passes immediately (without watching it fail)
  • Can't explain why test failed
  • "Just this once" / "already manually tested"
  • "Keep as reference" / "adapt existing"
  • "Tests after achieve same goals"
  • "TDD is dogmatic, being pragmatic"
  • "It's about spirit not ritual"
  • "Already spent X hours, deleting is wasteful"
  • "This is different because..."
  • Tests added "later" / tests after implementation
  • Rationalizing "I already manually tested it"
  • "Need to explore first" (throw away exploration, start with TDD) </FORBIDDEN>

Common Rationalizations

Excuse Reality
"Too simple to test" Simple code breaks. Test takes 30 seconds.
"I'll test after" Tests passing immediately prove nothing.
"Tests after achieve same goals" Tests-after = "what does this do?" Tests-first = "what should this do?" You test what you built, not what's required. You verify remembered edge cases, not discovered ones.
"Already manually tested" Ad-hoc is not systematic. No record, can't re-run. Forgot cases are the ones that break under pressure.
"Deleting X hours is wasteful" Sunk cost fallacy. Two paths: delete and rewrite with TDD (X more hours, high confidence) OR keep and add tests after (30 min, low confidence, likely bugs).
"Keep as reference, write tests first" You'll adapt it. That's testing after. Delete means delete.
"Need to explore first" Fine. Throw away exploration, start with TDD.
"Test hard = design unclear" Listen to test. Hard to test = hard to use.
"TDD will slow me down" TDD is faster than debugging. Finds bugs before commit, prevents regressions, documents behavior, enables refactoring.
"Manual test faster" Manual doesn't prove edge cases. You'll re-test every change.
"Existing code has no tests" You're improving it. Add tests for existing code.

Self-Check

Before marking complete:

  • Every function has test
  • Watched each test fail before implementing
  • Failed for expected reason (feature missing, not typo)
  • Wrote minimal code to pass
  • All tests pass, output pristine
  • Tests use real code (mocks only if unavoidable)
  • Edge cases and errors covered
  • Every assertion validates CONTENT, not just existence/count
  • Every assertion is Level 4+ on the Assertion Strength Ladder (patterns/assertion-quality-standard.md)
  • ESCAPE analysis completed for every test function (including MUTATION field)
  • Every assertion has a named mutation that would cause it to fail
  • No mock.ANY in assertions (use pychoir matchers with justification comment instead)
  • Every mock call asserted with ALL arguments; call count verified
  • No len() > 0 or len() == N without content verification
  • No bare substring checks on string output (dynamic content is no excuse -- construct full expected value)

If ANY unchecked: Skipped TDD. Start over.

When Stuck

Problem Solution
Don't know how to test Write wished-for API. Assertion first. Ask human.
Test too complicated Design too complicated. Simplify interface.
Must mock everything Code too coupled. Dependency injection.
Test setup huge Extract helpers. Still complex? Simplify design.

Bug Fix Pattern

Bug: Empty email accepted

RED

typescript
test('rejects empty email', async () => {
  const result = await submitForm({ email: '' });
  expect(result.error).toBe('Email required');
});

Verify RED

bash
$ npm test
FAIL: expected 'Email required', got undefined

GREEN

typescript
function submitForm(data: FormData) {
  if (!data.email?.trim()) {
    return { error: 'Email required' };
  }
  // ...existing logic
}

Verify GREEN

bash
$ npm test
PASS

REFACTOR Extract validation for multiple fields if needed.

Never fix bugs without a test.

Testing Anti-Patterns

When adding mocks or test utilities, avoid common pitfalls:

Anti-Pattern Problem Solution
Testing mock behavior Proves mock works, not code Use real dependencies when possible
Test-only methods Production code polluted for tests Refactor design for testability
Blind mocking Don't understand what's mocked Trace dependency chain first
Over-mocking Tests pass but behavior broken Mock boundaries only (external dependencies: network, DB, filesystem), not internals

Test Speed & Scope

Fast tests enable tight red-green-refactor cycles. Slow tests break flow.

  • Isolate expensive resources: Mock GPU, network, and DB calls in unit tests. Real resources belong in integration tests only.
  • Smallest possible inputs: 4x4 matrices, not 1024x1024. Save large inputs for performance/integration tests.
  • Never sleep in tests: Poll with short intervals, or mock the time-dependent component.
  • Lightweight fixtures: If a fixture takes longer than the test itself, it is too heavy for a unit test.

Apply marks proactively when writing new tests. A test that calls a GPU kernel is a GPU test even if it is fast today. Common marks: slow, gpu/hardware, network/external, integration, smoke.

Scope test runs to changes: if src/auth/login.py changed, run tests/test_login.py, not the entire suite. Run the full suite once at the end of a work unit, not after every edit.

Final Rule

Production code -> test exists and failed first
Otherwise -> not TDD

No exceptions without your human partner's permission.

<FINAL_EMPHASIS> The test must fail first. You must watch it fail. The code must be minimal. There are no shortcuts. Every rationalization is a trap. Delete code written before tests. Start over with TDD. </FINAL_EMPHASIS>

Expand your agent's capabilities with these related and highly-rated skills.

axiomantic/spellbook

spellbook-auditing

Meta-audit skill for spellbook development. Spawns parallel subagents to factcheck docs, optimize instructions, find token savings, and identify MCP candidates. Produces actionable report.

5 2
Explore
axiomantic/spellbook

documentation-updates

Use after modifying library skills, library commands, or agents to ensure CHANGELOG, README, and docs are updated

5 2
Explore
axiomantic/spellbook

project-encyclopedia

[DEPRECATED] Use project-level AGENTS.md files instead. Previously used for first-session codebase onboarding and persistent glossary creation.

5 2
Explore
axiomantic/spellbook

reviewing-impl-plans

Use when reviewing implementation plans before execution. Triggers: 'is this plan solid', 'review the plan', 'check before I start building', 'anything missing from this plan', 'will this plan work', 'audit the implementation plan'. NOT for: reviewing design documents (use reviewing-design-docs) or creating plans (use writing-plans).

5 2
Explore
axiomantic/spellbook

session-resume

Session resume protocol and session repairs handling. Loaded when spellbook_session_init returns resume_available: true, or when session_init returns a repairs array. Triggers: 'resume', 'continue', 'where were we', session resume, session repairs.

5 2
Explore
axiomantic/spellbook

brainstorming

Use when exploring design approaches, generating ideas, or making architectural decisions. Triggers: 'explore options', 'what are the tradeoffs', 'how should I approach', 'let's think through', 'sketch out an approach', 'I need ideas for', 'how would you structure', 'what are my options'. Also invoked by develop when design decisions are needed.

5 2
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results