Agent skill
nw-formal-verification-tlaplus
TLA+ and PlusCal for specifying distributed system invariants. Decision heuristics for when formal verification adds value, key patterns, state explosion management, and alternatives comparison.
Install this agent skill to your Project
npx add-skill https://github.com/nWave-ai/nWave/tree/main/nWave/skills/nw-formal-verification-tlaplus
SKILL.md
Formal Verification with TLA+
When to Recommend Formal Verification
Decision Tree
Is the system distributed or concurrent?
|
+-- No --> Complex state machine with high failure cost?
| +-- No --> NOT cost-effective. Use property-based testing.
| +-- Yes --> CONSIDER TLA+
|
+-- Yes --> Consensus, coordination, or distributed transactions?
| +-- Yes --> RECOMMEND TLA+
| +-- No --> Could concurrency bug cause data loss or safety issues?
| +-- Yes --> RECOMMEND TLA+
| +-- No --> OFFER as option
Strong Indicators (Recommend)
| Domain | Why TLA+ Adds Value | Evidence |
|---|---|---|
| Distributed consensus (Paxos, Raft) | Subtle interleaving bugs in leader election | Raft TLA+ spec ~400 lines, found implementation bugs |
| Financial distributed transactions | Atomicity violations cause monetary loss | AWS DynamoDB replication verified |
| Leader election, distributed locking | Split-brain, deadlock, stale-lock | AWS lock manager verified |
| Eventual consistency / CRDTs | Convergence proofs required | TLA+ CRDT framework verifies SEC |
| Safety-critical state machines | Regulatory requirements | DO-178C, CENELEC recognize formal methods |
| Multi-party coordination (sagas, 2PC) | Compensation ordering, partial failure | 2PC is canonical TLA+ example |
| Data replication protocols | Ordering, consistency under failure | Elasticsearch, MongoDB, Cosmos DB verified |
When NOT to Use
- Simple CRUD (bugs are in implementation, not design)
- Single-process without complex state machines
- Prototypes/MVPs (design will change before verification completes)
- Performance optimization (TLA+ models correctness, not performance)
Cost-Benefit Reference
- Learning curve: 2-3 weeks to useful results (AWS engineers, all levels)
- Typical spec effort: 2-4 weeks part-time for a distributed protocol
- ROI highest when: bug cost is high, system is long-lived, protocol is novel, concurrency testing is impractical
Core Concepts for Architects
What TLA+ Specifies
TLA+ describes what a system should do (allowed behaviors), not how to implement it. Specifications are mathematical objects checked for correctness before any code exists.
Safety vs. Liveness
| Property Type | Meaning | Expression | Example |
|---|---|---|---|
| Safety | Nothing bad happens | Invariant: predicate true in every reachable state | "Two processes never hold same lock" |
| Liveness | Something good eventually happens | Temporal: <> (eventually), []<> (infinitely often) |
"Every request eventually gets response" |
Safety violations produce counterexample traces (the debugging artifact). Liveness requires fairness conditions.
PlusCal vs. Raw TLA+
PlusCal compiles to TLA+ with programming-like syntax. Start with PlusCal for first 2-3 specs, then learn raw TLA+ for cases PlusCal cannot express.
Key PlusCal constructs: variables (state) | labels (atomic action boundaries) | either/or (nondeterministic choice) | await (blocking) | process \in 1..N (concurrent processes) | fair process (weak fairness)
Labels define concurrency granularity: everything between two labels is one atomic step. Two processes interleave only at label boundaries.
State Explosion Management
State space grows exponentially: (states per node)^(nodes) x (message permutations).
Containment Strategies
| Strategy | Technique | Impact |
|---|---|---|
| Bound parameters | Start with 2-3 nodes, 2-4 messages | Most bugs appear at small N |
| Symmetry reduction | SYMMETRY Permutations(Nodes) |
Up to N! reduction |
| Reduce labels | Merge labels where fine-grained atomicity unnecessary | Orders of magnitude |
| State constraints | CONSTRAINT Len(log[n]) < MaxLogLength |
Prune uninteresting states |
| Abstraction | Model protocol not implementation (TCP -> message set) | Dramatic reduction |
| Decomposition | Multiple focused specs, not one monolith | Each independently checkable |
| Progressive refinement | 2 nodes -> 3 nodes -> add failures -> add liveness | Incremental verification |
| Simulation mode | java -jar tla2tools.jar -simulate -depth 100 |
Trades completeness for speed |
Memory and Time Budgets
| Unique States | Expected Time | Memory | Approach |
|---|---|---|---|
| < 10K | Seconds | < 1 GB | Exhaustive, single thread |
| 10K - 1M | Minutes | 1-4 GB | Exhaustive, -workers auto |
| 1M - 100M | Hours | 4-32 GB | Exhaustive with constraints |
| 100M - 1B | Days | 32-64 GB | Large instance or simulation |
| > 1B | Weeks | 60+ GB | Simulation, TLAPS, or decompose |
Estimation Before Running
- Count distinct variable values in model
- Multiply domains together for baseline
- Start TLC with smallest parameters, observe state count
- Extrapolate: doubling a parameter typically squares or cubes the space
Key Specification Patterns
Two-Phase Commit (2PC)
- Variables: rmState, tmState, tmPrepared, msgs
- Safety: no RM commits while another aborts (
Consistency) - State space: 3 RMs ~718 states, 5 RMs ~21,488 states
- Common mistake: not modeling RM spontaneous abort or unreliable network
Distributed Consensus (Raft)
- Variables: currentTerm, votedFor, log, state, votesGranted, msgs
- Safety: at most one leader per term (
ElectionSafety) - Safety: logs with same index+term are identical (
LogMatching) - State space: 3 nodes, MaxTerm=2 ~10K-100K states
Saga (Compensating Transactions)
- Variables: stepState, sagaState, compensateIdx
- Safety: steps execute in order, compensations in reverse (
OrderInvariant) - Safety: no completed steps remain after abort (
CompensationComplete) - Common mistake: not enforcing reverse compensation order
Distributed Lock with Lease
- Variables: lockHolder, leaseExpiry, clock, nodeState
- Safety: at most one holder (
MutualExclusion) - Models crash (node loses awareness) and lease expiry
- Common mistake: not distinguishing server-side lock state from node belief
CRDT Convergence (G-Counter)
- Variables: counters (vector per node)
- Safety: counters monotonically non-decreasing
- Liveness: all nodes eventually converge after merge
- Common mistake: merge not commutative, associative, and idempotent
Alternatives Comparison
| Tool | Best For | Learning | Distributed Systems | Temporal Properties |
|---|---|---|---|---|
| TLA+/PlusCal | Distributed protocols, consensus | 2-3 weeks | Excellent | Native |
| Alloy | Data models, structural properties | 1-2 weeks | Adequate | Limited (Alloy 6) |
| Property-Based Testing | Implementation correctness | Hours-days | With stateful testing | None |
| XState/Statecharts | UI workflows, single-process | Days | Not applicable | None |
| Session Types/Scribble | Communication patterns | Moderate | Good (message patterns) | Implicit |
| TLAPS (proofs) | Critical certification | Months | Excellent | Full |
Combined Workflow (TLA+ + PBT)
- Write TLA+ spec during DESIGN wave; identify invariants
- Model-check with TLC to verify design
- Implement code during DELIVER wave
- Reuse TLA+ invariants as PBT properties
- PBT verifies implementation conforms to verified design
Architecture Decision Record Template
## Decision: Use TLA+ for [Component Name]
### Context
[Component] implements [protocol] with [N] participants
and [concurrency/distribution characteristics].
### Problem
Informal reasoning about [failure/interleaving scenario]
is insufficient because [reason].
### Decision
Formally specify [component] in TLA+/PlusCal and verify:
- Safety: [specific invariants]
- Liveness: [specific temporal properties]
### Model Parameters
- Nodes: [2-3 for initial verification]
- Messages: [bounded to N]
- Failure modes: [crash, partition, message loss]
### Estimated Effort
- Specification: [1-2 weeks]
- Model checking: [hours to days]
### Not Modeling (out of scope)
- [performance, serialization, UI, etc.]
Architect's Checklist
- Identify components with concurrency, distribution, or complex state machines
- Determine safety properties (what must NEVER happen)
- Determine liveness properties (what must EVENTUALLY happen)
- Estimate model parameters and state space
- Assess cost-effectiveness vs. alternatives (decision tree above)
- Document decision in ADR with specific invariants and properties
- Scope verification: focused specs per subsystem, not one monolith
Industry Precedent
| Organization | Systems Verified | Outcome |
|---|---|---|
| AWS | 14 projects across 10 systems (DynamoDB, S3, EBS) | Found subtle bugs in every system; management actively encourages adoption |
| Azure Cosmos DB | All 5 consistency levels | Specs became authoritative reference, replaced ambiguous docs |
| MongoDB | Replication, reconfiguration, transactions | Logless reconfig deployed since 4.4, no protocol bugs |
| Elasticsearch | Cluster coordination, data replication | 4 TLA+ specs + Isabelle proofs, open-sourced |
| CockroachDB | Transaction layer | TLA+ specs in repository under docs/tla-plus/ |
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
nw-research
Gathers knowledge from web and files, cross-references across multiple sources, and produces cited research documents. Use when investigating technologies, patterns, or decisions that need evidence backing.
nw-distill
Acceptance test creation methodology for the DISTILL wave. Domain knowledge for the acceptance designer agent: port-to-port principle, prior wave reading, wave-decision reconciliation, graceful degradation, and document back-propagation.
nw-review-output-format
YAML output format and approval criteria for platform design reviews. Load when generating review feedback.
nw-ddd-tactical
Tactical DDD — aggregate design rules, entities, value objects, domain events, repositories, domain services, and anti-pattern detection
nw-infrastructure-and-observability
Infrastructure as Code patterns (Terraform, Kubernetes), observability design (SLOs, metrics, alerting, dashboards), and pipeline security stages. Load when designing infrastructure, observability, or security scanning.
nw-par-critique-dimensions
Platform design review critique dimensions and severity levels. Load when reviewing CI/CD pipelines, infrastructure, deployment strategies, observability, or security designs.
Didn't find tool you were looking for?