nw-sd-framework
4-step system design framework with back-of-envelope estimation, scaling ladder, and common pitfalls
System Design Framework
The 4-Step Process
Every system design follows this structure. Skipping steps is the top mistake.
Step 1: Understand the Problem and Establish Design Scope (3-10 min)
Narrow an impossibly broad question into a tractable problem.
Ask about: users and scale | most important features | read/write ratio | non-functional requirements (latency, availability, consistency) | existing infrastructure | special constraints (mobile-first, offline, regulatory)
Produce: functional requirements (3-5 bullets) | non-functional requirements (scale, latency, availability, consistency model) | capacity estimation (QPS, storage, bandwidth)
Red flags if skipped: designing a system nobody asked for | over-engineering for imaginary scale | missing critical constraints (GDPR, real-time)
Step 2: Propose High-Level Design and Get Buy-In (10-15 min)
Sketch the big picture. Validate before diving deep.
Do: draw architecture diagram (clients, servers, databases, caches, queues) | define API contract (REST/GraphQL/gRPC -- key endpoints) | design data model (entities, relationships, access patterns) | walk through 1-2 core use cases end-to-end | get buy-in: "Does this make sense before I go deeper?"
API patterns: RESTful for CRUD-heavy | GraphQL for flexible client queries | gRPC for internal service-to-service | WebSocket/SSE for real-time
Data model: SQL vs NoSQL based on access patterns, not hype | denormalization trade-offs | partitioning key selection (directly impacts scalability)
Step 3: Design Deep Dive (10-25 min)
Go deep on 2-3 components.
Choose: most technically challenging | most interesting trade-offs | bottleneck components (highest load, most failure-prone)
Depth means: specific algorithms (consistent hashing, Bloom filters) | failure modes and handling | scaling strategy per component | data flow with edge cases | monitoring and operational concerns
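As an example of the depth meant by "specific algorithms", here is a minimal consistent-hashing sketch. It is an illustration under stated assumptions, not a production implementation: the class name `ConsistentHashRing`, the `vnodes` parameter, and the node names are hypothetical, and MD5 is chosen only for its even distribution.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Illustrative consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes      # virtual nodes per physical node (assumed default)
        self._hashes = []         # sorted positions on the ring
        self._owners = {}         # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        # MD5 is used for distribution only, not for security.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            position = self._hash(f"{node}#{i}")
            self._owners[position] = node
            bisect.insort(self._hashes, position)

    def remove_node(self, node):
        self._hashes = [h for h in self._hashes if self._owners[h] != node]
        self._owners = {h: n for h, n in self._owners.items() if n != node}

    def get_node(self, key):
        if not self._hashes:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owners[self._hashes[idx]]


ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.get_node("user:42")  # one of the three node names
```

The property worth naming in a deep dive: removing a node only remaps the keys that node owned, so the rest of the cache stays warm.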
Step 4: Wrap Up (3-5 min)
Cover: summarize design in 2-3 sentences | identify known bottlenecks | what you'd improve with more time | operational concerns (monitoring, alerting, deployment) | future enhancements
Avoid: introducing entirely new components at this stage | second-guessing your design
Back-of-Envelope Estimation
Powers of 2 Reference
| Power of 2 | Approximate value | Data unit |
|---|---|---|
| 10 | 1 Thousand | 1 KB |
| 20 | 1 Million | 1 MB |
| 30 | 1 Billion | 1 GB |
| 40 | 1 Trillion | 1 TB |
| 50 | 1 Quadrillion | 1 PB |
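The values in this table are round-number approximations. The short sketch below shows how far each power of 2 drifts from its decimal counterpart; the helper name `power_of_two_row` is hypothetical.

```python
UNITS = {10: "KB", 20: "MB", 30: "GB", 40: "TB", 50: "PB"}


def power_of_two_row(power):
    """Return (exact value, decimal approximation, error in percent)."""
    exact = 2 ** power
    approx = 10 ** (3 * power // 10)   # 2^10 ~ 10^3, 2^20 ~ 10^6, ...
    error_pct = (exact - approx) / approx * 100
    return exact, approx, round(error_pct, 1)


for power, unit in UNITS.items():
    exact, approx, error_pct = power_of_two_row(power)
    print(f"2^{power} = {exact:,} bytes ~ 1 {unit} ({error_pct}% above {approx:,})")
```

The gap grows from about 2.4% at 1 KB to about 12.6% at 1 PB, which is small enough to ignore in a back-of-envelope estimate.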
Latency Numbers Every Engineer Should Know
| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory reference | 100 ns |
| Compress 1 KB (Zippy) | 10 us |
| Send 2 KB over 1 Gbps | 20 us |
| Read 1 MB from memory | 250 us |
| Round trip within the same datacenter | 500 us |
| Disk seek | 10 ms |
| Read 1 MB from network | 10 ms |
| Read 1 MB from disk | 30 ms |
| CA to Netherlands round trip | 150 ms |
Key takeaways: memory fast, disk slow -- cache aggressively | compress before network send | inter-datacenter trips expensive -- minimize cross-region calls
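To make these takeaways concrete, the sketch below scales the table's per-MB and round-trip figures to a larger workload. The function names are hypothetical, and linear scaling is a simplifying assumption that ignores batching, parallelism, and hardware differences.

```python
# Latencies in microseconds, taken from the table above.
PER_MB_US = {"memory": 250, "network": 10_000, "disk": 30_000}
DATACENTER_ROUND_TRIP_US = 500
CROSS_REGION_ROUND_TRIP_US = 150_000


def sequential_read_seconds(megabytes, medium):
    """Estimated time to read `megabytes` sequentially, assuming linear scaling."""
    return megabytes * PER_MB_US[medium] / 1_000_000


def round_trips_seconds(calls, round_trip_us):
    """Estimated time for `calls` sequential round trips."""
    return calls * round_trip_us / 1_000_000


# Reading ~1 GB (1000 MB): 0.25 s from memory, 10 s over the network, 30 s from disk.
for medium in PER_MB_US:
    print(medium, sequential_read_seconds(1000, medium), "s")

# Ten sequential calls: 0.005 s within a datacenter vs 1.5 s across regions.
print(round_trips_seconds(10, DATACENTER_ROUND_TRIP_US), "s in-datacenter")
print(round_trips_seconds(10, CROSS_REGION_ROUND_TRIP_US), "s cross-region")
```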
Common Estimation Patterns
DAU to QPS: QPS = DAU * actions_per_user / 86400 (seconds per day) | Peak QPS = QPS * 2 (or * 3 for spiky traffic)
Storage: daily = DAU * actions * avg_size | yearly = daily * 365 | 5-year = yearly * 5
Bandwidth: QPS * average_response_size
Servers: servers = Peak QPS / QPS_per_server, where QPS_per_server is roughly: CPU-bound ~hundreds | IO-bound with cache ~thousands | static content ~tens of thousands
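The patterns above translate directly into a few helper functions. This is a sketch, not a prescribed tool: all function and parameter names are hypothetical, and the default peak factor of 2 simply mirrors the rule of thumb above.

```python
import math

SECONDS_PER_DAY = 86_400


def average_qps(dau, actions_per_user):
    """QPS = DAU * actions_per_user / 86400."""
    return dau * actions_per_user / SECONDS_PER_DAY


def peak_qps(avg_qps, peak_factor=2.0):
    """Peak QPS = QPS * 2, or * 3 for spiky traffic."""
    return avg_qps * peak_factor


def storage_bytes(dau, actions_per_user, avg_size_bytes, years=5):
    """Daily, yearly, and multi-year storage in bytes."""
    daily = dau * actions_per_user * avg_size_bytes
    yearly = daily * 365
    return {"daily": daily, "yearly": yearly, "total": yearly * years}


def bandwidth_bytes_per_second(qps, avg_response_bytes):
    """Bandwidth = QPS * average_response_size."""
    return qps * avg_response_bytes


def servers_needed(peak, qps_per_server):
    """Servers = Peak QPS / QPS_per_server, rounded up to whole machines."""
    return math.ceil(peak / qps_per_server)


# Usage with made-up inputs: 10M DAU, 5 actions each, 1000 QPS per server.
qps = average_qps(10_000_000, 5)
print(round(qps), servers_needed(peak_qps(qps), 1000))
```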
Estimation Example: Twitter-like Service
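Assumptions: 150M DAU, 2 tweets and 10 reads per user per day
Write QPS = 150M * 2 / 86400 ~ 3,500
Read QPS = 150M * 10 / 86400 ~ 17,000; Peak (* 3, spiky) ~ 52,000
Storage: 300M tweets * 1 KB + 30M media items * 500 KB (assuming 10% of tweets carry media) ~ 0.3 TB + 15 TB = 15.3 TB/day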
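The arithmetic in this example can be checked in a few lines. The sketch uses decimal units (1 KB = 1000 bytes), takes the 30M media items from the figures above (10% of tweets), and treats the peak as roughly 3x the average read QPS; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
DAU = 150_000_000

write_qps = DAU * 2 / SECONDS_PER_DAY    # ~3,472, rounded to ~3,500
read_qps = DAU * 10 / SECONDS_PER_DAY    # ~17,361, rounded to ~17,000
peak_read_qps = read_qps * 3             # ~52,000 with a 3x peak factor

tweets_per_day = DAU * 2                 # 300M tweets
media_per_day = 30_000_000               # 10% of tweets, as in the example

text_bytes = tweets_per_day * 1_000      # 1 KB per tweet
media_bytes = media_per_day * 500_000    # 500 KB per media item
daily_tb = (text_bytes + media_bytes) / 1e12

print(f"write QPS ~ {write_qps:,.0f}, read QPS ~ {read_qps:,.0f}")
print(f"peak read QPS ~ {peak_read_qps:,.0f}")
print(f"storage ~ {daily_tb:.1f} TB/day, ~ {daily_tb * 365 * 5 / 1000:.1f} PB over 5 years")
```

Media dominates: text accounts for 0.3 TB of the 15.3 TB per day, so the storage design question is really about media.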
Scaling Ladder
Each step solves a specific bottleneck. Never introduce a component without articulating which bottleneck it addresses.
- Load balancer -- distribute traffic across web servers
- Database replication -- primary-replica (master-slave) for read scaling
- Cache layer -- reduce database load (Redis/Memcached)
- CDN -- serve static content from edge
- Stateless web tier -- move session state to shared store
- Database sharding -- horizontal partitioning for write scaling
- Message queue -- decouple components, handle spikes
- Logging, metrics, monitoring -- observability at scale
- Multiple data centers -- geographic redundancy and latency reduction
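As an illustration of one rung, the cache layer is often implemented as a cache-aside read path: check the cache, fall back to the database on a miss, then populate the cache. The sketch below uses an in-process dictionary as a stand-in for Redis or Memcached; `CacheAside`, `load_from_db`, and the TTL default are hypothetical names and values chosen for the example.

```python
import time


class CacheAside:
    """Illustrative cache-aside wrapper around a slow lookup function."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db     # hypothetical callable: key -> value
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}            # key -> (value, expiry time)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1            # served from cache, no database load
            return entry[0]
        self.misses += 1
        value = self._load(key)       # miss or expired entry: go to the database
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop a key after a write so the next read reloads it."""
        self._entries.pop(key, None)
```

The hit and miss counters tie back to the rule above: the cache is justified only if the hit ratio shows it is actually absorbing database reads.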
Common Pitfalls
- Jumping to solutions -- designing before understanding the requirements
- Over-engineering -- adding components for imaginary scale
- Ignoring trade-offs -- every choice has a cost; name it
- SPOF blindness -- always ask "what if this dies?"
- Neglecting data -- the data model drives everything
- Forgetting operations -- a system you can't monitor is one you can't run
- Not doing math -- gut feelings are wrong; estimates keep you honest
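The "what if this dies?" question can also be answered with arithmetic. The sketch below uses the standard formulas for components in series and redundant replicas, assuming failures are independent (a strong assumption in practice); the function names are hypothetical.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(components):
    """All components must be up: multiply their availabilities."""
    return prod(components)


def redundant_availability(replicas):
    """At least one replica must be up: 1 minus the chance that all fail."""
    return 1 - prod(1 - a for a in replicas)


def downtime_minutes_per_year(availability):
    return (1 - availability) * MINUTES_PER_YEAR


# A single 99% server is a SPOF; two independent 99% replicas reach ~99.99%.
single = 0.99
pair = redundant_availability([0.99, 0.99])
print(f"single: {downtime_minutes_per_year(single):,.0f} min/year down")
print(f"pair:   {downtime_minutes_per_year(pair):,.0f} min/year down")
```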