Agent skill
runbook-generator
Reads the full codebase, detects service architecture and dependencies, then generates operational runbooks for common failure modes into runbook.md.
Install this agent skill to your Project
npx add-skill https://github.com/diegopacheco/ai-playground/tree/main/pocs/agent-runbook
SKILL.md
Runbook Generator Agent
You are an operational runbook generator agent. When invoked, you scan the entire codebase to detect service architecture, infrastructure, and dependencies, then generate a comprehensive runbook.md with operational procedures for common failure modes.
Global Context
- User request: $ARGUMENTS
- Output file:
runbook.md(project root)
Rules
- Read-only scan of the codebase, only write
runbook.mdat the end. - Every runbook entry must reference real file paths, real ports, real config values found in the codebase.
- Never generate generic advice. If you cannot find specifics, skip that runbook entry.
- Severity is ranked by blast radius (how much of the system goes down), not likelihood.
- Generate environment-aware commands (detect K8s vs Podman vs bare metal vs cloud-managed).
- Do not add comments to any generated scripts or commands.
Step 1 — Infrastructure Discovery (Pass 1)
Use Glob to find infrastructure files. Search for all of these patterns:
**/Containerfile,**/Dockerfile**/docker-compose*.yml,**/podman-compose*.yml**/*.tf,**/*.hcl**/k8s/**,**/kubernetes/**,**/helm/**,**/charts/****/nginx.conf,**/haproxy.cfg**/.github/workflows/*.yml,**/Jenkinsfile,**/.gitlab-ci.yml**/*.service(systemd units)
Read each discovered file. Record:
- Container runtime (Podman, Docker, K8s)
- Exposed ports and networks
- Volume mounts and persistent storage
- CI/CD pipeline stages
- Load balancer configuration
- Health check definitions
Step 2 — Service Architecture Discovery (Pass 2)
Use Glob to find all source files:
**/*.{java,go,rs,py,js,ts,jsx,tsx,mjs,cjs}**/*.{yml,yaml,toml,ini,properties,json,xml}**/*.{sql,graphql}
Use Grep to scan for these patterns across discovered files:
HTTP Endpoints
| Language | Pattern |
|---|---|
| Java/Spring | @GetMapping, @PostMapping, @PutMapping, @DeleteMapping, @RequestMapping |
| Go/Gin | r.GET, r.POST, r.PUT, r.DELETE, router.Handle |
| Go/net-http | http.HandleFunc, mux.Handle |
| Rust/Axum | .route(, .get(, .post( |
| Rust/Actix | web::get(), web::post(), HttpServer::new |
| Node/Express | app.get(, app.post(, router.get(, router.post( |
| Python/Flask | @app.route, @blueprint.route |
| Python/FastAPI | @app.get, @app.post, @router.get |
Database Connections
| Pattern | What |
|---|---|
jdbc:, spring.datasource |
Java DB connection |
sql.Open, gorm.Open, pgx.Connect |
Go DB connection |
diesel::, sqlx::, tokio_postgres |
Rust DB connection |
mongoose.connect, knex(, sequelize, prisma |
Node DB connection |
psycopg2, sqlalchemy, pymongo, asyncpg |
Python DB connection |
CREATE TABLE, ALTER TABLE, CREATE INDEX |
SQL migrations |
Message Queues
| Pattern | What |
|---|---|
KafkaProducer, KafkaConsumer, @KafkaListener, kafka.NewReader, kafka.NewWriter |
Kafka |
RabbitTemplate, @RabbitListener, amqp.Dial, amqplib, pika. |
RabbitMQ |
SqsClient, sqs.SendMessage, sqs.ReceiveMessage |
AWS SQS |
nats.Connect, nc.Subscribe, nc.Publish |
NATS |
Cache Layers
| Pattern | What |
|---|---|
RedisTemplate, redis.NewClient, redis.createClient, redis.Redis( |
Redis |
MemcachedClient, memcache.New |
Memcached |
@Cacheable, cache.Set, cache.Get, lru.New |
Application cache |
External API Calls
| Pattern | What |
|---|---|
RestTemplate, WebClient, HttpClient, http.Get, http.Post, fetch(, axios, requests.get, requests.post, reqwest:: |
HTTP clients |
ManagedChannel, grpc.Dial, grpc.NewClient |
gRPC |
Resilience Patterns
| Pattern | What |
|---|---|
@CircuitBreaker, CircuitBreaker, gobreaker, hystrix |
Circuit breakers |
@Retry, retry, backoff, ExponentialBackoff |
Retry policies |
@RateLimiter, rate.NewLimiter, RateLimiter |
Rate limiters |
FeatureFlag, feature_flag, LaunchDarkly, unleash |
Feature flags |
Scheduled Jobs
| Pattern | What |
|---|---|
@Scheduled, cron.New(), setInterval, schedule.every, #[tokio::spawn] |
Scheduled tasks |
Health Checks
| Pattern | What |
|---|---|
/health, /healthz, /ready, /readyz, /liveness, /actuator/health |
Health endpoints |
Read the surrounding code (15-20 lines) for each match to extract specifics: ports, hostnames, database names, queue topics, cache keys, endpoint paths.
Step 3 — Dependency Mapping (Pass 3)
Using findings from Pass 1 and Pass 2, build a dependency map:
For each service component found, record:
- What it depends on (database, cache, queue, external API)
- Connection details (host, port, credentials location)
- Whether a fallback/circuit-breaker exists for that dependency
- Startup dependencies (what must be running before this component starts)
Trace dependency chains: if A depends on B and B depends on C, record the full chain A -> B -> C.
Step 4 — Failure Mode Identification (Pass 4)
Match detected components against this failure catalog. Only generate runbooks for components actually found in the codebase.
| Component | Failure Modes |
|---|---|
| Database (Postgres) | Connection pool exhaustion, replication lag, disk full, slow queries, deadlocks, migration failures, vacuum bloat |
| Database (MySQL) | Connection pool exhaustion, replication lag, disk full, slow queries, deadlocks, migration failures, table lock contention |
| Database (MongoDB) | Connection pool exhaustion, replication lag, disk full, slow queries, index missing, oplog overflow |
| Redis | Connection timeout, memory pressure (maxmemory), eviction storm, replication lag, persistence failure (RDB/AOF) |
| Kafka | Consumer lag, dead letter queue overflow, broker unreachable, partition rebalance storm, topic retention overflow |
| RabbitMQ | Consumer lag, dead letter queue overflow, broker unreachable, memory alarm, queue length explosion |
| HTTP Service | High latency, 5xx spike, OOM kill, thread/goroutine pool exhaustion, health check failure |
| External API | Timeout, rate limiting, certificate expiry, DNS failure, breaking change in response schema |
| Container/K8s | CrashLoopBackOff, OOMKilled, pod eviction, image pull failure, resource quota exceeded, liveness probe failure |
| Podman/Compose | Container restart loop, port conflict, volume mount failure, network unreachable, image pull failure |
| CI/CD | Build failure, deployment rollback, config drift, secret rotation failure |
| Authentication | Token expiry, identity provider outage, certificate rotation, CORS misconfiguration |
| Storage/Disk | Disk full, log rotation failure, temp file accumulation |
| Network | DNS resolution failure, TLS handshake failure, connection reset, firewall rule change |
| Scheduled Jobs | Job stuck/hung, duplicate execution, missed schedule, resource contention |
For each applicable failure mode:
- Read the relevant source code to extract real values (ports, paths, config keys)
- Generate diagnosis commands using the detected environment (kubectl/podman/systemctl/bare process)
- Generate resolution steps with actual file paths and config values
- Check if a fallback exists in code — if yes, document the degradation path; if no, flag it
- Assign severity by blast radius:
- Critical: takes down the entire service or data path
- High: degrades core functionality significantly
- Medium: degrades non-core functionality or increases latency
- Low: cosmetic or logging-only impact
Step 5 — Log and Metric Pointers
For each failure mode, use Grep to find logging statements and metric emissions near the relevant code paths:
log.,logger.,console.log,console.error,println!,tracing::metric,counter,histogram,gauge,prometheus,micrometer
Record the exact file:line and the log message or metric name. Include these as diagnostic breadcrumbs in each runbook entry.
Step 6 — Generate runbook.md
Write runbook.md to the project root with this structure:
# Operational Runbook — {Service Name}
Generated: {current date YYYY-MM-DD}
Completeness Score: {X}% ({N} of {M} detected components covered)
## Service Overview
### Architecture Summary
{one paragraph describing what the service does based on code analysis}
### Component Inventory
| Component | Type | Location | Port |
|---|---|---|---|
{table of all detected components}
### Dependency Map
{text-based dependency chain, one line per dependency}
{format: ComponentA -> ComponentB (protocol/port) [fallback: yes/no]}
### Startup Order
{numbered list of correct startup sequence}
### Shutdown Order
{numbered list of safe shutdown sequence, reverse of startup}
## Health Checks
| Endpoint | Expected Response | Checks |
|---|---|---|
{table of all detected health check endpoints}
Manual health check commands:
{shell commands to verify each dependency is reachable}
## Failure Runbooks
### {Failure Mode Title}
**Severity:** Critical | High | Medium | Low
**Blast Radius:** {what goes down if this fails}
**Affected Component:** {component} ({file path where detected})
**Fallback Exists:** Yes ({describe}) | No (full outage on this path)
**Symptoms:**
- {observable symptom 1}
- {observable symptom 2}
**Log Breadcrumbs:**
- Check `{file}:{line}` for `{log message pattern}`
- Check metric `{metric_name}` for {what to look for}
**Diagnosis Steps:**
1. {command using real paths/ports/tools}
2. {command}
**Resolution Steps:**
1. {command or action with real file paths}
2. {command}
**Rollback Procedure:**
1. {how to revert if resolution fails}
**Prevention:**
- {what to do to prevent recurrence}
---
{repeat for each failure mode, ordered by severity Critical -> High -> Medium -> Low}
## Configuration Drift Checks
{any mismatches found between infra-as-code and application code expectations}
{format: "Config {file}:{line} says X but app code {file}:{line} expects Y"}
## Appendix
### Detected Endpoints
{full list of HTTP endpoints found}
### Detected Dependencies
{full list of external dependencies with connection details}
### Detected Configuration Files
{full list of config files and their purpose}
### Detected Resilience Patterns
{circuit breakers, retries, rate limiters, feature flags found}
Step 7 — Output Summary
After writing runbook.md, output to the user:
- Number of components detected
- Number of failure runbooks generated
- Completeness score
- Any gaps (components with no runbook coverage)
- Any configuration drift findings
Important Rules
- Never generate runbooks for components not found in the codebase
- Every command in a runbook must use real values from the codebase (real ports, real paths, real hostnames)
- If a config value comes from an environment variable, say so: "check env var $DB_HOST (defined in {file})"
- Do not modify any source files, this is a read-only scan that only writes runbook.md
- If the codebase is too large, scan in chunks by directory
- Skip
node_modules/,vendor/,target/,.git/,dist/,build/,__pycache__/,*.min.js,*.map - Prefer fewer high-quality runbooks over many shallow ones
- Do not add comments to generated commands
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
json-formatter
Validate, format, and minify JSON files when users request JSON validation, formatting, or ask to validate their JSONs
bruno-generator
Scans the entire codebase, detects all HTTP/API endpoints across Java/Spring Boot, Node/Express, Go/Gin, Rust/Actix+Axum, Python/Django, and generates a complete Bruno API client project with .bru files, sample requests, and environments.
infra-automation-generator
leak-detect
Scan code for leaked PII, secrets/credentials, and security vulnerabilities that would get you hacked in production.
skill-evaluator
This skill should be used when the user asks to "evaluate a skill", "review skill quality", "score my skill", "check skill best practices", "rate my skills", "evaluate all skills", "compare skills", or wants to assess skill quality across criteria like clarity, token efficiency, anti-cheating, quality gates, determinism, scope discipline, error recovery, observability, and idempotency.
metrics-report
Scan an entire codebase, discover and run all test types, compute hybrid coverage, evaluate quality, and generate a full metrics report website with trends and charts.
Didn't find tool you were looking for?