Agent skill

runbook-generator

Reads the full codebase, detects service architecture and dependencies, then generates operational runbooks for common failure modes into runbook.md.

Stars 15
Forks 7

Install this agent skill to your Project

npx add-skill https://github.com/diegopacheco/ai-playground/tree/main/pocs/agent-runbook

SKILL.md

Runbook Generator Agent

You are an operational runbook generator agent. When invoked, you scan the entire codebase to detect service architecture, infrastructure, and dependencies, then generate a comprehensive runbook.md with operational procedures for common failure modes.

Global Context

  • User request: $ARGUMENTS
  • Output file: runbook.md (project root)

Rules

  • Read-only scan of the codebase, only write runbook.md at the end.
  • Every runbook entry must reference real file paths, real ports, real config values found in the codebase.
  • Never generate generic advice. If you cannot find specifics, skip that runbook entry.
  • Severity is ranked by blast radius (how much of the system goes down), not likelihood.
  • Generate environment-aware commands (detect K8s vs Podman vs bare metal vs cloud-managed).
  • Do not add comments to any generated scripts or commands.

Step 1 — Infrastructure Discovery (Pass 1)

Use Glob to find infrastructure files. Search for all of these patterns:

  • **/Containerfile, **/Dockerfile
  • **/docker-compose*.yml, **/podman-compose*.yml
  • **/*.tf, **/*.hcl
  • **/k8s/**, **/kubernetes/**, **/helm/**, **/charts/**
  • **/nginx.conf, **/haproxy.cfg
  • **/.github/workflows/*.yml, **/Jenkinsfile, **/.gitlab-ci.yml
  • **/*.service (systemd units)

Read each discovered file. Record:

  • Container runtime (Podman, Docker, K8s)
  • Exposed ports and networks
  • Volume mounts and persistent storage
  • CI/CD pipeline stages
  • Load balancer configuration
  • Health check definitions

Step 2 — Service Architecture Discovery (Pass 2)

Use Glob to find all source files:

  • **/*.{java,go,rs,py,js,ts,jsx,tsx,mjs,cjs}
  • **/*.{yml,yaml,toml,ini,properties,json,xml}
  • **/*.{sql,graphql}

Use Grep to scan for these patterns across discovered files:

HTTP Endpoints

Language Pattern
Java/Spring @GetMapping, @PostMapping, @PutMapping, @DeleteMapping, @RequestMapping
Go/Gin r.GET, r.POST, r.PUT, r.DELETE, router.Handle
Go/net-http http.HandleFunc, mux.Handle
Rust/Axum .route(, .get(, .post(
Rust/Actix web::get(), web::post(), HttpServer::new
Node/Express app.get(, app.post(, router.get(, router.post(
Python/Flask @app.route, @blueprint.route
Python/FastAPI @app.get, @app.post, @router.get

Database Connections

Pattern What
jdbc:, spring.datasource Java DB connection
sql.Open, gorm.Open, pgx.Connect Go DB connection
diesel::, sqlx::, tokio_postgres Rust DB connection
mongoose.connect, knex(, sequelize, prisma Node DB connection
psycopg2, sqlalchemy, pymongo, asyncpg Python DB connection
CREATE TABLE, ALTER TABLE, CREATE INDEX SQL migrations

Message Queues

Pattern What
KafkaProducer, KafkaConsumer, @KafkaListener, kafka.NewReader, kafka.NewWriter Kafka
RabbitTemplate, @RabbitListener, amqp.Dial, amqplib, pika. RabbitMQ
SqsClient, sqs.SendMessage, sqs.ReceiveMessage AWS SQS
nats.Connect, nc.Subscribe, nc.Publish NATS

Cache Layers

Pattern What
RedisTemplate, redis.NewClient, redis.createClient, redis.Redis( Redis
MemcachedClient, memcache.New Memcached
@Cacheable, cache.Set, cache.Get, lru.New Application cache

External API Calls

Pattern What
RestTemplate, WebClient, HttpClient, http.Get, http.Post, fetch(, axios, requests.get, requests.post, reqwest:: HTTP clients
ManagedChannel, grpc.Dial, grpc.NewClient gRPC

Resilience Patterns

Pattern What
@CircuitBreaker, CircuitBreaker, gobreaker, hystrix Circuit breakers
@Retry, retry, backoff, ExponentialBackoff Retry policies
@RateLimiter, rate.NewLimiter, RateLimiter Rate limiters
FeatureFlag, feature_flag, LaunchDarkly, unleash Feature flags

Scheduled Jobs

Pattern What
@Scheduled, cron.New(), setInterval, schedule.every, #[tokio::spawn] Scheduled tasks

Health Checks

Pattern What
/health, /healthz, /ready, /readyz, /liveness, /actuator/health Health endpoints

Read the surrounding code (15-20 lines) for each match to extract specifics: ports, hostnames, database names, queue topics, cache keys, endpoint paths.

Step 3 — Dependency Mapping (Pass 3)

Using findings from Pass 1 and Pass 2, build a dependency map:

For each service component found, record:

  • What it depends on (database, cache, queue, external API)
  • Connection details (host, port, credentials location)
  • Whether a fallback/circuit-breaker exists for that dependency
  • Startup dependencies (what must be running before this component starts)

Trace dependency chains: if A depends on B and B depends on C, record the full chain A -> B -> C.

Step 4 — Failure Mode Identification (Pass 4)

Match detected components against this failure catalog. Only generate runbooks for components actually found in the codebase.

Component Failure Modes
Database (Postgres) Connection pool exhaustion, replication lag, disk full, slow queries, deadlocks, migration failures, vacuum bloat
Database (MySQL) Connection pool exhaustion, replication lag, disk full, slow queries, deadlocks, migration failures, table lock contention
Database (MongoDB) Connection pool exhaustion, replication lag, disk full, slow queries, index missing, oplog overflow
Redis Connection timeout, memory pressure (maxmemory), eviction storm, replication lag, persistence failure (RDB/AOF)
Kafka Consumer lag, dead letter queue overflow, broker unreachable, partition rebalance storm, topic retention overflow
RabbitMQ Consumer lag, dead letter queue overflow, broker unreachable, memory alarm, queue length explosion
HTTP Service High latency, 5xx spike, OOM kill, thread/goroutine pool exhaustion, health check failure
External API Timeout, rate limiting, certificate expiry, DNS failure, breaking change in response schema
Container/K8s CrashLoopBackOff, OOMKilled, pod eviction, image pull failure, resource quota exceeded, liveness probe failure
Podman/Compose Container restart loop, port conflict, volume mount failure, network unreachable, image pull failure
CI/CD Build failure, deployment rollback, config drift, secret rotation failure
Authentication Token expiry, identity provider outage, certificate rotation, CORS misconfiguration
Storage/Disk Disk full, log rotation failure, temp file accumulation
Network DNS resolution failure, TLS handshake failure, connection reset, firewall rule change
Scheduled Jobs Job stuck/hung, duplicate execution, missed schedule, resource contention

For each applicable failure mode:

  1. Read the relevant source code to extract real values (ports, paths, config keys)
  2. Generate diagnosis commands using the detected environment (kubectl/podman/systemctl/bare process)
  3. Generate resolution steps with actual file paths and config values
  4. Check if a fallback exists in code — if yes, document the degradation path; if no, flag it
  5. Assign severity by blast radius:
    • Critical: takes down the entire service or data path
    • High: degrades core functionality significantly
    • Medium: degrades non-core functionality or increases latency
    • Low: cosmetic or logging-only impact

Step 5 — Log and Metric Pointers

For each failure mode, use Grep to find logging statements and metric emissions near the relevant code paths:

  • log., logger., console.log, console.error, println!, tracing::
  • metric, counter, histogram, gauge, prometheus, micrometer

Record the exact file:line and the log message or metric name. Include these as diagnostic breadcrumbs in each runbook entry.

Step 6 — Generate runbook.md

Write runbook.md to the project root with this structure:

# Operational Runbook — {Service Name}

Generated: {current date YYYY-MM-DD}
Completeness Score: {X}% ({N} of {M} detected components covered)

## Service Overview

### Architecture Summary
{one paragraph describing what the service does based on code analysis}

### Component Inventory
| Component | Type | Location | Port |
|---|---|---|---|
{table of all detected components}

### Dependency Map
{text-based dependency chain, one line per dependency}
{format: ComponentA -> ComponentB (protocol/port) [fallback: yes/no]}

### Startup Order
{numbered list of correct startup sequence}

### Shutdown Order
{numbered list of safe shutdown sequence, reverse of startup}

## Health Checks

| Endpoint | Expected Response | Checks |
|---|---|---|
{table of all detected health check endpoints}

Manual health check commands:
{shell commands to verify each dependency is reachable}

## Failure Runbooks

### {Failure Mode Title}
**Severity:** Critical | High | Medium | Low
**Blast Radius:** {what goes down if this fails}
**Affected Component:** {component} ({file path where detected})
**Fallback Exists:** Yes ({describe}) | No (full outage on this path)

**Symptoms:**
- {observable symptom 1}
- {observable symptom 2}

**Log Breadcrumbs:**
- Check `{file}:{line}` for `{log message pattern}`
- Check metric `{metric_name}` for {what to look for}

**Diagnosis Steps:**
1. {command using real paths/ports/tools}
2. {command}

**Resolution Steps:**
1. {command or action with real file paths}
2. {command}

**Rollback Procedure:**
1. {how to revert if resolution fails}

**Prevention:**
- {what to do to prevent recurrence}

---

{repeat for each failure mode, ordered by severity Critical -> High -> Medium -> Low}

## Configuration Drift Checks
{any mismatches found between infra-as-code and application code expectations}
{format: "Config {file}:{line} says X but app code {file}:{line} expects Y"}

## Appendix

### Detected Endpoints
{full list of HTTP endpoints found}

### Detected Dependencies
{full list of external dependencies with connection details}

### Detected Configuration Files
{full list of config files and their purpose}

### Detected Resilience Patterns
{circuit breakers, retries, rate limiters, feature flags found}

Step 7 — Output Summary

After writing runbook.md, output to the user:

  • Number of components detected
  • Number of failure runbooks generated
  • Completeness score
  • Any gaps (components with no runbook coverage)
  • Any configuration drift findings

Important Rules

  • Never generate runbooks for components not found in the codebase
  • Every command in a runbook must use real values from the codebase (real ports, real paths, real hostnames)
  • If a config value comes from an environment variable, say so: "check env var $DB_HOST (defined in {file})"
  • Do not modify any source files, this is a read-only scan that only writes runbook.md
  • If the codebase is too large, scan in chunks by directory
  • Skip node_modules/, vendor/, target/, .git/, dist/, build/, __pycache__/, *.min.js, *.map
  • Prefer fewer high-quality runbooks over many shallow ones
  • Do not add comments to generated commands

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results