Agent skill
k8s-incident
Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.
Install this agent skill to your Project
npx add-skill https://github.com/rohitg00/kubectl-mcp-server/tree/main/kubernetes-skills/claude/k8s-incident
Metadata
Additional technical details for this skill
- tools: 15
- author: rohitg00
- version: 1.0.0
- category: observability
SKILL.md
Kubernetes Incident Response
Runbooks and diagnostic workflows for common Kubernetes incidents.
When to Apply
Use this skill when:
- User mentions: "incident", "outage", "emergency", "down", "not working"
- Operations: emergency response, production issues, service degradation
- Keywords: "urgent", "broken", "fix", "restore", "recover"
Priority Rules
| Priority | Rule | Impact | Tools |
|---|---|---|---|
| 1 | Check control plane first | CRITICAL | get_pods(namespace="kube-system") |
| 2 | Assess node health | CRITICAL | get_nodes |
| 3 | Gather events before changes | HIGH | get_events |
| 4 | Document timeline | HIGH | Manual notes |
| 5 | Rollback if safe | MEDIUM | rollback_deployment |
Quick Reference
| Incident | First Tool | Next Steps |
|---|---|---|
| Pod failure | get_pod_logs(previous=True) | describe_pod, get_events |
| Node down | describe_node | Check kubelet logs |
| Service unreachable | get_endpoints | get_network_policies |
| Control plane | get_pods(namespace="kube-system") | Check API server logs |
Incident Triage
Quick Health Check
get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)
Severity Assessment
| Indicator | Severity | Action |
|---|---|---|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |
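The severity table can be read as an ordered decision rule, with critical indicators checked first. The sketch below is illustrative only: `assess_severity` and its parameters are hypothetical names, not tools provided by this skill, and the counts are assumed to have been gathered already from get_nodes and get_pods. Treating more than one crash-looping pod the same as a single one is also an assumption, since the table only covers the single-pod case.

```python
def assess_severity(not_ready_nodes: int, failing_system_pods: int,
                    crashlooping_pods: int, high_latency: bool) -> tuple[str, str]:
    """Map the indicators from the severity table to (severity, action)."""
    # Critical indicators are checked first so they always win.
    if not_ready_nodes > 1:
        return ("Critical", "Escalate immediately")
    if failing_system_pods > 0:
        return ("Critical", "Control plane issue")
    # Assumption: several crash-looping pods are handled like a single one.
    if crashlooping_pods >= 1:
        return ("Medium", "Debug pod")
    if high_latency:
        return ("Medium", "Check resources")
    # Anything not covered by the table is left for further triage.
    return ("Unclassified", "Continue triage")


print(assess_severity(not_ready_nodes=2, failing_system_pods=0,
                      crashlooping_pods=1, high_latency=False))
```

Ordering the checks this way means a crash-looping pod never masks a cluster-wide problem.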
Runbook: Pod Failures
CrashLoopBackOff
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
get_events(namespace, field_selector="involvedObject.name=<pod>")
get_pod_metrics(name, namespace)
Common Causes:
- OOMKilled → Increase memory limits
- Exit code 1 → Application error in logs
- Exit code 137 → SIGKILL (128 + 9), commonly sent by the OOM killer
- Exit code 143 → SIGTERM (128 + 15), a graceful shutdown request
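Exit codes above 128 follow the usual convention of 128 plus the signal number, which is how 137 and 143 map to SIGKILL and SIGTERM. A minimal sketch of that arithmetic follows; `explain_exit_code` and `SIGNAL_NAMES` are hypothetical helpers for illustration, not part of the skill.

```python
# Signal numbers for the two signals named in the list above (Linux values).
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Describe a container exit code using the 128 + signal convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return "application error, check logs"


for code in (1, 137, 143):
    print(code, "->", explain_exit_code(code))
```

An exit code of 137 alone does not prove an out-of-memory kill. Confirm it against the OOMKilled reason in the describe_pod output.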
ImagePullBackOff
describe_pod(name, namespace)
get_secrets(namespace)
Pending Pod
describe_pod(name, namespace)
get_nodes()
get_events(namespace)
Runbook: Node Issues
Node NotReady
describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")
Node DiskPressure
describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")
Runbook: Network Issues
Service Not Accessible
get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
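When get_endpoints comes back empty, a frequent cause is a Service selector that no pod's labels satisfy. The check can be sketched with plain dictionaries, as below. `selector_matches` and the sample data are hypothetical. In practice the selector and labels would come from the get_services and get_pods output.

```python
def selector_matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True if every selector key/value pair is present in the pod labels."""
    # A Service without a selector does not get endpoints created automatically.
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical sample data standing in for real cluster output.
service_selector = {"app": "web", "tier": "frontend"}
pods = {
    "web-1": {"app": "web", "tier": "frontend", "version": "v2"},
    "web-2": {"app": "web", "tier": "backend"},
}

matching = [name for name, labels in pods.items()
            if selector_matches(service_selector, labels)]
print(matching)
```

Pods that match the selector but are not Ready are also excluded from the ready endpoints, so check readiness as well as labels.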
DNS Resolution Failures
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")
With Cilium
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)
With Istio
istio_analyze_tool(namespace)
istio_proxy_status_tool()
Runbook: Storage Issues
PVC Pending
describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)
Pod Stuck in ContainerCreating
describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)
Runbook: Control Plane Issues
API Server Unavailable
get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")
etcd Issues
get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")
Emergency Actions
Force Delete Pod
delete_pod(name, namespace, grace_period=0, force=True)
Rollback Deployment
rollback_deployment(name, namespace, revision=0)
Helm Rollback
rollback_helm_release(name, namespace, revision=1)
Diagnostic Collection Script
For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.
Multi-Cluster Incident Response
Check all clusters:
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
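During an incident one cluster may be unreachable, and a sweep across contexts is more useful if that failure is recorded rather than aborting the loop. The sketch below shows one way to structure this. `check_clusters` and `fake_get_nodes` are hypothetical stand-ins, and the real tools are invoked through the agent rather than as importable Python functions.

```python
from typing import Any, Callable


def check_clusters(contexts: list[str],
                   checks: dict[str, Callable[[str], Any]]) -> dict[str, dict[str, Any]]:
    """Run each check against each context, recording errors instead of raising."""
    report: dict[str, dict[str, Any]] = {}
    for context in contexts:
        report[context] = {}
        for label, check in checks.items():
            try:
                report[context][label] = check(context)
            except Exception as exc:
                # Keep going so one failing cluster does not hide the others.
                report[context][label] = f"ERROR: {exc}"
    return report


def fake_get_nodes(context: str) -> list[str]:
    """Hypothetical stand-in for get_nodes(context=...)."""
    if context == "prod-2":
        raise ConnectionError("cluster unreachable")
    return ["node-a Ready"]


report = check_clusters(["prod-1", "prod-2"], {"nodes": fake_get_nodes})
for context, results in report.items():
    print(context, results)
```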
Post-Incident
Document Timeline
- When did the incident start?
- What was the impact?
- What was the root cause?
- What fixed it?
Prevent Recurrence
- Add monitoring/alerting
- Improve resource limits
- Add readiness probes
- Document runbook
Related Skills
- k8s-troubleshoot - Detailed debugging
- k8s-security - Security incidents
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
k8s-multicluster
Manage multiple Kubernetes clusters, switch contexts, and perform cross-cluster operations. Use when working with multiple clusters, comparing environments, or managing cluster lifecycle.
k8s-gitops
Manage GitOps workflows with Flux and ArgoCD. Use for sync status, reconciliation, app management, source management, and GitOps troubleshooting.
k8s-autoscaling
Configure Kubernetes autoscaling with HPA, VPA, and KEDA. Use for horizontal/vertical pod autoscaling, event-driven scaling, and capacity management.
k8s-deploy
Deploy and manage Kubernetes workloads with progressive delivery. Use for deployments, rollouts, blue-green, canary releases, scaling, and release management.
k8s-cost
Optimize Kubernetes costs through resource right-sizing, unused resource detection, and cluster efficiency analysis. Use for cost optimization, resource analysis, and capacity planning.
k8s-rollouts
Progressive delivery with Argo Rollouts and Flagger. Use when implementing canary deployments, blue-green deployments, or traffic shifting strategies.