Agent skill
troubleshoot
Diagnoses Kubernetes pod issues automatically. Use when pods are in CrashLoopBackOff, ImagePullBackOff, Pending, or Error state. Analyzes events, logs, and resource status to identify root causes.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/troubleshoot-ziwon-homelab
SKILL.md
Kubernetes Troubleshoot
Automatically diagnose common Kubernetes issues.
Trigger Phrases
- "왜 안돼" ("why isn't it working"), "왜 안되지" ("why won't it work"), "pod가 죽어" ("the pod keeps dying"), "에러나" ("I'm getting an error")
- "troubleshoot", "debug", "diagnose"
- "CrashLoopBackOff", "ImagePullBackOff", "Pending", "Error"
Diagnostic Flow
1. Get Pod Status
export KUBECONFIG="$HOME/.kube/config.home"
kubectl get pods -n <namespace> <pod-name> -o wide
2. Check Events
kubectl describe pod -n <namespace> <pod-name> | grep -A 20 "Events:"
3. Get Logs
# Current container
kubectl logs -n <namespace> <pod-name> --tail=100
# Previous container (if restarting)
kubectl logs -n <namespace> <pod-name> --previous --tail=100
# All containers in pod
kubectl logs -n <namespace> <pod-name> --all-containers=true --tail=50
4. Check Resource Status
# Node resources
kubectl top nodes
# Pod resources
kubectl top pods -n <namespace>
# PVC status
kubectl get pvc -n <namespace>
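The four steps above can be chained into one helper. This is a minimal sketch, not part of the original skill: the function name `collect_diagnostics` and the section labels are hypothetical, and it assumes `KUBECONFIG` is already exported as in step 1. Each step falls back to a short message so that one failing command (for example `kubectl top` on a cluster without metrics-server) does not hide the rest.

```shell
# Hypothetical helper: runs diagnostic steps 1-4 for a single pod.
# Usage: collect_diagnostics <namespace> <pod-name>
collect_diagnostics() {
  ns="$1"
  pod="$2"

  echo "== status =="
  kubectl get pods -n "$ns" "$pod" -o wide

  echo "== events =="
  kubectl describe pod -n "$ns" "$pod" | grep -A 20 "Events:" \
    || echo "(no events section found)"

  echo "== logs (current) =="
  kubectl logs -n "$ns" "$pod" --all-containers=true --tail=50

  echo "== logs (previous) =="
  # --previous fails when the container has never restarted.
  kubectl logs -n "$ns" "$pod" --previous --tail=100 2>/dev/null \
    || echo "(no previous container)"

  echo "== resources =="
  # kubectl top needs metrics-server in the cluster.
  kubectl top pods -n "$ns" || echo "(metrics unavailable)"
  kubectl get pvc -n "$ns"
}
```

Calling `collect_diagnostics <namespace> <pod-name>` prints all sections in order, which makes it easier to quote evidence in the final diagnosis.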
Common Issues & Solutions
CrashLoopBackOff
- Check logs for application errors
- Verify environment variables (Infisical secrets mounted?)
- Check resource limits (OOMKilled?)
- Validate startup/liveness probes
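To confirm whether a restart was caused by the OOM killer or by an application error, the last termination state can be read straight from the pod status. A minimal sketch follows; the function name `last_termination` is hypothetical, and fields that are absent simply print as empty columns.

```shell
# Hypothetical helper: prints name, restart count, last termination
# reason, and exit code for each container in the pod, tab-separated.
# Usage: last_termination <namespace> <pod-name>
last_termination() {
  ns="$1"
  pod="$2"
  fmt='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}'
  fmt="$fmt"'{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
  kubectl get pod -n "$ns" "$pod" -o jsonpath="$fmt"
}
```

A reason of `OOMKilled` points at memory limits, while `Error` with a non-zero exit code usually means the logs from `--previous` are the next place to look.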
ImagePullBackOff
- Verify image name and tag
- Check that imagePullSecrets are configured
- Test registry connectivity
- Confirm image exists in registry
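The first two checks can be done without leaving the terminal by printing the image references and pull secrets the pod actually uses. This is a sketch with a hypothetical function name, `pull_config`; it only covers regular containers, not init containers.

```shell
# Hypothetical helper: lists the images and imagePullSecrets of a pod.
# Usage: pull_config <namespace> <pod-name>
pull_config() {
  ns="$1"
  pod="$2"

  echo "images:"
  kubectl get pod -n "$ns" "$pod" \
    -o jsonpath='{range .spec.containers[*]}{.image}{"\n"}{end}'

  echo "imagePullSecrets:"
  kubectl get pod -n "$ns" "$pod" \
    -o jsonpath='{range .spec.imagePullSecrets[*]}{.name}{"\n"}{end}'
}
```

An empty list under `imagePullSecrets:` for an image hosted in a private registry is a likely cause of the pull failure.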
Pending
- Check node resources (CPU/Memory)
- Verify nodeSelector/affinity matches
- Check PVC binding status
- Review taints/tolerations
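For a Pending pod, the scheduler normally records why no node was suitable in a `FailedScheduling` event. The sketch below, using the hypothetical function names `scheduling_events` and `node_taints`, pulls those events and lists taint keys per node so they can be compared with the pod's tolerations.

```shell
# Hypothetical helper: shows scheduler failures recorded for one pod.
# Usage: scheduling_events <namespace> <pod-name>
scheduling_events() {
  ns="$1"
  pod="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,reason=FailedScheduling" \
    --sort-by=.lastTimestamp
}

# Hypothetical helper: lists the taint keys set on every node.
node_taints() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'
}
```

The event message typically names the failing predicate (insufficient CPU or memory, an unbound PVC, an untolerated taint), which maps onto the checklist above.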
OOMKilled
- Increase memory limits
- Check for memory leaks in application
- Review JVM heap settings (if Java)
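The checklists above can be treated as a lookup keyed by the observed pod state. A small sketch, with a hypothetical function name `suggest_checks`, that returns the relevant checks as a one-line summary:

```shell
# Hypothetical helper: maps a pod state to the checks listed above.
# Usage: suggest_checks <state>
suggest_checks() {
  case "$1" in
    CrashLoopBackOff)
      echo "logs, environment variables, resource limits, probes" ;;
    ImagePullBackOff)
      echo "image name and tag, imagePullSecrets, registry connectivity" ;;
    Pending)
      echo "node resources, nodeSelector/affinity, PVC binding, taints/tolerations" ;;
    OOMKilled)
      echo "memory limits, memory leaks, JVM heap settings" ;;
    *)
      echo "collect events and logs first" ;;
  esac
}
```

For example, `suggest_checks Pending` prints the four scheduling-related checks, and any unrecognized state falls back to gathering events and logs.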
GPU Issues
# Check GPU operator
kubectl get pods -n gpu-operator
# Check device plugin
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=50
# Check GPU allocation
kubectl describe node | grep -A 5 "nvidia.com/gpu"
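The `grep` over `kubectl describe node` output can be noisy on clusters with several nodes. As an alternative sketch (the function name `gpu_capacity` is hypothetical), the allocatable GPU count can be printed as one row per node:

```shell
# Hypothetical helper: prints allocatable nvidia.com/gpu per node.
# The backslash escapes the dot inside the resource name.
gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU_ALLOCATABLE:.status.allocatable.nvidia\.com/gpu'
}
```

A node that is expected to have a GPU but shows no value in the second column suggests the device plugin has not registered the resource yet.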
ArgoCD Sync Issues
# App status
argocd app get <app-name>
# Preview a sync without applying changes
argocd app sync <app-name> --dry-run
# Resource diff
argocd app diff <app-name>
# Refresh application state (use --hard-refresh to also bypass the manifest cache)
argocd app get <app-name> --refresh
Output Format
Provide diagnosis in this format:
## Issue Summary
[Brief description of the problem]
## Root Cause
[Identified cause with evidence]
## Solution
[Step-by-step fix]
## Prevention
[How to avoid this in the future]
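If the diagnosis is assembled by a script rather than written by hand, the template can be filled from four arguments. A minimal sketch with a hypothetical function name, `print_report`; the blank lines between sections are an added assumption for readability.

```shell
# Hypothetical helper: prints a diagnosis in the format above.
# Usage: print_report <summary> <root-cause> <solution> <prevention>
print_report() {
  cat <<EOF
## Issue Summary
$1

## Root Cause
$2

## Solution
$3

## Prevention
$4
EOF
}
```

Each argument may span multiple lines, so a numbered step-by-step fix can be passed as the third argument unchanged.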
Reference
- @.claude/rules/kubernetes.md
- @.claude/rules/argocd-apps.md