Agent skill
coreweave-incident-runbook
Incident response runbook for CoreWeave GPU workload failures. Use when inference services are down, when GPUs are unavailable, or when responding to production incidents on CoreWeave. Trigger with phrases like "coreweave incident", "coreweave outage", "coreweave runbook", "coreweave service down".
Install this agent skill to your Project
npx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/coreweave-pack/skills/coreweave-incident-runbook
SKILL.md
CoreWeave Incident Runbook
Triage Steps
# 1. Check pod status
kubectl get pods -l app=inference -o wide
# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20
# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide
# 4. Check GPU health
kubectl exec "$(kubectl get pod -l app=inference -o name | head -1)" -- nvidia-smi
Common Incidents
Inference Service Down
- Check pod status and events
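- If OOMKilled (the container exceeded its memory limit): reduce batch size, raise the pod's memory limit, or upgrade GPU
- If ImagePullBackOff: check registry credentials and the image name and tag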
- If Pending: check GPU quota and availability
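The checks above can be sketched as a small shell helper that maps a pod state reason to the matching action from the list. The function names `suggest_action` and `list_pod_reasons` are hypothetical, and the `app=inference` label is taken from the triage steps. `ErrImagePull` is included because Kubernetes reports it before backing off to `ImagePullBackOff`. Treat this as a starting point under those assumptions, not as an official CoreWeave tool.

```shell
#!/bin/sh
# suggest_action REASON
# Hypothetical helper: prints the runbook action for a given pod state reason.
suggest_action() {
  case "$1" in
    OOMKilled)                     echo "reduce batch size or upgrade GPU" ;;
    ImagePullBackOff|ErrImagePull) echo "check registry credentials" ;;
    Pending)                       echo "check GPU quota and availability" ;;
    *)                             echo "check pod status and events" ;;
  esac
}

# list_pod_reasons
# Prints one tab-separated line per pod: name, phase, waiting reasons,
# last terminated reasons. Needs cluster access; fields that are absent
# are printed as empty strings.
list_pod_reasons() {
  kubectl get pods -l app=inference -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}
```

Running `list_pod_reasons` and passing the reason column of a failing pod to `suggest_action` gives the next step to try. `kubectl describe pod <pod-name>` shows the same reasons with their full event history.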
GPU Node Failure
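- Pods managed by a Deployment or another controller are rescheduled automatically once the failed node is marked NotReady and its pods are evicted (by default after about five minutes); bare pods are not recreated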
- If no capacity: scale down non-critical workloads
- Contact CoreWeave support for extended outages
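A minimal sketch of the node-failure steps above: a filter that picks out unhealthy nodes from `kubectl get nodes` output, followed by example commands for freeing capacity. The function name `not_ready_nodes` and the deployment name `batch-jobs` are hypothetical; the `gpu.nvidia.com/class` label comes from the triage steps.

```shell
#!/bin/sh
# not_ready_nodes
# Reads `kubectl get nodes --no-headers` output on stdin and prints the names
# of nodes whose STATUS column does not start with "Ready". Cordoned but
# healthy nodes ("Ready,SchedulingDisabled") are not reported.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Example usage (needs cluster access):
#   kubectl get nodes -l gpu.nvidia.com/class --no-headers | not_ready_nodes
#   kubectl describe node <node-name>                  # inspect node conditions
#   kubectl scale deployment/batch-jobs --replicas=0   # hypothetical non-critical workload
```

Scaling a non-critical deployment to zero releases its GPUs so that pending inference pods can be scheduled; scale it back up once the failed node recovers.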
Model Loading Failure
- Check HuggingFace token secret exists
- Verify model name spelling
- Check PVC has sufficient storage
- Review container logs for download errors
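The model-loading checks above might look like the following. The secret name `hf-token` and the mount path `/models` are assumptions, not names from this runbook, so substitute whatever the deployment actually uses. `download_errors` is a hypothetical, deliberately coarse log filter; it can match unrelated lines that happen to contain the same numbers.

```shell
#!/bin/sh
# download_errors
# Reads container logs on stdin and prints lines that look like download
# failures (auth errors, missing models, full disks). Prints nothing and
# still succeeds when no line matches.
download_errors() {
  grep -iE 'error|401|403|404|no space left' || true
}

# Example usage (needs cluster access):
#   kubectl get secret hf-token               # hypothetical secret name
#   kubectl get pvc                           # claim status and capacity
#   kubectl exec <pod-name> -- df -h /models  # hypothetical mount path
#   kubectl logs deployment/inference --tail=200 | download_errors
```

A 401 or 403 in the output usually points at the token, a 404 at the model name, and "no space left" at the PVC.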
Rollback
kubectl rollout undo deployment/inference
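It can help to look at the revision history before rolling back and to wait for the rollout to finish afterwards. The wrapper below is a sketch: the function name `rollback_inference` and the 120-second timeout are assumptions, while the `kubectl rollout` subcommands are standard.

```shell
#!/bin/sh
# rollback_inference [DEPLOYMENT]
# Hypothetical wrapper: shows revision history, rolls back to the previous
# revision, then waits for the rollout to complete. Stops at the first
# failing step.
rollback_inference() {
  deploy="${1:-deployment/inference}"
  kubectl rollout history "$deploy" &&
  kubectl rollout undo "$deploy" &&
  kubectl rollout status "$deploy" --timeout=120s
}
```

To return to a specific revision instead of the previous one, `kubectl rollout undo` accepts `--to-revision=<n>`, with the number taken from the history output.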
Next Steps
For data handling, see coreweave-data-handling.