Agent skill
castai-common-errors
Diagnose and fix CAST AI agent, API, and autoscaler errors. Use when the CAST AI agent is offline, nodes are not scaling, or API calls return errors. Trigger with phrases like "cast ai error", "cast ai not working", "cast ai agent offline", "cast ai debug", "fix cast ai".
Install this agent skill to your Project
npx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/castai-pack/skills/castai-common-errors
SKILL.md
CAST AI Common Errors
Overview
Diagnostic guide for the 10 most common CAST AI issues, covering agent connectivity, API errors, autoscaler failures, and node provisioning problems.
Prerequisites
- kubectl access to the cluster
- CASTAI_API_KEY configured
- Access to CAST AI console for log correlation
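The prerequisites can be checked before running any of the diagnostics below. The following is a minimal sketch, not part of the CAST AI tooling: the function name castai_preflight is hypothetical, and CASTAI_CLUSTER_ID is included on the assumption that you will run the policy queries, which reference it.

```shell
#!/bin/sh
# Preflight sketch: confirm the tool and variables the checks rely on.
# castai_preflight is a hypothetical helper name, not a CAST AI command.
castai_preflight() {
  status=0
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found" >&2
    status=1
  fi
  for name in CASTAI_API_KEY CASTAI_CLUSTER_ID; do
    # Indirect lookup of the variable named in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage: castai_preflight || echo "fix the items above first"
```

The function only reports what is missing; it does not verify that the key is valid or that kubectl can reach the cluster.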
Error Reference
1. Agent Pod CrashLoopBackOff
kubectl get pods -n castai-agent
kubectl logs -n castai-agent deployment/castai-agent --tail=50
Causes and fixes:
- Invalid API key: Regenerate at console.cast.ai > API
- Wrong provider: Set --set provider=eks|gke|aks correctly in Helm
- RBAC missing: Apply the required ClusterRole and ClusterRoleBinding
- Network blocked: Ensure outbound HTTPS to api.cast.ai is allowed
2. Agent Shows "Disconnected" in Console
# Check agent heartbeat
kubectl logs -n castai-agent deployment/castai-agent | grep -i "heartbeat\|connect\|error"
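# Verify network connectivity from inside the cluster.
# Any HTTP status (even 401, since no key is sent) shows the endpoint is reachable; 000 means no response.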
kubectl run castai-debug --image=curlimages/curl --rm -it --restart=Never -- \
curl -s -o /dev/null -w "%{http_code}" https://api.cast.ai/v1/kubernetes/external-clusters
Fix: Restart the agent pod: kubectl rollout restart deployment/castai-agent -n castai-agent
3. API Returns 401 Unauthorized
# Test API key
curl -s -o /dev/null -w "%{http_code}" \
-H "X-API-Key: ${CASTAI_API_KEY}" \
https://api.cast.ai/v1/kubernetes/external-clusters
# Should return 200, not 401
Fix: Generate a new API key at console.cast.ai > API > API Access Keys.
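The status code printed by curl -w "%{http_code}" can be turned into a next step. This is a minimal sketch: explain_http_code is a hypothetical helper, and the 403 and 5xx interpretations follow general HTTP semantics rather than anything CAST AI documents. curl prints 000 when it received no HTTP response at all.

```shell
#!/bin/sh
# Map the code printed by curl -w "%{http_code}" to a likely next step.
# explain_http_code is a hypothetical helper name.
explain_http_code() {
  case "$1" in
    200) echo "ok: key accepted and API reachable" ;;
    401) echo "unauthorized: regenerate or re-export the API key" ;;
    403) echo "forbidden: key is valid but lacks access to this resource" ;;
    000) echo "no response: DNS, proxy, or firewall may be blocking api.cast.ai" ;;
    5??) echo "server error: check status.cast.ai before changing anything" ;;
    *)   echo "unexpected status $1: inspect the response body" ;;
  esac
}

# Usage:
#   code="$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "X-API-Key: ${CASTAI_API_KEY}" \
#     https://api.cast.ai/v1/kubernetes/external-clusters)"
#   explain_http_code "$code"
```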
4. Nodes Not Scaling Up (Unschedulable Pods)
# Check for pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# Verify unschedulable pods policy is enabled
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
"https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
| jq '.unschedulablePods'
Causes:
- unschedulablePods.enabled is false -- enable it
- Cluster limits reached -- increase clusterLimits.cpu.maxCores
- No matching node template -- check constraints match pod requirements
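The first two causes can be read from the policies response in one pass. The sketch below assumes the response is a JSON object in which unschedulablePods and clusterLimits are top-level keys; that nesting is an assumption, so adjust the paths if your response differs. scale_up_findings is a hypothetical helper name.

```shell
#!/bin/sh
# Read a policies JSON document on stdin and report scale-up blockers.
# Assumes unschedulablePods and clusterLimits are top-level keys.
scale_up_findings() {
  jq -r '
    (if .unschedulablePods.enabled == true then empty
     else "unschedulablePods.enabled is not true" end),
    (.clusterLimits.cpu.maxCores
     | select(. != null)
     | "clusterLimits.cpu.maxCores is \(.)")
  '
}

# Usage:
#   curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
#     "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
#     | scale_up_findings
```

Compare the reported maxCores with the cluster's current core count to judge whether the limit has been reached.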
5. Nodes Not Scaling Down (Empty Nodes)
# Check node downscaler configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
"https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
| jq '.nodeDownscaler'
Causes:
- nodeDownscaler.enabled is false
- Pods with PodDisruptionBudget blocking eviction
- DaemonSet-only nodes with system pods preventing drain
- Delay too high -- reduce emptyNodes.delaySeconds
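To find PodDisruptionBudgets that currently block eviction, one option is to list those whose status allows zero disruptions. This is a sketch using the standard Kubernetes PDB field status.disruptionsAllowed; blocking_pdbs is a hypothetical helper name.

```shell
#!/bin/sh
# Read `kubectl get pdb -o json` output on stdin and print namespace/name
# for every PodDisruptionBudget that currently allows no disruptions.
blocking_pdbs() {
  jq -r '.items[]
    | select(.status.disruptionsAllowed == 0)
    | "\(.metadata.namespace)/\(.metadata.name)"'
}

# Usage: kubectl get pdb --all-namespaces -o json | blocking_pdbs
```

A PDB listed here is not necessarily misconfigured; it may simply reflect pods that are not yet ready.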
6. Spot Instance Fallback Not Working
# Check spot configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
"https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
| jq '.spotInstances'
Fix: Enable spotDiversityEnabled: true and set spotDiversityPriceIncreaseLimitPercent to 20-30 for better availability.
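The change can be prepared locally by patching the fetched policy document. The sketch below assumes both fields live under the spotInstances object, which is an assumption about the response shape; with_spot_diversity is a hypothetical helper name. It only produces the modified JSON and leaves applying it (console, Terraform, or API) to your own workflow.

```shell
#!/bin/sh
# Read a policies JSON document on stdin and emit a copy with spot
# diversity enabled. The first argument is the price increase limit in
# percent (default 20). Assumes the fields live under .spotInstances.
with_spot_diversity() {
  limit="${1:-20}"
  jq --argjson limit "$limit" '
    .spotInstances.spotDiversityEnabled = true
    | .spotInstances.spotDiversityPriceIncreaseLimitPercent = $limit'
}

# Usage:
#   curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
#     "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
#     | with_spot_diversity 25 > policies-patched.json
```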
7. Evictor Too Aggressive
Symptoms: Pods being evicted too frequently, service disruption.
kubectl get events --field-selector reason=Evicted -A --sort-by=.lastTimestamp | tail -20
Fix: Increase evictor cycle interval or switch to non-aggressive mode:
helm upgrade castai-evictor castai-helm/castai-evictor \
-n castai-agent \
--set castai.apiKey="${CASTAI_API_KEY}" \
--set castai.clusterID="${CASTAI_CLUSTER_ID}" \
--set evictor.aggressiveMode=false \
--set evictor.cycleInterval=600
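Before and after changing the evictor settings, it can help to see which namespaces are affected most. This is a sketch that counts events per namespace from kubectl's JSON output; evictions_by_namespace is a hypothetical helper name.

```shell
#!/bin/sh
# Read `kubectl get events -o json` output on stdin and print
# "<namespace> <count>" for each namespace, sorted by namespace.
evictions_by_namespace() {
  jq -r '[.items[].metadata.namespace]
    | group_by(.)
    | map("\(.[0]) \(length)")
    | .[]'
}

# Usage:
#   kubectl get events -A --field-selector reason=Evicted -o json \
#     | evictions_by_namespace
```

Kubernetes keeps events only for a limited time, so the counts cover a recent window rather than the full history.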
8. Terraform State Drift
terraform plan -var-file=environments/prod.tfvars
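# If drift is detected, reconcile state with the real infrastructure.
# terraform refresh is deprecated since Terraform 0.15.4; the equivalent is:
terraform apply -refresh-only -var-file=environments/prod.tfvars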
Fix: Avoid mixing Terraform and console-based policy changes. Pick one source of truth.
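For scripted drift checks, terraform plan accepts -detailed-exitcode, which returns 0 when there are no changes, 1 on error, and 2 when changes are present. The sketch below maps those codes to a message; explain_plan_exit is a hypothetical helper name.

```shell
#!/bin/sh
# Interpret the exit code of `terraform plan -detailed-exitcode`.
explain_plan_exit() {
  case "$1" in
    0) echo "no drift: state matches configuration" ;;
    1) echo "plan failed: fix the error before judging drift" ;;
    2) echo "changes pending: drift or unapplied configuration" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# Usage:
#   terraform plan -detailed-exitcode -var-file=environments/prod.tfvars >/dev/null
#   explain_plan_exit "$?"
```

Exit code 2 does not distinguish console-made changes from configuration edits that have not been applied yet; the plan output itself shows which it is.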
9. Helm Chart Version Mismatch
# Check installed versions
helm list -n castai-agent
helm search repo castai-helm --versions | head -10
# Update to latest
helm repo update
helm upgrade castai-agent castai-helm/castai-agent -n castai-agent \
--reuse-values
10. Workload Autoscaler Not Recommending
kubectl logs -n castai-agent deployment/castai-workload-autoscaler --tail=50
Causes:
- Insufficient metrics data (wait 24h)
- Missing annotation autoscaling.cast.ai/enabled: "true"
- Workload autoscaler pod not running
Escalation Path
- Collect debug info: Helm releases, agent logs, cluster events
- Check https://status.cast.ai for platform issues
- Contact support with cluster ID and screenshots
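The first step can be scripted so the same files are gathered every time. This is a minimal sketch rather than an official bundle format: collect_castai_debug, the output file names, and the 200-line log tail are all arbitrary choices made for illustration.

```shell
#!/bin/sh
# Gather Helm releases, agent logs, and namespace events into a directory.
# collect_castai_debug and the file names are hypothetical choices.
collect_castai_debug() {
  out_dir="${1:-castai-debug}"
  ns="castai-agent"
  mkdir -p "$out_dir" || return 1
  # Each command keeps going on failure so partial output is still saved.
  helm list -n "$ns" > "$out_dir/helm-releases.txt" 2>&1 || true
  kubectl logs -n "$ns" deployment/castai-agent --tail=200 \
    > "$out_dir/agent-logs.txt" 2>&1 || true
  kubectl get events -n "$ns" --sort-by=.lastTimestamp \
    > "$out_dir/events.txt" 2>&1 || true
  echo "$out_dir"
}

# Usage: collect_castai_debug "castai-debug-$(date +%Y%m%d)"
```

Review the collected files for secrets before attaching them to a support request.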
Next Steps
For comprehensive diagnostics, see castai-debug-bundle.