platform-health

Check comprehensive health across the Kagenti platform, including ArgoCD apps, pods, services, certificates, and resource usage.


Install this agent skill in your project:

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/platform-health

SKILL.md

Platform Health Check Skill

This skill helps you perform comprehensive platform health checks and identify issues quickly.

When to Use

  • After deployments or cluster restarts
  • Before making changes (baseline health)
  • During incident investigation
  • Regular health monitoring
  • After running tests
  • User requests "check platform" or "is everything working"

What This Skill Does

  1. Quick Health Overview: One-command platform status
  2. ArgoCD Apps: Health and sync status of all applications
  3. Pod Health: Check pods across all namespaces
  4. Service Accessibility: Test Gateway routes and certificates
  5. Resource Usage: CPU/memory consumption
  6. Component-Specific Checks: Detailed validation per component

Quick Health Check

Comprehensive Platform Status

bash
# Single command for full platform health (includes pytest tests)
./scripts/platform-status.sh

# What it checks:
# ✓ ArgoCD applications (health & sync status)
# ✓ Platform pods (all namespaces)
# ✓ Gateway & certificates
# ✓ Istio mTLS configuration
# ✓ Service accessibility (via Gateway)
# ✓ OAuth authentication
# ✓ Integration tests (pytest)

Expected Output:

=== ArgoCD Applications Status ===
✓ gateway-api: Healthy, Synced
✓ cert-manager: Healthy, Synced
✓ istio-base: Healthy, Synced
...

=== Platform Pods ===
observability    grafana-xxx         2/2     Running
observability    prometheus-xxx      2/2     Running
...

=== Gateway & Certificates ===
✓ external-gateway: Programmed
✓ grafana-cert: Ready
...

=== Integration Tests ===
PASSED tests/validation/test_app_state.py::test_critical_apps
...

Quick Status Commands

bash
# ArgoCD apps summary
argocd app list --port-forward --port-forward-namespace argocd --grpc-web

# All pods summary
kubectl get pods -A

# Failing pods only
kubectl get pods -A | grep -vE "Running|Completed"

# Service endpoints
kubectl get svc -A

# Gateway status
kubectl get gateway -A

# Certificate status
kubectl get certificate -A
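
If you only want the red flags, the filters above can be combined into one sweep. A minimal sketch, assuming default kubectl column layouts:

bash
# Red-flags-only sweep: print anything that deserves attention
echo "--- Pods not Running/Completed ---"
kubectl get pods -A --no-headers | grep -vE "Running|Completed"
echo "--- Certificates not Ready (READY column) ---"
kubectl get certificate -A --no-headers | awk '$3 != "True"'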

Detailed Health Checks

1. ArgoCD Application Health

bash
# List all apps with health status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
  -o json | jq -r '.[] | "\(.metadata.name): \(.status.health.status), \(.status.sync.status)"'

# Check for unhealthy apps
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
  | grep -E "Degraded|OutOfSync|Unknown|Missing"

# Get details for specific app
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web

# Check app sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web

Expected States:

  • Health: Healthy (✓), Progressing (⚠️), Degraded (❌), Missing (❌)
  • Sync: Synced (✓), OutOfSync (⚠️)

Critical Apps (must be Healthy; a check sketch follows these lists):

  • gateway-api
  • cert-manager
  • istio-base, istiod
  • tekton-pipelines
  • keycloak
  • kagenti-operator, kagenti-platform-operator
  • kagenti-platform
  • kagenti-ui

Optional Apps (can be Progressing):

  • observability (large images, slow startup)
  • kiali
  • ollama
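
The critical list above can be asserted in one pass. A minimal sketch, assuming the same argocd port-forward flags used throughout and jq installed:

bash
# Flag any critical app that is not Healthy and Synced
CRITICAL="gateway-api cert-manager istio-base istiod tekton-pipelines keycloak kagenti-operator kagenti-platform-operator kagenti-platform kagenti-ui"
argocd app list --port-forward --port-forward-namespace argocd --grpc-web -o json \
  | jq -r '.[] | "\(.metadata.name) \(.status.health.status) \(.status.sync.status)"' \
  | while read -r name health sync; do
      case " $CRITICAL " in
        *" $name "*)
          [ "$health" = "Healthy" ] && [ "$sync" = "Synced" ] \
            || echo "NOT OK: $name ($health, $sync)" ;;
      esac
    done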

2. Pod Health by Namespace

bash
# All pods with status
kubectl get pods -A -o wide

# Pods sorted by restarts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20

# Pods with issues
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Pod resource usage
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu

# Specific namespace health
kubectl get pods -n observability
kubectl get pods -n keycloak
kubectl get pods -n kagenti-system

Check for these statuses:

  • ❌ CrashLoopBackOff: Application crashes on startup
  • ❌ ImagePullBackOff: Image not available
  • ❌ Error: Container exited with an error
  • ⚠️ Pending: Waiting for resources or scheduling
  • ⚠️ Init: Init containers still running
  • ✓ Running: Pod healthy
  • ✓ Completed: Job finished successfully
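
A quick way to spot patterns is a status histogram; anything beyond Running and Completed deserves a closer look:

bash
# Count pods by status across the cluster (STATUS is column 4 with -A)
kubectl get pods -A --no-headers | awk '{print $4}' | sort | uniq -c | sort -rn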

3. Service Accessibility

bash
# Test all platform services via Gateway
for service in grafana prometheus tempo phoenix kiali keycloak kagenti; do
  echo "=== Testing https://$service.localtest.me:9443/ ==="
  curl -k -I -m 5 "https://$service.localtest.me:9443/" 2>&1 | head -3
  echo
done

# Check Gateway status
kubectl get gateway -A
kubectl describe gateway external-gateway -n default

# Check HTTPRoutes
kubectl get httproute -A
kubectl describe httproute <route-name> -n <namespace>

# Check service endpoints; entries showing "<none>" have no ready backends
kubectl get endpoints -A | grep "<none>"

Expected Results:

  • Grafana: HTTP/2 302 (redirect to /login)
  • Prometheus: HTTP/2 302 (OAuth redirect)
  • Keycloak: HTTP/2 200
  • Kagenti UI: HTTP/2 200
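
These expectations can be turned into a small assertion loop. A sketch; the status codes come from the list above and may differ in your deployment:

bash
# Assert expected HTTP status codes per service
check() {
  code=$(curl -k -s -o /dev/null -m 5 -w '%{http_code}' "https://$1.localtest.me:9443/")
  [ "$code" = "$2" ] && echo "✓ $1: $code" || echo "✗ $1: got $code, expected $2"
}
check grafana 302
check prometheus 302
check keycloak 200
check kagenti 200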

4. Certificate Health

bash
# All certificates status
kubectl get certificate -A

# Check certificate details
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs for issues
kubectl logs -n cert-manager deployment/cert-manager --tail=50

# Verify certificate expiration
kubectl get certificate -A -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): expires \(.status.notAfter)"'

Expected State: All certificates show Ready=True
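
To surface only the problem certificates, a jq sketch over cert-manager's standard Ready condition:

bash
# Print certificates whose Ready condition is not True
kubectl get certificate -A -o json | jq -r '
  .items[]
  | select(any(.status.conditions[]?; .type == "Ready" and .status == "True") | not)
  | "\(.metadata.namespace)/\(.metadata.name)"'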

5. Istio Service Mesh Health

bash
# Check Istio components
kubectl get pods -n istio-system

# Verify sidecar injection (should show 2/2 containers)
kubectl get pods -A -o wide | grep "2/2"

# Check mTLS policies
kubectl get peerauthentication -A
kubectl get destinationrule -A

# Istio proxy status
istioctl proxy-status

# Check specific pod mesh config
istioctl x describe pod <pod-name> -n <namespace>
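
To catch pods that should be in the mesh but lack a sidecar, a sketch assuming namespaces are labeled istio-injection=enabled (revision-based injection uses a different label):

bash
# List pods in injected namespaces that have no istio-proxy container
for ns in $(kubectl get ns -l istio-injection=enabled -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get pods -n "$ns" -o json | jq -r --arg ns "$ns" '
    .items[]
    | select([.spec.containers[].name] | index("istio-proxy") | not)
    | "\($ns)/\(.metadata.name): no sidecar"'
done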

6. Resource Usage

bash
# Node resources
kubectl top nodes

# Cluster-wide pod resources
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20

# Namespace resource usage
kubectl top pods -n observability
kubectl top pods -n keycloak
kubectl top pods -n kagenti-system

# Check for resource pressure
kubectl get nodes -o json | jq -r '.items[] as $n | $n.status.conditions[] | select(.type=="MemoryPressure" or .type=="DiskPressure") | "\($n.metadata.name): \(.type)=\(.status)"'
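
When pods sit Pending, it also helps to see how much of each node is already requested. The "Allocated resources" table from kubectl describe gives a quick view (the -A line count may need adjusting):

bash
# Requested vs allocatable per node
kubectl describe nodes | grep -A 7 "Allocated resources"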

7. Storage Health

bash
# PersistentVolumes
kubectl get pv

# PersistentVolumeClaims
kubectl get pvc -A

# Check PVC usage via metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100' \
  | python3 -m json.tool
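
A quick filter for claims that never bound, a common cause of Pending pods:

bash
# PVCs not in Bound phase (STATUS is column 3 with -A)
kubectl get pvc -A --no-headers | awk '$3 != "Bound"'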

Component-Specific Health Checks

Observability Stack

bash
# Prometheus
kubectl get pods -n observability -l app=prometheus
kubectl exec -n observability deployment/grafana -- \
  curl -s http://prometheus.observability.svc:9090/-/ready

# Grafana
kubectl get pods -n observability -l app=grafana
curl -k -I https://grafana.localtest.me:9443/api/health

# Loki
kubectl get pods -n observability -l app=loki
kubectl exec -n observability deployment/grafana -- \
  curl -s http://loki.observability.svc:3100/ready

# Tempo
kubectl get pods -n observability -l app=tempo
kubectl exec -n observability deployment/grafana -- \
  curl -s http://tempo-query-frontend.observability.svc:3100/ready

# Phoenix
kubectl get pods -n observability -l app=phoenix
curl -k -I https://phoenix.localtest.me:9443/

# AlertManager
kubectl get pods -n observability -l app=alertmanager
kubectl exec -n observability deployment/alertmanager -c alertmanager -- \
  wget -qO- http://localhost:9093/-/ready

Authentication & Authorization

bash
# Keycloak
kubectl get pods -n keycloak -l app=keycloak
kubectl exec -n keycloak statefulset/keycloak -- \
  curl -s http://localhost:8080/health/ready | python3 -m json.tool

# OAuth2-Proxy instances
kubectl get pods -n oauth2-proxy
kubectl get deployment -n oauth2-proxy

# Test Keycloak SSO
curl -k "https://keycloak.localtest.me:9443/realms/master/.well-known/openid-configuration"
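
To confirm the realm's endpoints resolve as expected, the discovery document can be piped through jq (field names follow the OIDC discovery spec):

bash
# Extract the key endpoints from the OIDC discovery document
curl -sk "https://keycloak.localtest.me:9443/realms/master/.well-known/openid-configuration" \
  | jq -r '.issuer, .authorization_endpoint, .token_endpoint'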

Platform Components

bash
# Kagenti Operator
kubectl get pods -n kagenti-operator
kubectl logs -n kagenti-operator deployment/kagenti-operator --tail=20

# Kagenti Platform Operator
kubectl get pods -n kagenti-platform-operator
kubectl logs -n kagenti-platform-operator deployment/kagenti-platform-operator --tail=20

# Kagenti UI
kubectl get pods -n kagenti-platform -l app=kagenti-ui
curl -k -I https://kagenti.localtest.me:9443/

# Tekton Pipelines
kubectl get pods -n tekton-pipelines
kubectl get pipelineruns -A

Health Check Checklists

Post-Deployment Health Check

  • All ArgoCD apps Healthy and Synced
  • No pods in CrashLoopBackOff/ImagePullBackOff
  • All services have endpoints
  • All certificates Ready
  • All Gateway routes Programmed
  • Services accessible via browser
  • Integration tests passing
  • No firing critical alerts
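
Much of this checklist can be scripted. A minimal sketch covering the first two items; extend with the gateway, route, and test checks as needed:

bash
# Post-deployment sweep: exit non-zero if anything looks wrong
fail=0
bad=$(kubectl get pods -A --no-headers | grep -vE "Running|Completed")
[ -n "$bad" ] && { echo "Failing pods:"; echo "$bad"; fail=1; }
bad=$(kubectl get certificate -A --no-headers | awk '$3 != "True"')
[ -n "$bad" ] && { echo "Certificates not Ready:"; echo "$bad"; fail=1; }
[ "$fail" -eq 0 ] && echo "Basic checks passed"
exit "$fail"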

Pre-Change Health Check

  • Capture platform snapshot: ./scripts/capture-platform-snapshot.sh before-change
  • All critical apps Healthy
  • No existing incidents in TODO_INCIDENTS.md
  • Resource usage within limits
  • Recent Git commits validated

Incident Investigation Health Check

  • Identify degraded components
  • Check recent events
  • Collect logs from affected pods
  • Query metrics for anomalies
  • Check for correlated failures
  • Review recent changes (Git history)

Common Health Issues

Issue: Pods stuck in Pending

bash
# Check pod description for reason
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Insufficient CPU/memory
# - No nodes matching nodeSelector
# - Unbound PersistentVolumeClaim

Issue: Pods CrashLoopBackOff

bash
# Check previous logs
kubectl logs <pod-name> -n <namespace> --previous

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Common causes:
# - Application error on startup
# - Missing configuration
# - Dependency not available

Issue: Service not accessible

bash
# Check pod status
kubectl get pods -n <namespace> -l app=<service>

# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>

# Check HTTPRoute
kubectl get httproute -n <namespace>

# Test from inside cluster
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
  -- curl http://<service-name>.<namespace>.svc:<port>

Issue: Certificate not Ready

bash
# Check certificate status
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager

# Common causes:
# - DNS validation failing
# - Rate limit reached
# - Invalid configuration

Issue: High resource usage

bash
# Find top consumers
kubectl top pods -A --sort-by=memory | head -10
kubectl top pods -A --sort-by=cpu | head -10

# Check for memory leaks
kubectl logs <pod-name> -n <namespace> | grep -i "out of memory"

# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Limits:"

Automation & Monitoring

Continuous Health Monitoring

bash
# Watch pod status
watch -n 5 'kubectl get pods -A | grep -vE "Running|Completed"'

# Watch ArgoCD apps
watch -n 10 'argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -vE "Healthy.*Synced"'

# Monitor specific namespace
watch -n 5 'kubectl get pods -n observability'

Scheduled Health Checks

bash
# Cron job for periodic health checks (local dev)
# Add to crontab: crontab -e
*/15 * * * * /path/to/kagenti-demo-deployment/scripts/platform-status.sh > /tmp/health-$(date +\%Y\%m\%d-\%H\%M).log 2>&1

# Compare snapshots over time
./scripts/capture-platform-snapshot.sh hourly-check

Related Documentation

Integration with Other Skills

After health check, if issues found:

  • Use investigate-incident skill for RCA
  • Use check-logs skill to examine error logs
  • Use check-metrics skill for performance analysis
  • Use check-alerts skill to see if alerts fired

Pro Tips

  1. Always baseline first: Run health check BEFORE making changes
  2. Use platform-status.sh: Single command for comprehensive check
  3. Capture snapshots: Use capture-platform-snapshot.sh for historical comparison
  4. Check critical apps first: Focus on gateway-api, istio, keycloak, operators
  5. Look for patterns: Multiple pods failing often indicates a cluster-wide issue
  6. Check Git history: Recent commits may explain new issues
  7. Verify after fixes: Always re-run health check after remediation

