Kubernetes AI Expert
Deploy and operate AI workloads on Kubernetes with GPU scheduling, model serving, and MLOps patterns
Install this agent skill to your Project
npx add-skill https://github.com/frankxai/ai-architect/tree/main/skills/kubernetes-ai
Expert in deploying AI/ML workloads on Kubernetes with GPU scheduling, model serving frameworks, and MLOps patterns.
GPU Workload Scheduling
NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace
GPU Resource Requests
| Resource | Description |
|---|---|
| `nvidia.com/gpu: N` | Request N GPUs |
| `nvidia.com/mig-3g.40gb: 1` | MIG slice |
| Node selector | `nvidia.com/gpu.product` |
| Toleration | `nvidia.com/gpu` |
Full manifests: resources/manifests.yaml
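As a minimal sketch of how the resource name, node selector, and toleration listed above fit together in one pod spec. The pod name, image tag, and GPU product label value are illustrative assumptions and are not taken from resources/manifests.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test              # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    # Assumed label value; list real values with `kubectl get nodes --show-labels`
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  tolerations:
    - key: nvidia.com/gpu           # tolerate the taint placed on GPU nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag
      command: ["nvidia-smi"]       # prints the GPUs visible to the container
      resources:
        limits:
          nvidia.com/gpu: 1         # the request defaults to the limit
```

If the pod stays Pending, the usual suspects are a label value that does not match any node or a taint with a different key or effect.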
Model Serving Frameworks
Framework Comparison
| Framework | Best For | GPU Support | Scaling |
|---|---|---|---|
| vLLM | High-throughput LLMs | Excellent | HPA/KEDA |
| Triton | Multi-model serving | Excellent | HPA |
| TGI | HuggingFace models | Good | HPA |
vLLM Deployment
Key configurations:
- `--tensor-parallel-size`: Multi-GPU inference
- `--max-model-len`: Context window
- `--gpu-memory-utilization`: Memory efficiency
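The three flags above could be wired into a Deployment roughly as follows. This is a sketch under assumptions: the resource names, the model placeholder, and the flag values are illustrative, and the shared-memory volume is a common addition for multi-GPU runs rather than something the source specifies.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm                        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest      # pin a specific tag in practice
          args:
            - --model=ORG/MODEL_NAME          # placeholder, replace with a real model ID
            - --tensor-parallel-size=2        # shard the model across 2 GPUs
            - --max-model-len=8192            # maximum context length in tokens (assumed value)
            - --gpu-memory-utilization=0.90   # fraction of GPU memory vLLM may claim
          ports:
            - containerPort: 8000             # vLLM's default API port
          resources:
            limits:
              nvidia.com/gpu: 2               # keep equal to --tensor-parallel-size
          volumeMounts:
            - name: shm
              mountPath: /dev/shm             # shared memory for inter-GPU communication
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
```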
Triton Inference Server
- Multi-model serving from S3/GCS
- HTTP (8000), gRPC (8001), Metrics (8002)
- Model polling for dynamic updates
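A sketch of a Triton Deployment that reflects the points above: a remote model repository, the three default ports, and polling for model changes. The bucket path and image tag are assumptions, and credentials for the object store (for example AWS environment variables) are omitted.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton                      # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.05-py3   # assumed tag
          command: ["tritonserver"]
          args:
            - --model-repository=s3://example-models/triton   # hypothetical bucket
            - --model-control-mode=poll       # reload models when the repository changes
            - --repository-poll-secs=30       # polling interval (assumed value)
          ports:
            - name: http
              containerPort: 8000
            - name: grpc
              containerPort: 8001
            - name: metrics
              containerPort: 8002
          resources:
            limits:
              nvidia.com/gpu: 1
```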
Text Generation Inference (TGI)
- HuggingFace native
- Quantization support (`bitsandbytes-nf4`)
- Simple deployment
Deployment manifests: resources/manifests.yaml
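For comparison, a TGI Deployment with NF4 quantization might look like the sketch below. The resource names, model placeholder, and port are assumptions; gated models would also need a Hugging Face token supplied through a Secret, which is left out here.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi                         # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest   # pin a tag in practice
          args:
            - --model-id=ORG/MODEL_NAME       # placeholder, replace with a real model ID
            - --quantize=bitsandbytes-nf4     # 4-bit NF4 quantization
            - --port=8080                     # assumed port
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1
```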
Helm Chart Pattern
# values.yaml structure
inference:
  enabled: true
  replicas: 2
  framework: "vllm"  # vllm, tgi, triton
  resources:
    limits:
      nvidia.com/gpu: 1
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
vectorDB:
  enabled: true
  type: "qdrant"
monitoring:
  enabled: true
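One way a chart template could consume these values is sketched below. It assumes `autoscaling` is nested under `inference`, and it introduces a hypothetical `inference.image` key that does not appear in the values structure above; the file path is also illustrative.

```yaml
# templates/deployment.yaml (hypothetical file)
{{- if .Values.inference.enabled }}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-inference
spec:
  {{- if not .Values.inference.autoscaling.enabled }}
  # Leave replicas unset when an autoscaler owns the replica count
  replicas: {{ .Values.inference.replicas }}
  {{- end }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}-inference
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-inference
    spec:
      containers:
        - name: {{ .Values.inference.framework }}
          image: {{ .Values.inference.image }}   # hypothetical key, not in the values above
          resources:
            {{- toYaml .Values.inference.resources | nindent 12 }}
{{- end }}
```

Omitting `replicas` when autoscaling is enabled avoids Helm resetting the count chosen by the autoscaler on every upgrade.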
Auto-Scaling
Horizontal Pod Autoscaler (HPA)
Scale on:
- GPU utilization (`DCGM_FI_DEV_GPU_UTIL`)
- Inference queue length
- Custom metrics
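A sketch of an HPA that scales on GPU utilization. It assumes a custom metrics adapter (for example Prometheus Adapter) already exposes `DCGM_FI_DEV_GPU_UTIL` as a per-pod metric; the HPA cannot read DCGM metrics on its own. The object names and the 80% target are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gpu                    # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm                      # hypothetical target Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL   # must be served by the custom metrics API
        target:
          type: AverageValue
          averageValue: "80"           # add replicas above ~80% average utilization
```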
KEDA Event-Driven Scaling
Scale on:
- Prometheus metrics
- Message queue depth (RabbitMQ, SQS)
- Custom external metrics
HPA/KEDA configs: resources/manifests.yaml
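A sketch of a KEDA `ScaledObject` driven by a Prometheus query on queue depth. The Prometheus address, the target Deployment name, and the threshold are assumptions; the query uses vLLM's waiting-requests gauge as an example and should be swapped for whatever queue metric the serving framework exports.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-queue                  # hypothetical name
spec:
  scaleTargetRef:
    name: vllm                      # hypothetical target Deployment
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        query: sum(vllm:num_requests_waiting)                  # example queue metric
        threshold: "10"             # target waiting requests per replica
```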
Networking
Ingress Configuration
- Rate limiting (nginx annotations)
- TLS with cert-manager
- Large body size for AI payloads
- Extended timeouts (300s+)
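The four points above map onto ingress-nginx annotations roughly as in this sketch. The hostname, issuer name, backend Service, and limit values are assumptions to adapt.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference                   # hypothetical name
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod          # assumed issuer name
    nginx.ingress.kubernetes.io/limit-rps: "10"               # per-client rate limit
    nginx.ingress.kubernetes.io/proxy-body-size: 50m          # allow large request bodies
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"     # seconds
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"     # seconds
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - inference.example.com     # placeholder host
      secretName: inference-tls     # created by cert-manager
  rules:
    - host: inference.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm          # hypothetical Service
                port:
                  number: 8000
```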
Network Policies
- Restrict pod-to-pod communication
- Allow only gateway → inference
- Permit DNS egress
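A sketch of a NetworkPolicy implementing these three rules. The `app: vllm` and `app: gateway` labels and port 8000 are assumptions. Note that this policy blocks all other egress, so pods that download models at startup would need an additional egress rule.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-restrict          # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: vllm                     # assumed label on inference pods
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: gateway          # assumed label on gateway pods
      ports:
        - protocol: TCP
          port: 8000
  egress:
    - ports:                        # DNS lookups only
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```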
Monitoring
Key Metrics
| Metric | Source | Purpose |
|---|---|---|
| GPU Utilization | DCGM Exporter | Scaling |
| Inference Latency | Prometheus | SLO |
| Tokens/Second | Custom | Throughput |
| Queue Length | App metrics | Scaling |
Setup
# Install DCGM Exporter
helm install dcgm-exporter nvidia/dcgm-exporter
# ServiceMonitor for Prometheus
# See resources/manifests.yaml
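A ServiceMonitor for the exporter might look like the sketch below. It assumes the Prometheus Operator CRDs are installed, that the exporter's Service carries the `app.kubernetes.io/name: dcgm-exporter` label and names its port `metrics`, and that Prometheus selects ServiceMonitors by a `release` label. Check each against the actual cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  labels:
    release: prometheus             # assumed; must match Prometheus' serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter   # assumed Service label
  endpoints:
    - port: metrics                 # assumed Service port name
      interval: 15s
```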
Managed Kubernetes
AWS EKS
- Instance types: `g5.2xlarge`, `p4d.24xlarge`
- AMI: `AL2_x86_64_GPU`
- GPU taints for isolation
Azure AKS
- VM sizes: `Standard_NC*`, `Standard_ND*`
- A100 support via `NC24ads_A100_v4`
OCI OKE
- Shapes: `BM.GPU.A100-v2.8`, `VM.GPU.A10`
- GPU node pools with taints
Terraform examples: ../terraform-iac/resources/modules.tf
Best Practices
Resource Management
- Always set GPU limits equal to requests (Kubernetes rejects a pod whose `nvidia.com/gpu` request differs from its limit)
- Use node selectors for GPU types
- Implement tolerations for GPU taints
- PVC for model caching
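For the model-caching point, a minimal PersistentVolumeClaim sketch. The name and size are assumptions, and the storage class is left to the cluster default.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache                 # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce                 # use ReadWriteMany if replicas on several nodes share it
  resources:
    requests:
      storage: 100Gi                # assumed size; budget for model weights plus headroom
```

Mounting this claim at the framework's download directory (for Hugging Face based servers, typically the cache directory under the container user's home) lets restarted pods skip the model download.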
High Availability
- Multiple replicas across zones
- Pod disruption budgets
- Readiness/liveness probes
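A PodDisruptionBudget sketch for the second point. The name and the `app: vllm` selector are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm                        # hypothetical name
spec:
  minAvailable: 1                   # keep at least one replica during node drains
  selector:
    matchLabels:
      app: vllm                     # assumed label on inference pods
```

Spreading replicas across zones is handled separately in the Deployment, for example with `topologySpreadConstraints` keyed on `topology.kubernetes.io/zone`.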
Cost Optimization
- Spot instances for dev/test
- Scale to zero when idle (KEDA supports this; a plain HPA keeps at least one replica by default)
- Right-size GPU instances
Resources
Deploy AI workloads at scale with GPU-optimized Kubernetes.