Agent skill
vllm-deployment
Deploy vLLM for high-performance LLM inference. Covers Docker CPU/GPU deployments and cloud VM provisioning with OpenAI-compatible API endpoints.
Install this agent skill to your Project
npx add-skill https://github.com/stakpak/community-paks/tree/main/vllm-deployment
Metadata
Additional technical details for this skill
- author
- Stakpak <team@stakpak.dev>
- version
- 1.0.3
SKILL.md
vLLM Model Serving and Inference
Quick Start
Docker (CPU)
docker run --rm -p 8000:8000 \
--shm-size=4g \
--cap-add SYS_NICE \
--security-opt seccomp=unconfined \
-e VLLM_CPU_KVCACHE_SPACE=4 \
<vllm-cpu-image> \
--model <model-name> \
--dtype float32
# Access: http://localhost:8000
Docker (GPU)
docker run --rm -p 8000:8000 \
--gpus all \
--shm-size=4g \
<vllm-gpu-image> \
--model <model-name>
# Access: http://localhost:8000
Docker Deployment
1. Assess Hardware Requirements
| Hardware | Minimum RAM | Recommended |
|---|---|---|
| CPU | 2x model size | 4x model size |
| GPU | Model size + 2GB | Model size + 4GB VRAM |
- Check model documentation for specific requirements
- Consider quantized variants to reduce memory footprint
- Allocate 50-100GB storage for model downloads
2. Pull the Container Image
# CPU image (check vLLM docs for latest tag)
docker pull <vllm-cpu-image>
# GPU image (check vLLM docs for latest tag)
docker pull <vllm-gpu-image>
Notes:
- Use CPU-specific images for CPU inference
- Use CUDA-enabled images matching your GPU architecture
- Verify CPU instruction set compatibility (AVX512, AVX2)
3. Configure and Run
CPU Deployment:
docker run --rm \
--shm-size=4g \
--cap-add SYS_NICE \
--security-opt seccomp=unconfined \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=4 \
-e VLLM_CPU_OMP_THREADS_BIND=0-7 \
<vllm-cpu-image> \
--model <model-name> \
--dtype float32 \
--max-model-len 2048
GPU Deployment:
docker run --rm \
--gpus all \
--shm-size=4g \
-p 8000:8000 \
<vllm-gpu-image> \
--model <model-name> \
--dtype auto \
--max-model-len 4096
4. Verify Deployment
# Check health
curl http://localhost:8000/health
# List models
curl http://localhost:8000/v1/models
# Test inference
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 10}'
5. Update
docker pull <vllm-image>
docker stop <container-id>
# Re-run with same parameters
Cloud VM Deployment
1. Provision Infrastructure
# Create security group with rules:
# - TCP 22 (SSH)
# - TCP 8000 (API)
# Launch instance with:
# - Sufficient RAM/VRAM for model
# - Docker pre-installed (or install manually)
# - 50-100GB root volume
# - Public IP for external access
2. Connect and Deploy
ssh -i <key-file> <user>@<instance-ip>
# Install Docker if not present
# Pull and run vLLM container (see Docker Deployment section)
3. Verify External Access
# From local machine
curl http://<instance-ip>:8000/health
curl http://<instance-ip>:8000/v1/models
4. Cleanup
# Stop container
docker stop <container-id>
# Terminate instance to stop costs
# Delete associated resources (volumes, security groups) if temporary
Configuration Reference
Environment Variables
| Variable | Purpose | Example |
|---|---|---|
VLLM_CPU_KVCACHE_SPACE |
KV cache size in GB (CPU) | 4 |
VLLM_CPU_OMP_THREADS_BIND |
CPU core binding (CPU) | 0-7 |
CUDA_VISIBLE_DEVICES |
GPU device selection | 0,1 |
HF_TOKEN |
HuggingFace authentication | hf_xxx |
Docker Flags
| Flag | Purpose |
|---|---|
--shm-size=4g |
Shared memory for IPC |
--cap-add SYS_NICE |
NUMA optimization (CPU) |
--security-opt seccomp=unconfined |
Memory policy syscalls (CPU) |
--gpus all |
GPU access |
-p 8000:8000 |
Port mapping |
vLLM Arguments
| Argument | Purpose | Example |
|---|---|---|
--model |
Model name/path | <model-name> |
--dtype |
Data type | float32, auto, bfloat16 |
--max-model-len |
Max context length | 2048 |
--tensor-parallel-size |
Multi-GPU parallelism | 2 |
API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
/health |
GET | Health check |
/v1/models |
GET | List available models |
/v1/completions |
POST | Text completion |
/v1/chat/completions |
POST | Chat completion |
/metrics |
GET | Prometheus metrics |
Production Checklist
- Verify model fits in available memory
- Configure appropriate data type for hardware
- Set up firewall/security group rules
- Test API endpoints before production use
- Configure monitoring (Prometheus metrics)
- Set up health check alerts
- Document model and configuration used
- Plan for model updates and rollbacks
Troubleshooting
| Issue | Solution |
|---|---|
| Container exits immediately | Increase RAM or use smaller model |
| Slow inference (CPU) | Verify OMP thread binding configuration |
| Connection refused externally | Check firewall/security group rules |
| Model download fails | Set HF_TOKEN for gated models |
| Out of memory during inference | Reduce max_model_len or batch size |
| Port already in use | Change host port mapping |
| Warmup takes too long | Normal for large models (1-5 min) |
References
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
how-to-publish-paks
A practical guide for creating and publishing high-quality Agent Skills (paks) to the Paks registry. Covers SKILL.md format, frontmatter structure, content writing best practices, validation, versioning, and publishing workflow.
beads-issue-tracker
Guide for using Beads (bd), a dependency-aware issue tracker for AI agents. Issues chained together like beads.
dockerization
Official Stakpak application containerization standard operating procedure, a step-by-step guidline to properly dockerize applications. This is a rule book curated by the Stakpak Team.
simple-deployment-on-vm
How to do simple but secure deployments using virtual machines on different cloud providers
migrating-bitnami-to-bitnami-legacy
This rule book helps you migrate Bitnami Helm charts and container images from the bitnami repository to the bitnamilegacy repository. This migration is necessary due to Bitnami's transition, effective August 28th, 2025, where existing images will be moved to the legacy repo
cloudflare-tunnel-ec2-deployment
Deploy containerized applications to AWS EC2 and expose them publicly via Cloudflare Tunnel with automatic HTTPS. Eliminates need for load balancers, SSL certificates, or public inbound ports.
Didn't find tool you were looking for?