Agent skill

vllm-deployment

Deploy vLLM for high-performance LLM inference. Covers Docker CPU/GPU deployments and cloud VM provisioning with OpenAI-compatible API endpoints.

View SKILL.md on GitHub Repository

Stars 3

Forks 0

Install this agent skill to your Project

npx add-skill https://github.com/stakpak/community-paks/tree/main/vllm-deployment

Metadata

Additional technical details for this skill

author: Stakpak <team@stakpak.dev>
version: 1.0.3

SKILL.md

vLLM Model Serving and Inference

Quick Start

Docker (CPU)

bash

docker run --rm -p 8000:8000 \
  --shm-size=4g \
  --cap-add SYS_NICE \
  --security-opt seccomp=unconfined \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  <vllm-cpu-image> \
  --model <model-name> \
  --dtype float32
# Access: http://localhost:8000

Docker (GPU)

bash

docker run --rm -p 8000:8000 \
  --gpus all \
  --shm-size=4g \
  <vllm-gpu-image> \
  --model <model-name>
# Access: http://localhost:8000

Docker Deployment

1. Assess Hardware Requirements

Hardware	Minimum RAM	Recommended
CPU	2x model size	4x model size
GPU	Model size + 2GB	Model size + 4GB VRAM

Check model documentation for specific requirements
Consider quantized variants to reduce memory footprint
Allocate 50-100GB storage for model downloads

2. Pull the Container Image

bash

# CPU image (check vLLM docs for latest tag)
docker pull <vllm-cpu-image>

# GPU image (check vLLM docs for latest tag)
docker pull <vllm-gpu-image>

Notes:

Use CPU-specific images for CPU inference
Use CUDA-enabled images matching your GPU architecture
Verify CPU instruction set compatibility (AVX512, AVX2)

3. Configure and Run

CPU Deployment:

bash

docker run --rm \
  --shm-size=4g \
  --cap-add SYS_NICE \
  --security-opt seccomp=unconfined \
  -p 8000:8000 \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-7 \
  <vllm-cpu-image> \
  --model <model-name> \
  --dtype float32 \
  --max-model-len 2048

GPU Deployment:

bash

docker run --rm \
  --gpus all \
  --shm-size=4g \
  -p 8000:8000 \
  <vllm-gpu-image> \
  --model <model-name> \
  --dtype auto \
  --max-model-len 4096

4. Verify Deployment

bash

# Check health
curl http://localhost:8000/health

# List models
curl http://localhost:8000/v1/models

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 10}'

5. Update

bash

docker pull <vllm-image>
docker stop <container-id>
# Re-run with same parameters

Cloud VM Deployment

1. Provision Infrastructure

bash

# Create security group with rules:
# - TCP 22 (SSH)
# - TCP 8000 (API)

# Launch instance with:
# - Sufficient RAM/VRAM for model
# - Docker pre-installed (or install manually)
# - 50-100GB root volume
# - Public IP for external access

2. Connect and Deploy

bash

ssh -i <key-file> <user>@<instance-ip>

# Install Docker if not present
# Pull and run vLLM container (see Docker Deployment section)

3. Verify External Access

bash

# From local machine
curl http://<instance-ip>:8000/health
curl http://<instance-ip>:8000/v1/models

4. Cleanup

bash

# Stop container
docker stop <container-id>

# Terminate instance to stop costs
# Delete associated resources (volumes, security groups) if temporary

Configuration Reference

Environment Variables

Variable	Purpose	Example
`VLLM_CPU_KVCACHE_SPACE`	KV cache size in GB (CPU)	`4`
`VLLM_CPU_OMP_THREADS_BIND`	CPU core binding (CPU)	`0-7`
`CUDA_VISIBLE_DEVICES`	GPU device selection	`0,1`
`HF_TOKEN`	HuggingFace authentication	`hf_xxx`

Docker Flags

Flag	Purpose
`--shm-size=4g`	Shared memory for IPC
`--cap-add SYS_NICE`	NUMA optimization (CPU)
`--security-opt seccomp=unconfined`	Memory policy syscalls (CPU)
`--gpus all`	GPU access
`-p 8000:8000`	Port mapping

vLLM Arguments

Argument	Purpose	Example
`--model`	Model name/path	`<model-name>`
`--dtype`	Data type	`float32`, `auto`, `bfloat16`
`--max-model-len`	Max context length	`2048`
`--tensor-parallel-size`	Multi-GPU parallelism	`2`

API Endpoints

Endpoint	Method	Purpose
`/health`	GET	Health check
`/v1/models`	GET	List available models
`/v1/completions`	POST	Text completion
`/v1/chat/completions`	POST	Chat completion
`/metrics`	GET	Prometheus metrics

Production Checklist

Verify model fits in available memory
Configure appropriate data type for hardware
Set up firewall/security group rules
Test API endpoints before production use
Configure monitoring (Prometheus metrics)
Set up health check alerts
Document model and configuration used
Plan for model updates and rollbacks

Troubleshooting

Issue	Solution
Container exits immediately	Increase RAM or use smaller model
Slow inference (CPU)	Verify OMP thread binding configuration
Connection refused externally	Check firewall/security group rules
Model download fails	Set HF_TOKEN for gated models
Out of memory during inference	Reduce max_model_len or batch size
Port already in use	Change host port mapping
Warmup takes too long	Normal for large models (1-5 min)

References

Maintainer

stakpak Core maintainer

Source details

Full Name: stakpak/community-paks
Branch: main
Path in repo: vllm-deployment

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

stakpak/community-paks

how-to-publish-paks

A practical guide for creating and publishing high-quality Agent Skills (paks) to the Paks registry. Covers SKILL.md format, frontmatter structure, content writing best practices, validation, versioning, and publishing workflow.

3 0

Explore

stakpak/community-paks

beads-issue-tracker

Guide for using Beads (bd), a dependency-aware issue tracker for AI agents. Issues chained together like beads.

3 0

Explore

stakpak/community-paks

dockerization

Official Stakpak application containerization standard operating procedure, a step-by-step guidline to properly dockerize applications. This is a rule book curated by the Stakpak Team.

3 0

Explore

stakpak/community-paks

simple-deployment-on-vm

How to do simple but secure deployments using virtual machines on different cloud providers

3 0

Explore

stakpak/community-paks

migrating-bitnami-to-bitnami-legacy

This rule book helps you migrate Bitnami Helm charts and container images from the bitnami repository to the bitnamilegacy repository. This migration is necessary due to Bitnami's transition, effective August 28th, 2025, where existing images will be moved to the legacy repo

3 0

Explore

stakpak/community-paks

cloudflare-tunnel-ec2-deployment

Deploy containerized applications to AWS EC2 and expose them publicly via Cloudflare Tunnel with automatic HTTPS. Eliminates need for load balancers, SSL certificates, or public inbound ports.

3 0

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

Metadata

SKILL.md

vLLM Model Serving and Inference

Quick Start

Docker (CPU)

Docker (GPU)

Docker Deployment

1. Assess Hardware Requirements

2. Pull the Container Image

3. Configure and Run

4. Verify Deployment

5. Update

Cloud VM Deployment

1. Provision Infrastructure

2. Connect and Deploy

3. Verify External Access

4. Cleanup

Configuration Reference

Environment Variables

Docker Flags

vLLM Arguments

API Endpoints

Production Checklist

Troubleshooting

References

Recommended Agent Skills

how-to-publish-paks

beads-issue-tracker

dockerization

simple-deployment-on-vm

migrating-bitnami-to-bitnami-legacy

cloudflare-tunnel-ec2-deployment