Agent skill
litellm
When calling LLM APIs from Python code. When connecting to llamafile or local LLM servers. When switching between OpenAI/Anthropic/local providers. When implementing retry/fallback logic for LLM calls. When code imports litellm or uses completion() patterns.
LiteLLM
Unified Python interface for calling 100+ LLM APIs in a consistent OpenAI format. It provides standardized exception handling, retry/fallback logic, and cost tracking across multiple providers.
When to Use This Skill
Use this skill when:
- Integrating with multiple LLM providers through a single interface
- Routing requests to local llamafile servers using OpenAI-compatible endpoints
- Implementing retry and fallback logic for LLM calls
- Building applications requiring consistent error handling across providers
- Tracking LLM usage costs across different providers
- Converting between provider-specific APIs and OpenAI format
- Deploying LLM proxy servers with unified configuration
- Testing applications against both cloud and local LLM endpoints
Core Capabilities
Provider Support
LiteLLM supports 100+ providers through a consistent OpenAI-style API:
- Cloud Providers: OpenAI, Anthropic, Google, Azure, AWS Bedrock
- Local Servers: llamafile, Ollama, LocalAI, vLLM
- Unified Format: All requests use OpenAI message format
- Exception Mapping: All provider errors map to OpenAI exception types
Key Features
- Unified API: Single completion() function for all providers
- Exception Handling: All exceptions inherit from OpenAI types
- Retry Logic: Built-in retry with configurable attempts
- Streaming Support: Sync and async streaming for all providers
- Cost Tracking: Automatic usage and cost calculation
- Proxy Mode: Deploy centralized LLM gateway
Installation
# Using pip
pip install litellm
# Using uv
uv add litellm
Llamafile Integration
Provider Configuration
All llamafile models MUST use the llamafile/ prefix for routing:
model = "llamafile/mistralai/mistral-7b-instruct-v0.2"
model = "llamafile/gemma-3-3b"
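The prefix rule can be enforced in code before a request is sent. The following is a minimal sketch, assuming a hypothetical helper named `ensure_llamafile_prefix`; it is not part of LiteLLM.

```python
LLAMAFILE_PREFIX = "llamafile/"


def ensure_llamafile_prefix(model: str) -> str:
    """Return the model name with a leading 'llamafile/' prefix.

    Hypothetical helper for illustration; LiteLLM does not ship this function.
    """
    model = model.strip()
    if not model:
        raise ValueError("model name must not be empty")
    if model.startswith(LLAMAFILE_PREFIX):
        return model  # already routable, leave untouched
    return LLAMAFILE_PREFIX + model


print(ensure_llamafile_prefix("gemma-3-3b"))
print(ensure_llamafile_prefix("llamafile/mistralai/mistral-7b-instruct-v0.2"))
```

Normalizing the name in one place keeps configuration files and call sites from drifting apart.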
API Base URL
The api_base MUST point to llamafile's OpenAI-compatible endpoint:
api_base = "http://localhost:8080/v1"
Critical Requirements:
- Include the /v1 suffix
- Do NOT add endpoint paths like /chat/completions (LiteLLM adds these automatically)
- Default llamafile port is 8080
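These requirements can be checked mechanically. The sketch below assumes a hypothetical helper, `normalize_api_base`, built only on the standard library; the list of endpoint suffixes it strips is an assumption for illustration, not something taken from LiteLLM.

```python
from urllib.parse import urlsplit, urlunsplit

# Endpoint paths assumed to be appended by the client library (illustrative list).
_ENDPOINT_SUFFIXES = ("/chat/completions", "/completions", "/embeddings")


def normalize_api_base(url: str) -> str:
    """Strip a trailing endpoint path and make sure the base URL ends with /v1."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    path = parts.path.rstrip("/")
    for suffix in _ENDPOINT_SUFFIXES:
        if path.endswith(suffix):
            path = path[: -len(suffix)]
            break
    if not path.endswith("/v1"):
        path += "/v1"
    # Query string and fragment are dropped on purpose.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(normalize_api_base("http://localhost:8080"))
print(normalize_api_base("http://localhost:8080/v1/chat/completions"))
```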
Environment Variable Configuration
import os
os.environ["LLAMAFILE_API_BASE"] = "http://localhost:8080/v1"
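An application that accepts the API base from several places needs a precedence order. The sketch below assumes one possible order (explicit argument, then `LLAMAFILE_API_BASE`, then the default port); `resolve_api_base` and `DEFAULT_API_BASE` are hypothetical names.

```python
import os
from typing import Mapping, Optional

# Default llamafile endpoint as described in this document.
DEFAULT_API_BASE = "http://localhost:8080/v1"


def resolve_api_base(explicit: Optional[str] = None,
                     env: Mapping[str, str] = os.environ) -> str:
    """Pick the API base: explicit argument, then LLAMAFILE_API_BASE, then the default."""
    if explicit:
        return explicit
    # An empty environment value is treated the same as an unset one.
    return env.get("LLAMAFILE_API_BASE") or DEFAULT_API_BASE


print(resolve_api_base(env={}))
print(resolve_api_base(env={"LLAMAFILE_API_BASE": "http://localhost:9000/v1"}))
```

Passing `env` as a parameter keeps the function easy to test without touching the real process environment.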
Basic Usage Patterns
Synchronous Completion
import litellm

response = litellm.completion(
    model="llamafile/mistralai/mistral-7b-instruct-v0.2",
    messages=[{"role": "user", "content": "Summarize this diff"}],
    api_base="http://localhost:8080/v1",
    temperature=0.2,
    max_tokens=80,
)
print(response.choices[0].message.content)
Asynchronous Completion
import asyncio

from litellm import acompletion


async def generate_message():
    response = await acompletion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Write a commit message"}],
        api_base="http://localhost:8080/v1",
        temperature=0.3,
        max_tokens=200,
    )
    return response.choices[0].message.content


result = asyncio.run(generate_message())
print(result)
Async Streaming
import asyncio

from litellm import acompletion


async def stream_response():
    response = await acompletion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        api_base="http://localhost:8080/v1",
        stream=True,
    )
    async for chunk in response:
        # Skip chunks whose delta carries no text
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()


asyncio.run(stream_response())
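When the full text is needed rather than incremental printing, the streamed pieces can be joined. The sketch below assumes chunks shaped like the ones in the streaming example (`choices[0].delta.content`); `collect_stream` and `fake_stream` are hypothetical names, and the fake stream only exists so the example runs without a server.

```python
import asyncio
from types import SimpleNamespace
from typing import AsyncIterator


async def collect_stream(chunks: AsyncIterator) -> str:
    """Join the non-empty delta.content pieces of a streamed response."""
    parts = []
    async for chunk in chunks:
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # content may be None or empty on some chunks
            parts.append(piece)
    return "".join(parts)


async def fake_stream(pieces):
    """Stand-in for a streamed response, used only to exercise collect_stream."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


text = asyncio.run(collect_stream(fake_stream(["Hel", "lo", None])))
print(text)
```

With a real response, the object returned by `acompletion(..., stream=True)` would take the place of `fake_stream(...)`.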
Embeddings
import os

from litellm import embedding

os.environ["LLAMAFILE_API_BASE"] = "http://localhost:8080/v1"

response = embedding(
    model="llamafile/sentence-transformers/all-MiniLM-L6-v2",
    input=["Hello world"],
)
print(response)
Exception Handling
Import Pattern
All exceptions can be imported directly from litellm:
from litellm import (
    BadRequestError,          # 400 errors
    AuthenticationError,      # 401 errors
    NotFoundError,            # 404 errors
    Timeout,                  # 408 errors (inherits from openai.APITimeoutError)
    RateLimitError,           # 429 errors
    APIConnectionError,       # 500 errors / connection issues (default)
    ServiceUnavailableError,  # 503 errors
)
Exception Types Reference
| Status Code | Exception Type | Inherits from | Description |
|---|---|---|---|
| 400 | BadRequestError | openai.BadRequestError | Invalid request |
| 400 | ContextWindowExceededError | litellm.BadRequestError | Token limit exceeded |
| 400 | ContentPolicyViolationError | litellm.BadRequestError | Content policy violation |
| 401 | AuthenticationError | openai.AuthenticationError | Auth failure |
| 403 | PermissionDeniedError | openai.PermissionDeniedError | Permission denied |
| 404 | NotFoundError | openai.NotFoundError | Invalid model/endpoint |
| 408 | Timeout | openai.APITimeoutError | Request timeout |
| 429 | RateLimitError | openai.RateLimitError | Rate limited |
| 500 | APIConnectionError | openai.APIConnectionError | Default for unmapped errors |
| 500 | APIError | openai.APIError | Generic 500 error |
| 503 | ServiceUnavailableError | openai.APIStatusError | Service unavailable |
| >=500 | InternalServerError | openai.InternalServerError | Unmapped 500+ errors |
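Because ContextWindowExceededError inherits from BadRequestError, the more specific `except` clause has to come first or it will never run. The sketch below uses local stand-in classes with the same names so that it runs without LiteLLM installed; real code would import the classes from litellm, and `classify` is a hypothetical function.

```python
class BadRequestError(Exception):
    """Stand-in for litellm.BadRequestError (illustration only)."""


class ContextWindowExceededError(BadRequestError):
    """Stand-in for the more specific subclass of BadRequestError."""


def classify(exc: Exception) -> str:
    """Show how clause order decides which handler sees a subclass."""
    try:
        raise exc
    except ContextWindowExceededError:
        # Must precede BadRequestError, otherwise the parent clause wins.
        return "shorten the prompt"
    except BadRequestError:
        return "fix the request"
    except Exception:
        return "unexpected"


print(classify(ContextWindowExceededError("too many tokens")))
print(classify(BadRequestError("missing field")))
```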
Exception Attributes
All LiteLLM exceptions include:
- status_code: HTTP status code
- message: Error message
- llm_provider: Provider that raised the exception
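These attributes are convenient for structured logging. The sketch below reads them defensively with `getattr`, so it also works for exceptions that lack them; `describe_error` and `FakeProviderError` are hypothetical names, and the fake class only mimics the attribute layout described here.

```python
def describe_error(exc: Exception) -> dict:
    """Collect status_code, message and llm_provider, tolerating missing attributes."""
    return {
        "type": type(exc).__name__,
        "status_code": getattr(exc, "status_code", None),
        "message": getattr(exc, "message", str(exc)),
        "llm_provider": getattr(exc, "llm_provider", None),
    }


class FakeProviderError(Exception):
    """Hypothetical stand-in that carries the three attributes."""

    def __init__(self, message: str, status_code: int, llm_provider: str):
        super().__init__(message)
        self.message = message
        self.status_code = status_code
        self.llm_provider = llm_provider


print(describe_error(FakeProviderError("rate limited", 429, "llamafile")))
print(describe_error(ValueError("plain error")))
```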
Exception Handling Example
import litellm
import openai

try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
        timeout=30.0,
    )
except openai.APITimeoutError as e:
    # LiteLLM exceptions inherit from OpenAI types
    print(f"Timeout: {e}")
except litellm.APIConnectionError as e:
    print(f"Connection failed: {e.message}")
    print(f"Provider: {e.llm_provider}")
Alternative Import from litellm.exceptions
import litellm
from litellm.exceptions import BadRequestError, AuthenticationError, APIError

try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
    )
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except BadRequestError as e:
    print(f"Bad request: {e}")
except APIError as e:
    print(f"API error: {e}")
Checking If Exception Should Retry
import litellm

try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
    )
except Exception as e:
    # Only exceptions that carry an HTTP status code can be classified
    if hasattr(e, 'status_code'):
        should_retry = litellm._should_retry(e.status_code)
        print(f"Should retry: {should_retry}")
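A retry decision is usually paired with a backoff loop. The sketch below is a standalone illustration and not LiteLLM's implementation: `is_retryable`, `call_with_retries`, and the set of retryable status codes are assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed retryable codes for this sketch: timeout, conflict, rate limit.
RETRYABLE_STATUS = {408, 409, 429}


def is_retryable(status_code: int) -> bool:
    """Local stand-in for a retry decision; server errors (>=500) also qualify."""
    return status_code in RETRYABLE_STATUS or status_code >= 500


def call_with_retries(call: Callable[[], T], attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying with exponential backoff on retryable status codes."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            last_attempt = attempt == attempts - 1
            if status is None or not is_retryable(status) or last_attempt:
                raise  # give up: unknown error, non-retryable code, or out of attempts
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

In an application, `call` would typically be a zero-argument lambda wrapping the `litellm.completion(...)` request.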
Retry and Fallback Configuration
from litellm import completion

response = completion(
    model="llamafile/gemma-3-3b",
    messages=[{"role": "user", "content": "Hello"}],
    api_base="http://localhost:8080/v1",
    num_retries=3,  # Retry 3 times on failure
    timeout=30.0,   # 30 second timeout
)
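The example above covers retries against a single model. Fallback across models can be sketched as a plain loop. This is an illustration only: `complete_with_fallback` is a hypothetical helper, the model names in the demo are placeholders, and the sketch does not use any fallback feature built into LiteLLM.

```python
from typing import Callable, Sequence


def complete_with_fallback(models: Sequence[str],
                           call: Callable[[str], str]) -> str:
    """Try each model in order and return the first successful result."""
    if not models:
        raise ValueError("at least one model is required")
    errors = []
    for model in models:
        try:
            return call(model)
        except Exception as exc:
            # Remember why this model failed, then move on to the next one.
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def demo_call(model: str) -> str:
    """Fake provider call: the first model is 'down', the second answers."""
    if model == "llamafile/primary":
        raise ConnectionError("server not reachable")
    return f"answer from {model}"


print(complete_with_fallback(["llamafile/primary", "llamafile/backup"], demo_call))
```

In real use, `call` would wrap the `completion(model=model, ...)` request, possibly with a different `api_base` per model.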
Proxy Server Configuration
For proxy deployments, use config.yaml:
model_list:
  - model_name: commit-polish-model
    litellm_params:
      model: llamafile/gemma-3-3b         # add llamafile/ prefix
      api_base: http://localhost:8080/v1  # add api base for OpenAI compatible provider
Application Integration Patterns
Connection Verification Pattern
import litellm
from litellm import APIConnectionError


def verify_llamafile_connection(api_base: str = "http://localhost:8080/v1") -> bool:
    """Check if llamafile server is running."""
    try:
        litellm.completion(
            model="llamafile/test",
            messages=[{"role": "user", "content": "test"}],
            api_base=api_base,
            max_tokens=1,
        )
        return True
    except APIConnectionError:
        return False
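A cheaper pre-check is to test whether anything is listening on the port before spending a completion request. The sketch below uses only the standard library; `server_is_listening` is a hypothetical helper, and a successful TCP connection only shows that some process accepts connections there, not that it is a llamafile server.

```python
import socket
from urllib.parse import urlsplit


def server_is_listening(api_base: str = "http://localhost:8080/v1",
                        timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the api_base host and port succeeds."""
    parts = urlsplit(api_base)
    host = parts.hostname or "localhost"
    # Fall back to the scheme's default port when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The completion-based check above remains the stronger test, since it exercises the actual endpoint.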
Async Service Pattern
from litellm import acompletion, APIConnectionError


class AIService:
    """LiteLLM wrapper with llamafile routing."""

    def __init__(self, model: str, api_base: str, temperature: float = 0.3, max_tokens: int = 200):
        self.model = model
        self.api_base = api_base
        self.temperature = temperature
        self.max_tokens = max_tokens

    async def generate_commit_message(self, diff: str, system_prompt: str) -> str:
        """Generate a commit message using the LLM."""
        try:
            response = await acompletion(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": f"Generate a commit message for this diff:\n\n{diff}"},
                ],
                api_base=self.api_base,
                temperature=self.temperature,
                max_tokens=self.max_tokens,
            )
            return response.choices[0].message.content.strip()
        except APIConnectionError as e:
            # Chain the original error so the traceback keeps the root cause
            raise RuntimeError(f"Failed to connect to llamafile server at {self.api_base}: {e.message}") from e
Common Pitfalls to Avoid
- Missing llamafile/ prefix: Without the prefix, LiteLLM won't route to the OpenAI-compatible endpoint
- Wrong port: Llamafile uses 8080 by default, not 8000
- Missing /v1 suffix: API base must end with /v1
- Adding extra path segments: Do NOT use http://localhost:8080/v1/chat/completions, because LiteLLM adds the endpoint path automatically
- API key requirement: No API key needed for local llamafile (use empty string or any value if required by validation)
Configuration Examples
TOML Configuration
# ~/.config/commit-polish/config.toml
[ai]
model = "llamafile/gemma-3-3b" # MUST have llamafile/ prefix
temperature = 0.3
max_tokens = 200
Environment Variables
export LLAMAFILE_API_BASE="http://localhost:8080/v1"
export LITELLM_LOG="INFO"  # LiteLLM log level; set to "DEBUG" for verbose debug output
Related Skills
For comprehensive documentation on related tools:
- llamafile: Activate the llamafile skill using Skill(command: "llamafile:llamafile") for llamafile server setup, model management, and local LLM deployment patterns
- uv: Activate the uv skill using Skill(command: "python3-development:uv") for Python project management, dependency handling, and virtual environment workflows
References
Version Information
- Documentation verified against: LiteLLM GitHub repository (main branch, accessed 2025-01-15)
- Python: 3.11+
- Llamafile: 0.9.3+