Agent skill
llm-integration
LLM integration patterns for function calling, streaming responses, local inference with Ollama, and fine-tuning customization. Use when implementing tool use, SSE streaming, local model deployment, LoRA/QLoRA fine-tuning, or multi-provider LLM APIs.
Install this agent skill to your Project
npx add-skill https://github.com/yonatangross/orchestkit/tree/main/plugins/ork/skills/llm-integration
Metadata
- category: mcp-enhancement
SKILL.md
LLM Integration
Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has its own rule files in rules/, loaded on demand.
Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Function Calling | 3 | CRITICAL | Tool definitions, parallel execution, input validation |
| Streaming | 3 | HIGH | SSE endpoints, structured streaming, backpressure handling |
| Local Inference | 3 | HIGH | Ollama setup, model selection, GPU optimization |
| Fine-Tuning | 3 | HIGH | LoRA/QLoRA training, dataset preparation, evaluation |
| Context Optimization | 2 | HIGH | Window management, compression, caching, budget scaling |
| Evaluation | 2 | HIGH | LLM-as-judge, RAGAS metrics, quality gates, benchmarks |
| Prompt Engineering | 4 | HIGH | CoT, few-shot, versioning, DSPy optimization, ReAct, cost optimization |
Total: 20 rules across 7 categories
Quick Start
# Function calling: strict mode tool definition
tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search knowledge base",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"}
            },
            "required": ["query", "limit"],
            "additionalProperties": False
        }
    }
}]
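# Streaming: SSE endpoint with FastAPI (EventSourceResponse comes from sse-starlette)
from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

@app.get("/chat/stream")
async def stream_chat(prompt: str):
    async def generate():
        # async_stream is your own async generator that yields model tokens
        async for token in async_stream(prompt):
            yield {"event": "token", "data": token}
        yield {"event": "done", "data": ""}
    return EventSourceResponse(generate())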
# Local inference: Ollama with LangChain
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,
)
# Fine-tuning: QLoRA with Unsloth
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
Function Calling
Enable LLMs to use external tools and return structured data. Use strict mode schemas (2026 best practice) for reliability. Limit to 5-15 tools per request, validate all inputs with Pydantic/Zod, and return errors as tool results.
- calling-tool-definition.md -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding
- calling-parallel.md -- Parallel tool execution, asyncio.gather, strict mode constraints
- calling-validation.md -- Input validation, error handling, tool execution loops
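The advice above (validate every input, run independent calls in parallel, return errors as tool results) can be sketched roughly as follows. This is a minimal illustration using only the standard library: the `TOOLS` registry, the `validate_args` helper, and the stub `search_documents` handler are hypothetical names, the hand-rolled type check stands in for Pydantic, and the `role`/`tool_call_id`/`content` message shape assumes an OpenAI-style chat API.

```python
import asyncio
import json

async def search_documents(query: str, limit: int) -> list:
    # Hypothetical handler standing in for a real knowledge-base lookup.
    return [f"result {i} for {query!r}" for i in range(limit)]

# Hypothetical registry: tool name -> (argument name -> type, async handler).
TOOLS = {"search_documents": ({"query": str, "limit": int}, search_documents)}

def validate_args(schema: dict, args: dict) -> None:
    """Reject missing, extra, or mistyped arguments, mirroring strict mode."""
    if not isinstance(args, dict) or set(args) != set(schema):
        raise ValueError(f"expected arguments {sorted(schema)}")
    for name, expected in schema.items():
        if not isinstance(args[name], expected):
            raise ValueError(f"argument {name!r} must be {expected.__name__}")

async def run_tool_call(call: dict) -> dict:
    """Run one tool call; failures become tool results instead of exceptions."""
    try:
        if call["name"] not in TOOLS:
            raise ValueError(f"unknown tool {call['name']!r}")
        schema, handler = TOOLS[call["name"]]
        args = json.loads(call["arguments"])  # JSONDecodeError is a ValueError
        validate_args(schema, args)
        content = json.dumps(await handler(**args))
    except ValueError as exc:
        content = json.dumps({"error": str(exc)})
    return {"role": "tool", "tool_call_id": call["id"], "content": content}

async def run_tool_calls(calls: list) -> list:
    # gather preserves input order, so results line up with the calls.
    return await asyncio.gather(*(run_tool_call(c) for c in calls))
```

Returning the error as the tool message content lets the model see what went wrong and retry with corrected arguments, rather than aborting the whole loop.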
Streaming
Deliver LLM responses in real time for better UX. Use SSE for one-way streaming to web clients and WebSocket when the connection must be bidirectional. Handle backpressure with bounded queues.
- streaming-sse.md -- FastAPI SSE endpoints, frontend consumers, async iterators
- streaming-structured.md -- Streaming with tool calls, partial JSON parsing, chunk accumulation
- streaming-backpressure.md -- Backpressure handling, bounded buffers, cancellation
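One way to picture backpressure with a bounded queue is the sketch below. It is not taken from the rule files: `buffered_stream` and `fake_token_source` are hypothetical names, and the default buffer of 100 is simply a value inside the 50-200 token range recommended in Key Decisions. When the consumer is slow, `queue.put` suspends the producer instead of letting the buffer grow without limit.

```python
import asyncio
from collections.abc import AsyncIterator

_DONE = object()  # sentinel marking the end of the stream

async def fake_token_source(n: int) -> AsyncIterator[str]:
    # Hypothetical stand-in for an LLM token stream.
    for i in range(n):
        yield f"tok{i}"

async def buffered_stream(
    source: AsyncIterator[str], max_buffer: int = 100
) -> AsyncIterator[str]:
    """Relay tokens through a bounded queue so a slow consumer slows the producer."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffer)

    async def produce() -> None:
        try:
            async for token in source:
                await queue.put(token)  # suspends while the queue is full
            await queue.put(_DONE)
        except Exception as exc:  # cancellation is not an Exception, so it propagates
            await queue.put(exc)

    producer = asyncio.create_task(produce())
    try:
        while True:
            item = await queue.get()
            if item is _DONE:
                break
            if isinstance(item, Exception):
                raise item
            yield item
    finally:
        # If the consumer stops early (client disconnect), stop producing too.
        producer.cancel()
```

Wrapping the generator passed to an SSE response in something like `buffered_stream(...)` keeps memory bounded per connection; cancelling the producer in `finally` covers the case where the client disconnects mid-stream.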
Local Inference
Run LLMs locally with Ollama for cost savings (93% vs cloud), privacy, and offline development. Pre-warm models before first use, and use a provider factory to switch between cloud and local backends.
- local-ollama-setup.md -- Installation, model pulling, environment configuration
- local-model-selection.md -- Model comparison by task, hardware profiles, quantization
- local-gpu-optimization.md -- Apple Silicon tuning, keep-alive, CI integration
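A provider factory and a pre-warm request might look something like the sketch below. The environment variable names (`LLM_PROVIDER`, `OLLAMA_BASE_URL`, `LOCAL_MODEL`, `CLOUD_BASE_URL`, `CLOUD_MODEL`) and the `ProviderConfig` type are assumptions made for illustration, not part of this skill. The pre-warm body relies on Ollama's documented behavior that a `POST /api/generate` with a model and no prompt loads the model into memory, with `keep_alive` controlling how long it stays loaded; the request is only built here, not sent.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class ProviderConfig:
    name: str
    base_url: str
    model: str

def make_provider(env: Optional[Mapping[str, str]] = None) -> ProviderConfig:
    """Pick local Ollama or a cloud endpoint from (hypothetical) env variables."""
    env = os.environ if env is None else env
    if env.get("LLM_PROVIDER", "local") == "local":
        return ProviderConfig(
            name="ollama",
            base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
            model=env.get("LOCAL_MODEL", "deepseek-r1:70b"),
        )
    # Cloud settings have no sensible default, so a missing key raises KeyError.
    return ProviderConfig("cloud", env["CLOUD_BASE_URL"], env["CLOUD_MODEL"])

def prewarm_request(cfg: ProviderConfig, keep_alive: str = "30m") -> tuple:
    """Build (url, JSON body) that asks Ollama to load the model ahead of time."""
    return f"{cfg.base_url}/api/generate", {"model": cfg.model, "keep_alive": keep_alive}
```

Keeping the choice behind one function means application code depends on `ProviderConfig` only, so switching between local development and a cloud deployment is a configuration change.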
Fine-Tuning
Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Requires 1000+ quality examples.
- tuning-lora.md -- LoRA/QLoRA configuration, Unsloth training, adapter merging
- tuning-dataset-prep.md -- Synthetic data generation, quality validation, deduplication
- tuning-evaluation.md -- DPO alignment, evaluation metrics, anti-patterns
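To see why these techniques count as parameter-efficient, it helps to work through the arithmetic. LoRA freezes a weight matrix of shape d_out x d_in and trains two small factors, A (r x d_in) and B (d_out x r), so it adds r * (d_in + d_out) trainable parameters per adapted matrix. The 4096 x 4096 matrix below is an illustrative size, not a statement about any specific layer in this skill's examples.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r

def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen base matrix."""
    return d_in * d_out

if __name__ == "__main__":
    d = 4096  # illustrative square projection size
    for r in (16, 64):  # the ends of the "16-64 typical" rank range
        added = lora_params(d, d, r)
        share = added / full_params(d, d)
        print(f"r={r}: {added:,} trainable vs {full_params(d, d):,} frozen ({share:.2%})")
```

Even at rank 64 the adapter stays at a few percent of the size of the matrix it modifies, which is what makes training on a single GPU with a 4-bit base model practical.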
Context Optimization
Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task.
- context-window-management.md -- Five-layer architecture, anchored summarization, compression triggers
- context-caching.md -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+
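A compression trigger can be expressed as a small check. The sketch below assumes the thresholds listed in Key Decisions (trigger at 70% utilization, compress down to 50%); the function name and signature are hypothetical, and how the history is actually summarized is left out.

```python
from typing import Optional

def compression_target(
    used_tokens: int,
    window: int,
    trigger: float = 0.70,
    target: float = 0.50,
) -> Optional[int]:
    """Return the token budget to compress down to, or None below the trigger."""
    if window <= 0:
        raise ValueError("window must be positive")
    if used_tokens / window < trigger:
        return None  # enough headroom, leave the context alone
    return int(window * target)
```

For a 32768-token window, 20000 used tokens stay below the trigger, while 24000 used tokens cross it and ask the caller to compress down to half the window.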
Evaluation
Evaluate LLM outputs with multi-dimension scoring, quality gates, and benchmarks.
- evaluation-metrics.md -- LLM-as-judge, RAGAS metrics, hallucination detection
- evaluation-benchmarks.md -- Quality gates, batch evaluation, pairwise comparison
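A quality gate over multi-dimension scores could be sketched as below. The thresholds come from Key Decisions (0.7 for production, 0.6 for drafts); the dimension names and the choice to aggregate with an unweighted mean are assumptions, and the scores themselves would come from a judge model that is not shown here.

```python
from statistics import mean

# Thresholds from Key Decisions; the stage names are illustrative.
THRESHOLDS = {"production": 0.7, "draft": 0.6}

def passes_gate(scores: dict, stage: str = "production") -> bool:
    """Average per-dimension scores in [0, 1] and compare with the stage threshold."""
    if not scores:
        raise ValueError("at least one dimension score is required")
    return mean(scores.values()) >= THRESHOLDS[stage]

# Hypothetical judge output for one response.
example = {"relevance": 0.7, "faithfulness": 0.6, "completeness": 0.65}
```

With the example scores the mean is about 0.65, so the response would be acceptable as a draft but blocked from production. A stricter gate might also require every individual dimension to clear a floor.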
Prompt Engineering
Design, version, and optimize prompts for production LLM applications.
- prompt-design.md -- Chain-of-Thought, few-shot learning, pattern selection guide
- prompt-testing.md -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency
- prompt-react-pattern.md -- ReAct loop for tool-using agents, thought-action-observation format
- prompt-optimization.md -- Token reduction, cost optimization, model tiering, prompt spec format
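Of the techniques listed, self-consistency is the easiest to show in a few lines: sample several answers to the same prompt and keep the one that appears most often. The sketch below covers only the voting step; the function name is hypothetical, the sampled answers are assumed to come from repeated model calls at non-zero temperature, and the whitespace and case normalization is a simplification.

```python
from collections import Counter

def self_consistent_answer(answers: list) -> tuple:
    """Return (most common answer, share of samples that agreed with it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)
```

The agreement share doubles as a rough confidence signal: a low share suggests the prompt is ambiguous or the task is too hard for the chosen model tier.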
Key Decisions
| Decision | Recommendation |
|---|---|
| Tool schema mode | strict: true (2026 best practice) |
| Tool count | 5-15 max per request |
| Streaming protocol | SSE for web, WebSocket for bidirectional |
| Buffer size | 50-200 tokens |
| Local model (reasoning) | deepseek-r1:70b |
| Local model (coding) | qwen2.5-coder:32b |
| Fine-tuning approach | LoRA/QLoRA (try prompting first) |
| LoRA rank | 16-64 typical |
| Training epochs | 1-3 (more risks overfitting) |
| Context compression | Anchored iterative (60-80%) |
| Compress trigger | 70% utilization, target 50% |
| Judge model | GPT-5.2-mini or Haiku 4.5 |
| Quality threshold | 0.7 production, 0.6 drafts |
| Few-shot examples | 3-5 diverse, representative |
| Prompt versioning | Langfuse with labels |
| Auto-optimization | DSPy MIPROv2 |
Related Skills
- ork:rag-retrieval -- Embedding patterns, when RAG is better than fine-tuning
- agent-loops -- Multi-step tool use with reasoning
- llm-evaluation -- Evaluate fine-tuned and local models
- langfuse-observability -- Track training experiments
Capability Details
function-calling
Keywords: tool, function, define tool, tool schema, function schema, strict mode, parallel tools
Solves:
- Define tools with clear descriptions and strict schemas
- Execute tool calls in parallel with asyncio.gather
- Validate inputs and handle errors in tool execution loops
streaming
Keywords: streaming, SSE, Server-Sent Events, real-time, backpressure, token stream
Solves:
- Stream LLM tokens via SSE endpoints
- Handle tool calls within streams
- Manage backpressure with bounded queues
local-inference
Keywords: Ollama, local, self-hosted, model selection, GPU, Apple Silicon
Solves:
- Set up Ollama for local LLM inference
- Select models based on task and hardware
- Optimize GPU usage and CI integration
fine-tuning
Keywords: LoRA, QLoRA, fine-tune, DPO, synthetic data, PEFT, alignment
Solves:
- Configure LoRA/QLoRA for parameter-efficient training
- Generate and validate synthetic training data
- Align models with DPO and evaluate results