Agent skill

llm-integration

LLM integration patterns for function calling, streaming responses, local inference with Ollama, and fine-tuning customization. Use when implementing tool use, SSE streaming, local model deployment, LoRA/QLoRA fine-tuning, or multi-provider LLM APIs.

Install this agent skill to your Project

npx add-skill https://github.com/yonatangross/orchestkit/tree/main/plugins/ork/skills/llm-integration

Metadata

Additional technical details for this skill

category: mcp-enhancement

SKILL.md

LLM Integration

Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has individual rule files in rules/ loaded on-demand.

Quick Reference

Category	Rules	Impact	When to Use
Function Calling	3	CRITICAL	Tool definitions, parallel execution, input validation
Streaming	3	HIGH	SSE endpoints, structured streaming, backpressure handling
Local Inference	3	HIGH	Ollama setup, model selection, GPU optimization
Fine-Tuning	3	HIGH	LoRA/QLoRA training, dataset preparation, evaluation
Context Optimization	2	HIGH	Window management, compression, caching, budget scaling
Evaluation	2	HIGH	LLM-as-judge, RAGAS metrics, quality gates, benchmarks
Prompt Engineering	4	HIGH	CoT, few-shot, versioning, DSPy optimization, ReAct, cost optimization

Total: 20 rules across 7 categories

Quick Start

python

# Function calling: strict mode tool definition
tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search knowledge base",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"}
            },
            "required": ["query", "limit"],
            "additionalProperties": False
        }
    }
}]

python

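# Streaming: SSE endpoint with FastAPI (EventSourceResponse comes from sse-starlette)
from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

# async_stream(prompt) is assumed to be an async generator of tokens defined elsewhere
@app.get("/chat/stream")
async def stream_chat(prompt: str):
    async def generate():
        async for token in async_stream(prompt):
            yield {"event": "token", "data": token}
        # Explicit end-of-stream event so the client can close the connection
        yield {"event": "done", "data": ""}
    return EventSourceResponse(generate())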

python

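# Local inference: Ollama with LangChain (requires the langchain-ollama package)
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",  # default Ollama address
    temperature=0.0,
    num_ctx=32768,  # context window in tokens
)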

python

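# Fine-tuning: QLoRA with Unsloth
from unsloth import FastLanguageModel

# load_in_4bit=True loads the base model quantized to 4 bits (the "Q" in QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048, load_in_4bit=True,
)
# Attach LoRA adapters; only the adapter weights are trained
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)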

Function Calling

Enable LLMs to use external tools and return structured data. Use strict mode schemas (2026 best practice) for reliability. Limit each request to 5-15 tools, validate all inputs with Pydantic/Zod, and return errors to the model as tool results instead of raising them.

calling-tool-definition.md -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding
calling-parallel.md -- Parallel tool execution, asyncio.gather, strict mode constraints
calling-validation.md -- Input validation, error handling, tool execution loops
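The validation and error-handling advice above can be sketched as follows. This is a minimal illustration only: it swaps Pydantic for a standard-library dataclass to stay dependency-free, and `SearchArgs`, `search_documents`, `run_tool_call`, and the 1-50 bound on `limit` are hypothetical names and values, not part of the skill.

```python
import json
from dataclasses import dataclass


@dataclass
class SearchArgs:
    """Arguments for the search_documents tool from the Quick Start schema."""
    query: str
    limit: int

    def __post_init__(self) -> None:
        # Reject wrong types and out-of-range values before any side effects.
        if not isinstance(self.query, str) or not self.query.strip():
            raise ValueError("query must be a non-empty string")
        if isinstance(self.limit, bool) or not isinstance(self.limit, int):
            raise ValueError("limit must be an integer")
        if not 1 <= self.limit <= 50:  # hypothetical bound, not from the skill
            raise ValueError("limit must be between 1 and 50")


def search_documents(args: SearchArgs) -> list[str]:
    # Hypothetical stand-in for a real knowledge-base lookup.
    return [f"doc-{i} for {args.query!r}" for i in range(args.limit)]


def run_tool_call(raw_arguments: str) -> dict:
    """Validate model-supplied JSON arguments and always return a tool result."""
    try:
        args = SearchArgs(**json.loads(raw_arguments))
        return {"ok": True, "result": search_documents(args)}
    except (json.JSONDecodeError, TypeError, ValueError) as exc:
        # The error goes back to the model as a tool result so it can retry.
        return {"ok": False, "error": str(exc)}
```

Returning the failure as data keeps the tool execution loop running and gives the model a chance to correct its arguments on the next turn.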

Streaming

Deliver LLM responses in real time for better UX. Use SSE for one-way delivery to web clients and WebSocket when the client also needs to send messages mid-stream. Handle backpressure with bounded queues.

streaming-sse.md -- FastAPI SSE endpoints, frontend consumers, async iterators
streaming-structured.md -- Streaming with tool calls, partial JSON parsing, chunk accumulation
streaming-backpressure.md -- Backpressure handling, bounded buffers, cancellation
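Backpressure with a bounded queue might look like the sketch below. It uses only `asyncio`; the function names and the sentinel convention are assumptions for illustration, and the default buffer of 100 is simply a value inside the 50-200 token range listed under Key Decisions.

```python
import asyncio


async def produce(queue: asyncio.Queue, tokens: list[str]) -> None:
    for token in tokens:
        # put() suspends when the queue is full, so a slow consumer
        # throttles the producer instead of letting memory grow.
        await queue.put(token)
    await queue.put(None)  # sentinel: end of stream


async def consume(queue: asyncio.Queue) -> list[str]:
    received = []
    while (token := await queue.get()) is not None:
        await asyncio.sleep(0)  # stand-in for a slow network write
        received.append(token)
    return received


async def stream_with_backpressure(tokens: list[str], max_buffer: int = 100) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffer)
    producer = asyncio.create_task(produce(queue, tokens))
    try:
        return await consume(queue)
    finally:
        # Cancel the producer if the consumer stops early (client disconnect);
        # this is a no-op when the producer has already finished.
        producer.cancel()
```

For example, `asyncio.run(stream_with_backpressure(["a", "b", "c"], max_buffer=1))` delivers all three tokens in order while never holding more than one in the buffer.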

Local Inference

Run LLMs locally with Ollama for cost savings (93% vs cloud), privacy, and offline development. Pre-warm models before first use, and use a provider factory to switch between cloud and local backends.

local-ollama-setup.md -- Installation, model pulling, environment configuration
local-model-selection.md -- Model comparison by task, hardware profiles, quantization
local-gpu-optimization.md -- Apple Silicon tuning, keep-alive, CI integration
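One possible shape for a provider factory that switches between cloud and local backends is sketched here. Everything in it is an assumption for illustration: the `LLM_PROVIDER` environment variable, the `ProviderConfig` type, and the cloud URL and model name are placeholders, not values from the skill.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderConfig:
    name: str
    base_url: str
    model: str


# Hypothetical registry; the cloud URL and model name are placeholders.
PROVIDERS = {
    "local": ProviderConfig("local", "http://localhost:11434", "deepseek-r1:70b"),
    "cloud": ProviderConfig("cloud", "https://api.example.com/v1", "example-cloud-model"),
}


def get_provider(env=None) -> ProviderConfig:
    """Pick a provider from LLM_PROVIDER (hypothetical variable), defaulting to local."""
    env = os.environ if env is None else env
    choice = env.get("LLM_PROVIDER", "local").lower()
    if choice not in PROVIDERS:
        raise ValueError(f"unknown provider {choice!r}; expected one of {sorted(PROVIDERS)}")
    return PROVIDERS[choice]
```

Keeping the choice in one function means application code can build its client from the returned config without knowing which backend is active.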

Fine-Tuning

Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Fine-tuning requires at least 1,000 high-quality examples.

tuning-lora.md -- LoRA/QLoRA configuration, Unsloth training, adapter merging
tuning-dataset-prep.md -- Synthetic data generation, quality validation, deduplication
tuning-evaluation.md -- DPO alignment, evaluation metrics, anti-patterns
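As a small illustration of the deduplication step in dataset preparation, the sketch below drops exact duplicates after light normalization. The `prompt`/`completion` record shape is an assumption, and near-duplicate detection (for example, embedding similarity) is out of scope here.

```python
import hashlib
import re


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(examples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized (prompt, completion) pair."""
    seen = set()
    unique = []
    for example in examples:
        joined = normalize(example["prompt"]) + "\x00" + normalize(example["completion"])
        key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique
```

Hashing the normalized pair keeps memory use flat even for large datasets, since only the digests are stored.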

Context Optimization

Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task.

context-window-management.md -- Five-layer architecture, anchored summarization, compression triggers
context-caching.md -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+
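The compression trigger listed under Key Decisions (compress at 70% utilization, target 50%) reduces to a little arithmetic. The function below is a sketch under that reading; its name and signature are hypothetical.

```python
def plan_compression(used_tokens: int, window_tokens: int,
                     trigger: float = 0.70, target: float = 0.50) -> int:
    """Return how many tokens to remove, or 0 if usage is below the trigger."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    if used_tokens / window_tokens < trigger:
        return 0  # still under the trigger; leave the context alone
    target_tokens = int(window_tokens * target)
    return max(used_tokens - target_tokens, 0)
```

With a 100,000-token window, 60,000 used tokens need no compression, while 80,000 used tokens call for removing 30,000 to get back to the 50% target.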

Evaluation

Evaluate LLM outputs with multi-dimension scoring, quality gates, and benchmarks.

evaluation-metrics.md -- LLM-as-judge, RAGAS metrics, hallucination detection
evaluation-benchmarks.md -- Quality gates, batch evaluation, pairwise comparison
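A quality gate over multi-dimension scores could be sketched as below. The thresholds come from the Key Decisions table (0.7 production, 0.6 drafts); combining dimensions with an unweighted mean, and the dimension names in the example, are assumptions.

```python
from statistics import mean

# Thresholds from Key Decisions; the stage names are illustrative keys.
THRESHOLDS = {"production": 0.7, "draft": 0.6}


def passes_quality_gate(scores: dict[str, float], stage: str = "production") -> bool:
    """Return True if the mean judge score meets the threshold for the stage."""
    if not scores:
        raise ValueError("at least one dimension score is required")
    if any(not 0.0 <= s <= 1.0 for s in scores.values()):
        raise ValueError("scores must be in [0, 1]")
    # Unweighted mean across dimensions: an assumption, not the skill's formula.
    return mean(scores.values()) >= THRESHOLDS[stage]
```

For scores of 0.8, 0.6, and 0.65 the mean is roughly 0.68, which passes the draft gate but fails the production gate.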

Prompt Engineering

Design, version, and optimize prompts for production LLM applications.

prompt-design.md -- Chain-of-Thought, few-shot learning, pattern selection guide
prompt-testing.md -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency
prompt-react-pattern.md -- ReAct loop for tool-using agents, thought-action-observation format
prompt-optimization.md -- Token reduction, cost optimization, model tiering, prompt spec format
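Self-consistency, mentioned in the prompt-testing rule, samples several answers to the same prompt and keeps the most common one. The sketch below covers only the voting step and assumes the final answers have already been extracted from each sample; the function name is hypothetical.

```python
from collections import Counter


def self_consistent_answer(answers: list[str]) -> tuple[str, float]:
    """Return the most common final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize so that formatting differences do not split the vote.
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)
```

The vote share doubles as a rough confidence signal: a low share suggests the prompt or the task is ambiguous.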

Key Decisions

Decision	Recommendation
Tool schema mode	`strict: true` (2026 best practice)
Tool count	5-15 max per request
Streaming protocol	SSE for web, WebSocket for bidirectional
Buffer size	50-200 tokens
Local model (reasoning)	`deepseek-r1:70b`
Local model (coding)	`qwen2.5-coder:32b`
Fine-tuning approach	LoRA/QLoRA (try prompting first)
LoRA rank	16-64 typical
Training epochs	1-3 (more risks overfitting)
Context compression	Anchored iterative (60-80%)
Compress trigger	70% utilization, target 50%
Judge model	GPT-5.2-mini or Haiku 4.5
Quality threshold	0.7 production, 0.6 drafts
Few-shot examples	3-5 diverse, representative
Prompt versioning	Langfuse with labels
Auto-optimization	DSPy MIPROv2

Related Skills

ork:rag-retrieval -- Embedding patterns, when RAG is better than fine-tuning
agent-loops -- Multi-step tool use with reasoning
llm-evaluation -- Evaluate fine-tuned and local models
langfuse-observability -- Track training experiments

Capability Details

function-calling

Keywords: tool, function, define tool, tool schema, function schema, strict mode, parallel tools

Solves:

Define tools with clear descriptions and strict schemas
Execute tool calls in parallel with asyncio.gather
Validate inputs and handle errors in tool execution loops
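Parallel tool execution with `asyncio.gather` can be sketched as follows. The two tools, the `TOOLS` registry, and the call and result dictionaries are hypothetical shapes chosen for illustration, not a specific provider's format.

```python
import asyncio


async def get_weather(city: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for an I/O-bound API call
    return f"sunny in {city}"


async def get_time(zone: str) -> str:
    await asyncio.sleep(0.01)
    if zone != "UTC":
        raise ValueError(f"unsupported zone {zone!r}")
    return "12:00"


TOOLS = {"get_weather": get_weather, "get_time": get_time}


async def run_tool_calls(calls: list[dict]) -> list[dict]:
    """Run independent tool calls concurrently; one result per call, in order."""
    results = await asyncio.gather(
        *(TOOLS[c["name"]](**c["arguments"]) for c in calls),
        return_exceptions=True,  # a failing tool must not cancel the others
    )
    return [
        {"tool_call_id": c["id"],
         "content": f"error: {r}" if isinstance(r, Exception) else r}
        for c, r in zip(calls, results)
    ]
```

Because `gather` preserves input order, each result can be paired with its originating call ID, and a failed call is reported as a tool result rather than aborting the batch.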

streaming

Keywords: streaming, SSE, Server-Sent Events, real-time, backpressure, token stream

Solves:

Stream LLM tokens via SSE endpoints
Handle tool calls within streams
Manage backpressure with bounded queues
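When tool calls arrive inside a stream, their JSON arguments typically come in fragments that must be accumulated before parsing. The sketch below assumes a simplified, hypothetical chunk shape (`index`, optional `name`, optional `arguments`); real provider deltas carry more fields.

```python
import json


def accumulate_tool_calls(chunks: list[dict]) -> dict[int, dict]:
    """Merge streamed tool-call deltas, keyed by call index, into full calls."""
    calls: dict[int, dict] = {}
    for chunk in chunks:
        call = calls.setdefault(chunk["index"], {"name": "", "arguments": ""})
        call["name"] += chunk.get("name", "")
        call["arguments"] += chunk.get("arguments", "")
    # Parse only after the stream ends; partial fragments are not valid JSON.
    return {
        i: {"name": c["name"], "arguments": json.loads(c["arguments"])}
        for i, c in calls.items()
    }
```

Keying by index keeps fragments from interleaved parallel tool calls separate until each one is complete.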

local-inference

Keywords: Ollama, local, self-hosted, model selection, GPU, Apple Silicon

Solves:

Set up Ollama for local LLM inference
Select models based on task and hardware
Optimize GPU usage and CI integration
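Pre-warming and keep-alive can be illustrated by building a load request for Ollama's HTTP API. This sketch assumes that a POST to `/api/generate` with a model name and no prompt loads the model, and that `keep_alive` accepts a duration string; verify both against the Ollama version in use. The helper name and the 30-minute default are hypothetical.

```python
import json
import urllib.request


def build_prewarm_request(model: str, base_url: str = "http://localhost:11434",
                          keep_alive: str = "30m") -> urllib.request.Request:
    """Build a POST to /api/generate with no prompt, asking Ollama to load the model."""
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it against a running Ollama server:
# urllib.request.urlopen(build_prewarm_request("deepseek-r1:70b"), timeout=120)
```

Separating request construction from sending makes the helper easy to check in CI without a running Ollama server.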

fine-tuning

Keywords: LoRA, QLoRA, fine-tune, DPO, synthetic data, PEFT, alignment

Solves:

Configure LoRA/QLoRA for parameter-efficient training
Generate and validate synthetic training data
Align models with DPO and evaluate results
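To see why LoRA is parameter-efficient, it helps to count weights: adapting a `d_out x d_in` matrix with rank `r` trains two small matrices of shapes `r x d_in` and `d_out x r`. The 4096-wide layer in the usage note is a hypothetical example size, not a figure from the skill.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights added by LoRA for one adapted matrix: r * (d_in + d_out)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter size relative to the full d_out x d_in weight matrix."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)
```

For a 4096 x 4096 projection at rank 16, the adapter has 131,072 trainable weights, under 1% of the 16,777,216 in the full matrix; doubling the rank doubles the adapter size.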

Maintainer

yonatangross Core maintainer

Source details

Full Name: yonatangross/orchestkit
Branch: main
Path in repo: plugins/ork/skills/llm-integration
License: MIT License
Topics: claude-code mcp typescript agents llm react ai-development security rag langgraph testing claude-plugin fastapi
