Agent skill
embedding-optimization
Optimizing vector embeddings for RAG systems through model selection, chunking strategies, caching, and performance tuning. Use when building semantic search, RAG pipelines, or document retrieval systems that require cost-effective, high-quality embeddings.
Install this agent skill to your Project
npx add-skill https://github.com/ancoleman/ai-design-components/tree/main/skills/embedding-optimization
SKILL.md
Embedding Optimization
Optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.
When to Use This Skill
Trigger this skill when:
- Building RAG (Retrieval Augmented Generation) systems
- Implementing semantic search or similarity detection
- Optimizing embedding API costs (reducing by 70-90%)
- Improving document retrieval quality through better chunking
- Processing large document corpora (thousands to millions of documents)
- Selecting between API-based vs. local embedding models
Model Selection Framework
Choose the optimal embedding model based on requirements:
Quick Recommendations:
- Startup/MVP:
all-MiniLM-L6-v2(local, 384 dims, zero API costs) - Production:
text-embedding-3-small(API, 1,536 dims, balanced quality/cost) - High Quality:
text-embedding-3-large(API, 3,072 dims, premium) - Multilingual:
multilingual-e5-base(local, 768 dims) or Cohereembed-multilingual-v3.0
For detailed decision frameworks including cost comparisons, quality benchmarks, and data privacy considerations, see references/model-selection-guide.md.
Model Comparison Summary:
| Model | Type | Dimensions | Cost per 1M tokens | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | Local | 384 | $0 (compute only) | High volume, tight budgets |
| BGE-base-en-v1.5 | Local | 768 | $0 (compute only) | Quality + cost balance |
| text-embedding-3-small | API | 1,536 | $0.02 | General purpose production |
| text-embedding-3-large | API | 3,072 | $0.13 | Premium quality requirements |
| embed-multilingual-v3.0 | API | 1,024 | $0.10 | 100+ language support |
Chunking Strategies
Select chunking strategy based on content type and use case:
Content Type → Strategy Mapping:
- Documentation: Recursive (heading-aware), 800 chars, 100 overlap
- Code: Recursive (function-level), 1,000 chars, 100 overlap
- Q&A/FAQ: Fixed-size, 500 chars, 50 overlap (precise retrieval)
- Legal/Technical: Semantic (large), 1,500 chars, 200 overlap (context preservation)
- Blog Posts: Semantic (paragraph), 1,000 chars, 100 overlap
- Academic Papers: Recursive (section-aware), 1,200 chars, 150 overlap
For detailed chunking patterns, decision trees, and implementation guidance, see references/chunking-strategies.md.
Quick Start with CLI:
python scripts/chunk_document.py \
--input document.txt \
--content-type markdown \
--chunk-size 800 \
--overlap 100 \
--output chunks.jsonl
Caching Implementation
Achieve 80-90% cost reduction through content-addressable caching.
Caching Architecture by Query Volume:
- <10K queries/month: In-memory cache (Python
lru_cache) - 10K-100K queries/month: Redis (fast, TTL-based expiration)
- 100K-1M queries/month: Redis (hot) + PostgreSQL (warm)
- >1M queries/month: Multi-tier (Redis + PostgreSQL + S3)
Production Caching with Redis:
# Embed documents with caching enabled
python scripts/cached_embedder.py \
--model text-embedding-3-small \
--input documents.jsonl \
--output embeddings.npy \
--cache-backend redis \
--cache-ttl 2592000 # 30 days
Caching ROI Example:
- 50,000 document chunks
- 20% duplicate content
- Without caching: $0.50 API cost
- With caching (60% hit rate): $0.20 API cost
- Savings: 60% ($0.30)
Dimensionality Trade-offs
Balance storage, search speed, and quality:
| Dimensions | Storage (1M vectors) | Search Speed (p95) | Quality | Use Case |
|---|---|---|---|---|
| 384 | 1.5 GB | 10ms | Good | Large-scale search |
| 768 | 3 GB | 15ms | High | General purpose RAG |
| 1,536 | 6 GB | 25ms | Very High | High-quality retrieval |
| 3,072 | 12 GB | 40ms | Highest | Premium applications |
Key Insight: For most RAG applications, 768 dimensions (BGE-base-en-v1.5 local or equivalent) provides the best quality/cost/speed balance.
Batch Processing Optimization
Maximize throughput for large-scale ingestion:
OpenAI API:
- Batch up to 2,048 inputs per request
- Implement rate limiting (tier-dependent: 500-5,000 RPM)
- Use parallel requests with backoff on rate limits
Local Models (sentence-transformers):
- GPU acceleration (CUDA, MPS for Apple Silicon)
- Batch size tuning (32-128 based on GPU memory)
- Multi-GPU support for maximum throughput
Expected Throughput:
- OpenAI API: 1,000-5,000 texts/minute (rate limit dependent)
- Local GPU (RTX 3090): 5,000-10,000 texts/minute
- Local CPU: 100-500 texts/minute
Performance Monitoring
Track key metrics for optimization:
Critical Metrics:
- Latency: Embedding generation time (p50, p95, p99)
- Throughput: Embeddings per second/minute
- Cost: API usage tracking (USD per 1K/1M tokens)
- Cache Efficiency: Hit rate percentage
For detailed monitoring setup, metric collection patterns, and dashboarding, see references/performance-monitoring.md.
Monitor with Wrapper:
from scripts.performance_monitor import MonitoredEmbedder
monitored = MonitoredEmbedder(
embedder=your_embedder,
cost_per_1k_tokens=0.00002 # OpenAI pricing
)
embeddings = monitored.embed_batch(texts)
metrics = monitored.get_metrics()
print(f"Cache hit rate: {metrics['cache_hit_rate_pct']}%")
print(f"Total cost: ${metrics['total_cost_usd']}")
Working Examples
See examples/ directory for complete implementations:
Python Examples:
examples/openai_cached.py- OpenAI embeddings with Redis cachingexamples/local_embedder.py- sentence-transformers local embeddingexamples/smart_chunker.py- Content-aware recursive chunkingexamples/performance_monitor.py- Pipeline performance trackingexamples/batch_processor.py- Large-scale document processing
All examples include:
- Complete, runnable code
- Dependency installation instructions
- Error handling and retry logic
- Configuration options
Integration Points
Upstream (This skill provides to):
- Vector Databases: Embeddings flow to Pinecone, Weaviate, Qdrant, pgvector
- RAG Systems: Optimized embeddings for retrieval pipelines
- Semantic Search: Query and document embeddings for similarity search
Downstream (This skill uses from):
- Document Processing: Chunk documents before embedding
- Data Ingestion: Process documents from various sources
Related Skills:
- For RAG architecture, see
building-ai-chatskill - For vector database operations, see
databases-vectorskill - For data ingestion pipelines, see
ingesting-dataskill
Common Patterns
Pattern 1: RAG Pipeline
Document → Chunk → Embed → Store (vector DB) → Retrieve
Pattern 2: Semantic Search
Query → Embed → Search (vector DB) → Rank → Display
Pattern 3: Multi-Stage Retrieval (Cost Optimization)
Query → Cheap Embedding (384d) → Initial Search →
Expensive Embedding (1,536d) → Rerank Top-K → Return
Cost Savings: 70% reduction vs. single-stage with expensive embeddings
Quick Reference Checklist
Model Selection:
- Identified data privacy requirements (local vs. API)
- Calculated expected query volume
- Determined quality requirements (good/high/highest)
- Checked multilingual support needs
Chunking:
- Analyzed content type (code, docs, legal, etc.)
- Selected appropriate chunk size (500-1,500 chars)
- Set overlap to prevent context loss (50-200 chars)
- Validated chunks preserve semantic boundaries
Caching:
- Implemented content-addressable hashing
- Selected cache backend (Redis, PostgreSQL)
- Set TTL based on content volatility
- Monitoring cache hit rate (target: >60%)
Performance:
- Tracking latency (embedding generation time)
- Measuring throughput (embeddings/sec)
- Monitoring costs (USD spent on API calls)
- Optimizing batch sizes for maximum efficiency
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
designing-sdks
Design production-ready SDKs with retry logic, error handling, pagination, and multi-language support. Use when building client libraries for APIs or creating developer-facing SDK interfaces.
administering-linux
Manage Linux systems covering systemd services, process management, filesystems, networking, performance tuning, and troubleshooting. Use when deploying applications, optimizing server performance, diagnosing production issues, or managing users and security on Linux servers.
implementing-api-patterns
API design and implementation across REST, GraphQL, gRPC, and tRPC patterns. Use when building backend services, public APIs, or service-to-service communication. Covers REST frameworks (FastAPI, Axum, Gin, Hono), GraphQL libraries (Strawberry, async-graphql, gqlgen, Pothos), gRPC (Tonic, Connect-Go), tRPC for TypeScript, pagination strategies (cursor-based, offset-based), rate limiting, caching, versioning, and OpenAPI documentation generation. Includes frontend integration patterns for forms, tables, dashboards, and ai-chat skills.
prompt-engineering
Engineer effective LLM prompts using zero-shot, few-shot, chain-of-thought, and structured output techniques. Use when building LLM applications requiring reliable outputs, implementing RAG systems, creating AI agents, or optimizing prompt quality and cost. Covers OpenAI, Anthropic, and open-source models with multi-language examples (Python/TypeScript).
deploying-applications
Deployment patterns from Kubernetes to serverless and edge functions. Use when deploying applications, setting up CI/CD, or managing infrastructure. Covers Kubernetes (Helm, ArgoCD), serverless (Vercel, Lambda), edge (Cloudflare Workers, Deno), IaC (Pulumi, OpenTofu, SST), and GitOps patterns.
optimizing-costs
Optimize cloud infrastructure costs through FinOps practices, commitment discounts, right-sizing, and automated cost management. Use when reducing cloud spend, implementing budget controls, or establishing cost visibility across AWS, Azure, GCP, and Kubernetes environments.
Didn't find tool you were looking for?