Agent skill

rag-architect

Designs production-grade RAG pipelines with chunking optimization, retrieval evaluation, and pipeline architecture. Use when building a RAG system, selecting a chunking strategy, choosing a vector database, optimizing retrieval quality, designing embedding pipelines, or evaluating RAG performance with RAGAS metrics.

Stars 71
Forks 21

Install this agent skill to your Project

npx add-skill https://github.com/borghei/Claude-Skills/tree/main/engineering/rag-architect

Metadata

Additional technical details for this skill

tier
POWERFUL
author
borghei
domain
ai-ml
updated
1774915200
version
1.0.0
category
engineering

SKILL.md

RAG Architect

The agent designs, implements, and optimizes production-grade Retrieval-Augmented Generation pipelines, covering the full lifecycle from document chunking through evaluation.

Workflow

  1. Analyse corpus -- Profile the document collection: count, average length, format mix (PDF, HTML, Markdown), language(s), and domain. Validate that sample documents are accessible before proceeding.
  2. Select chunking strategy -- Choose from the Chunking Strategy Matrix based on corpus characteristics. Set chunk size, overlap, and boundary rules. Run a test split on 100 sample documents.
  3. Choose embedding model -- Select an embedding model from the Embedding Model table based on domain, latency budget, and cost constraints. Verify dimension compatibility with the target vector database.
  4. Select vector database -- Pick a vector store from the Vector Database Comparison based on scale, query patterns, and operational requirements.
  5. Design retrieval pipeline -- Configure retrieval strategy (dense, sparse, or hybrid). Add reranking if precision requirements exceed 0.85. Set the top-K parameter and similarity threshold.
  6. Implement query transformations -- If query-document style mismatch exists, enable HyDE. If queries are ambiguous, enable multi-query generation. Validate each transformation improves retrieval metrics on a held-out set.
  7. Configure guardrails -- Enable PII detection, toxicity filtering, hallucination detection, and source attribution. Set confidence score thresholds.
  8. Evaluate end-to-end -- Run the RAGAS evaluation framework. Verify faithfulness > 0.90, context relevance > 0.80, answer relevance > 0.85. Iterate on weak components.

Chunking Strategy Matrix

Strategy Best For Chunk Size Overlap Pros Cons
Fixed-size (token) Uniform docs, consistent sizing 512-2048 tokens 10-20% Predictable, simple Breaks semantic units
Sentence-based Narrative text, articles 3-8 sentences 1 sentence Preserves language boundaries Variable sizes
Paragraph-based Structured docs, technical manuals 1-3 paragraphs 0-1 paragraph Preserves topic coherence Highly variable sizes
Semantic Long-form, research papers Dynamic Topic-shift detection Best coherence Computationally expensive
Recursive Mixed content types Dynamic, multi-level Per-level Optimal utilization Complex implementation
Document-aware Multi-format collections Format-specific Section-level Preserves metadata Format-specific code required

Embedding Model Comparison

Model Dimensions Speed Quality Cost Best For
all-MiniLM-L6-v2 384 ~14K tok/s Good Free (local) Prototyping, low-latency
all-mpnet-base-v2 768 ~2.8K tok/s Better Free (local) Balanced production use
text-embedding-3-small 1536 API High $0.02/1M tokens Cost-effective production
text-embedding-3-large 3072 API Highest $0.13/1M tokens Maximum quality
Domain fine-tuned Varies Varies Domain-best Training cost Specialized domains (legal, medical)

Vector Database Comparison

Database Type Scaling Key Feature Best For
Pinecone Managed Auto-scaling Metadata filtering, hybrid search Production, managed preference
Weaviate Open source Horizontal GraphQL API, multi-modal Complex data types
Qdrant Open source Distributed High perf, low memory (Rust) Performance-critical
Chroma Embedded Limited Simple API, SQLite-backed Prototyping, small-scale
pgvector PostgreSQL ext PostgreSQL scaling ACID, SQL joins Existing PostgreSQL infra

Retrieval Strategies

Strategy When to Use Implementation
Dense (vector similarity) Default for semantic search Cosine similarity with k-NN/ANN
Sparse (BM25/TF-IDF) Exact keyword matching needed Elasticsearch or inverted index
Hybrid (dense + sparse) Best of both needed Reciprocal Rank Fusion (RRF) with tuned weights
+ Reranking Precision must exceed 0.85 Cross-encoder reranker after initial retrieval

Query Transformation Techniques

Technique When to Use How It Works
HyDE Query/document style mismatch LLM generates hypothetical answer; embed that instead of query
Multi-query Ambiguous queries Generate 3-5 query variations; retrieve for each; deduplicate
Step-back Specific questions needing general context Transform to broader query; retrieve general + specific

Context Window Optimization

  • Relevance ordering: Most relevant chunks first in the context window
  • Diversity: Deduplicate semantically similar chunks
  • Token budget: Fit within model context limit; reserve tokens for system prompt and answer
  • Hierarchical inclusion: Include section summary before detailed chunks when available
  • Compression: Summarize low-relevance chunks; extract key facts from verbose passages

Evaluation Metrics (RAGAS Framework)

Metric Target What It Measures
Faithfulness > 0.90 Answers grounded in retrieved context
Context Relevance > 0.80 Retrieved chunks relevant to query
Answer Relevance > 0.85 Answer addresses the original question
Precision@K > 0.70 % of top-K results that are relevant
Recall@K > 0.80 % of relevant docs found in top-K
MRR > 0.75 Reciprocal rank of first relevant result

Guardrails

  • PII detection: Scan retrieved chunks and generated responses for PII; redact or block
  • Hallucination detection: Compare generated claims against source documents via NLI
  • Source attribution: Every factual claim must cite a retrieved chunk
  • Confidence scoring: Return confidence level; if below threshold, return "I don't have enough information"
  • Injection prevention: Sanitize user queries; reject prompt injection attempts

Example: Internal Knowledge Base RAG Pipeline

yaml
corpus:
  documents: 12,000 Confluence pages + 3,000 PDFs
  avg_length: 2,400 tokens
  languages: [English]
  domain: internal engineering docs

pipeline:
  chunking:
    strategy: recursive
    max_tokens: 512
    overlap: 50 tokens
    boundary: paragraph
  embedding:
    model: text-embedding-3-small
    dimensions: 1536
    batch_size: 100
  vector_db:
    engine: pgvector
    index: HNSW (ef_construction=128, m=16)
    reason: "Existing PostgreSQL infra; ACID compliance for audit"
  retrieval:
    strategy: hybrid
    dense_weight: 0.7
    sparse_weight: 0.3
    top_k: 10
    reranker: cross-encoder/ms-marco-MiniLM-L-12-v2
    final_k: 5

evaluation_results:
  faithfulness: 0.93
  context_relevance: 0.84
  answer_relevance: 0.88
  precision_at_5: 0.76
  recall_at_10: 0.85

Production Patterns

  • Caching: Query-level (exact match), semantic (similar queries via embedding distance < 0.05), chunk-level (embedding cache)
  • Streaming: Stream generation tokens while retrieval completes; show sources after generation
  • Fallbacks: If primary vector DB is unavailable, serve from read-replica; if retrieval returns no results above threshold, say so explicitly
  • Document refresh: Incremental re-embedding on change detection; full re-index weekly
  • Cost control: Batch embeddings, cache aggressively, route simple queries to BM25 only

Common Pitfalls

Problem Solution
Chunks break mid-sentence Use boundary-aware chunking with sentence/paragraph overlap
Low retrieval precision Add cross-encoder reranker; tune similarity threshold
High latency (> 2s) Cache embeddings; use faster model; reduce top-K
Inconsistent quality Implement RAGAS evaluation in CI; add quality scoring
Scalability bottleneck Shard vector DB; implement auto-scaling; add caching layer

Scripts

Chunking Optimizer

Analyses corpus and recommends optimal chunking strategy with parameters.

Retrieval Evaluator

Runs evaluation suite (precision, recall, MRR, NDCG) against a test query set.

Pipeline Benchmarker

Measures end-to-end latency, throughput, and cost per query across configurations.

Troubleshooting

Problem Cause Solution
Chunks contain incomplete sentences or broken code blocks Fixed-size chunking ignoring semantic boundaries Switch to sentence-based or semantic (heading-aware) chunking; enable boundary detection in chunking_optimizer.py
Retrieved context is relevant but answer is wrong LLM hallucinating beyond retrieved chunks Enable faithfulness evaluation via RAGAS; add source attribution guardrails; lower confidence threshold to surface "I don't know" responses
Precision@K below 0.50 despite relevant documents existing Embedding model does not capture domain vocabulary Fine-tune embedding model on domain data or switch to a domain-specific model; add cross-encoder reranking stage
Query latency exceeds 2 seconds Large top-K, no caching, or unoptimized HNSW index Reduce top-K, enable query-level and semantic caching, tune HNSW parameters (ef_search, m)
Recall drops after adding new documents Stale embeddings or index fragmentation after incremental inserts Trigger full re-index; verify new documents pass chunking pipeline; check embedding model version consistency
Hybrid retrieval returns duplicate chunks Dense and sparse retrievers returning overlapping results without deduplication Apply Reciprocal Rank Fusion (RRF) with deduplication before reranking; tune dense/sparse weight ratio
Evaluation metrics fluctuate across runs Non-deterministic embedding batching or insufficient test query set Fix random seeds, increase evaluation sample size, run evaluations on a frozen ground-truth set

Success Criteria

  • Faithfulness > 0.90 -- Generated answers are grounded in retrieved context as measured by the RAGAS faithfulness metric.
  • Context Relevance > 0.80 -- At least 80% of retrieved chunks are relevant to the user query.
  • Precision@5 > 0.70 -- Seven out of ten top-5 result sets contain only relevant documents.
  • End-to-end latency < 500ms -- P95 query-to-response latency stays under 500 milliseconds for interactive workloads.
  • Recall@10 > 0.85 -- The system retrieves at least 85% of relevant documents within the top 10 results.
  • Chunk boundary quality > 0.80 -- At least 80% of chunks end on clean sentence or paragraph boundaries as reported by chunking_optimizer.py.
  • Monthly cost within budget -- Total embedding, vector DB, and reranking costs stay within the budget ceiling defined in requirements.

Scope & Limitations

This skill covers:

  • End-to-end RAG pipeline architecture design: chunking, embedding, vector storage, retrieval, reranking, and evaluation.
  • Quantitative chunking analysis across four strategy families (fixed-size, sentence, paragraph, semantic).
  • Retrieval quality evaluation using standard IR metrics (Precision@K, Recall@K, MRR, NDCG) with a built-in TF-IDF baseline.
  • Automated pipeline design with component selection, cost projection, and Mermaid architecture diagrams.

This skill does NOT cover:

  • LLM prompt engineering or generation-side optimization -- see engineering/prompt-engineer-toolkit.
  • Database schema design for metadata stores alongside vector databases -- see engineering/database-designer.
  • Production observability, alerting, and SLO dashboards for deployed pipelines -- see engineering/observability-designer.
  • Agent orchestration or multi-step reasoning workflows that sit on top of RAG retrieval -- see engineering/agent-workflow-designer.

Integration Points

Skill Integration Data Flow
engineering/prompt-engineer-toolkit Optimize system prompts and few-shot examples fed alongside retrieved chunks Pipeline design output --> prompt templates that reference chunk format and metadata
engineering/database-designer Design relational metadata stores (tags, access control, source tracking) paired with the vector database Vector DB recommendation --> metadata schema for hybrid storage
engineering/observability-designer Set up latency, throughput, and accuracy monitoring for the deployed RAG pipeline Evaluation metrics and SLO targets --> dashboards and alerting rules
engineering/agent-workflow-designer Embed the RAG retrieval step inside multi-agent reasoning workflows Retrieval config --> agent tool definition with top-K and threshold parameters
engineering/ci-cd-pipeline-builder Automate embedding re-indexing, evaluation regression tests, and deployment on document changes Evaluation thresholds --> CI gate that blocks deploys when metrics regress
engineering/api-design-reviewer Review the query and ingestion API surface exposed by the RAG service Pipeline config --> OpenAPI spec review for search and ingest endpoints

Tool Reference

chunking_optimizer.py

Purpose: Analyzes a document corpus and evaluates multiple chunking strategies (fixed-size, sentence-based, paragraph-based, semantic/heading-aware) to recommend the optimal approach with configuration parameters.

Usage:

bash
python chunking_optimizer.py <directory> [options]

Flags / Parameters:

Flag Type Default Description
directory positional, required -- Directory containing text/markdown documents to analyze
--output, -o string None Output file path for results in JSON format
--config, -c string None JSON configuration file to customize strategy parameters (fixed_sizes, overlaps, sentence_max_sizes, paragraph_max_sizes, semantic_max_sizes)
--extensions string list .txt .md .markdown File extensions to include when scanning the corpus
--verbose, -v flag off Print all strategy scores in addition to the recommendation

Example:

bash
python chunking_optimizer.py ./docs --output results.json --extensions .txt .md --verbose

Output Formats:

  • Console -- Corpus summary, recommended strategy name, performance score, reasoning text, and two sample chunks. With --verbose, all strategy scores are listed.
  • JSON (--output) -- Full results object containing corpus_info, strategy_results (per-strategy size statistics, boundary quality, semantic coherence, vocabulary statistics, performance score), recommendation (best strategy, all scores, reasoning), and sample_chunks.

retrieval_evaluator.py

Purpose: Evaluates retrieval system performance using a built-in TF-IDF baseline retriever and standard information retrieval metrics: Precision@K, Recall@K, MRR, and NDCG. Includes failure analysis and improvement recommendations.

Usage:

bash
python retrieval_evaluator.py <queries> <corpus> <ground_truth> [options]

Flags / Parameters:

Flag Type Default Description
queries positional, required -- JSON file containing queries (list of {"id": ..., "query": ...} objects, or {"queries": [...]})
corpus positional, required -- Directory containing the document corpus
ground_truth positional, required -- JSON file mapping query IDs to lists of relevant document IDs
--output, -o string None Output file path for results in JSON format
--k-values int list 1 3 5 10 K values used when computing Precision@K, Recall@K, and NDCG@K
--extensions string list .txt .md .markdown File extensions to include from the corpus directory
--verbose, -v flag off Print detailed per-metric values and failure analysis counts

Example:

bash
python retrieval_evaluator.py queries.json ./corpus ground_truth.json --output eval.json --k-values 1 5 10 --verbose

Output Formats:

  • Console -- Evaluation summary table (Precision@1, Precision@5, Recall@5, MRR, NDCG@5) with performance assessment and numbered improvement recommendations. With --verbose, all aggregate metrics and failure analysis counts are printed.
  • JSON (--output) -- Full results object containing aggregate_metrics, query_results (per-query metrics, retrieved docs, relevant docs), failure_analysis (poor precision/recall counts, zero-result counts, query length analysis, failure patterns), evaluation_summary, and recommendations.

rag_pipeline_designer.py

Purpose: Accepts a system requirements specification and generates a complete RAG pipeline design including component recommendations (chunking, embedding, vector DB, retrieval, reranking, evaluation), cost projections, a Mermaid architecture diagram, and deployment configuration templates.

Usage:

bash
python rag_pipeline_designer.py <requirements> [options]

Flags / Parameters:

Flag Type Default Description
requirements positional, required -- JSON file containing system requirements (document_types, document_count, avg_document_size, queries_per_day, query_patterns, latency_requirement, budget_monthly, accuracy_priority, cost_priority, maintenance_complexity)
--output, -o string None Output file path for the pipeline design in JSON format
--verbose, -v flag off Print full configuration templates for each component

Example:

bash
python rag_pipeline_designer.py requirements.json --output pipeline_design.json --verbose

Output Formats:

  • Console -- Design summary with total monthly cost, per-component recommendations (name, rationale, cost), and a Mermaid architecture diagram. With --verbose, full JSON configuration templates for each component are printed.
  • JSON (--output) -- Complete pipeline design object containing per-component ComponentRecommendation fields (name, type, config, rationale, pros, cons, cost_monthly), total_cost, architecture_diagram (Mermaid markup), and config_templates (per-component configs plus deployment/scaling/monitoring settings).

Expand your agent's capabilities with these related and highly-rated skills.

borghei/Claude-Skills

churn-prevention

SaaS churn reduction covering cancel flow design, dynamic save offers, exit survey architecture, dunning sequences, payment recovery, win-back campaigns, and churn impact modeling.

71 21
Explore
borghei/Claude-Skills

popup-cro

Popup and modal optimization for conversion. Covers exit-intent, slide-ins, banners, timing optimization, frequency capping, audience targeting, compliance, and A/B testing frameworks for lead capture, promotions, and announcements.

71 21
Explore
borghei/Claude-Skills

competitor-alternatives

Competitor comparison and alternative page creation for SEO and sales enablement. Covers 4 page formats (singular alternative, plural alternatives, vs pages, competitor vs competitor), content architecture, research methodology, and centralized competitor data management.

71 21
Explore
borghei/Claude-Skills

contract-and-proposal-writer

Generate production-ready business documents including freelance contracts, project proposals, SOWs, NDAs, and MSAs with jurisdiction-aware clauses. Covers US (Delaware), EU (GDPR), UK, and DACH (German law) legal frameworks. Includes contract templates, clause libraries, and DOCX conversion. Use when starting client engagements, writing proposals, drafting partnership agreements, or needing GDPR-compliant data processing addenda.

71 21
Explore
borghei/Claude-Skills

pricing-strategy

SaaS pricing design and optimization covering value metric selection, tier architecture, price point research, pricing page design, price increase execution, and competitive pricing analysis.

71 21
Explore
borghei/Claude-Skills

referral-program

Referral and affiliate program design covering referral loop architecture, incentive design, trigger moment optimization, viral coefficient modeling, affiliate program structure, and optimization playbook.

71 21
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results