implementing-rag

Set up RAG pipelines with document chunking, embedding generation, and retrieval strategies using LlamaIndex. Use when building new RAG systems, choosing chunking approaches, selecting embedding models, or implementing vector/hybrid retrieval for src/ or src-iLand/ pipelines.


Install this agent skill to your project:

```bash
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/implementing-rag
```


Implementing RAG Systems

Quick-start guide for building production-grade RAG pipelines with LlamaIndex. This skill helps you set up the foundational components: document processing, embedding generation, and retrieval.

When to Use This Skill

  • Building a new RAG pipeline from scratch
  • Choosing optimal chunking strategy for your documents
  • Selecting and configuring embedding models
  • Implementing retrieval (vector, BM25, or hybrid search)
  • Optimizing for specific use cases (Thai language, legal documents, product catalogs)
  • Integrating with existing pipelines in src/ or src-iLand/

Quick Decision Guide

Chunking Strategy

  • Legal/Technical docs (like Thai land deeds) → 512-1024 tokens
  • Narrative content → 1024-2048 tokens
  • Q&A pairs → 256-512 tokens
  • Rule of thumb: if you halve the chunk size, double similarity_top_k so the retriever returns roughly the same amount of context (see the sketch below)
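
A minimal sketch of that pairing, assuming the LlamaIndex Settings API used in the patterns below; the numbers are starting points, not tuned values:

```python
from llama_index.core import Settings

# Baseline: larger chunks, fewer retrieved nodes
Settings.chunk_size = 1024
similarity_top_k = 5

# Halved chunks -> doubled top_k keeps roughly the same context budget
Settings.chunk_size = 512
similarity_top_k = 10  # pass to index.as_query_engine(similarity_top_k=...)
```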

Embedding Model Selection

  • Cost-sensitive → HuggingFace bge-base-en-v1.5 with ONNX (3-7x faster, free)
  • Quality-first → OpenAI text-embedding-3-small or JinaAI-base
  • Multi-language (Thai) → Cohere embed-multilingual-v3.0
  • Balanced → OpenAI or bge-large (open-source)

Retrieval Strategy

  • Simple queries → Vector search
  • Keyword-heavy → BM25
  • Production systems → Hybrid (vector + BM25)
  • Structured data → Metadata filtering
  • Large doc sets (100+ documents) → Document summary or recursive retrieval (see the sketch below)
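
For the large-document case, a hedged sketch of the "document summary" option using LlamaIndex's DocumentSummaryIndex (recursive retrieval is a different setup, not shown here). It builds one summary per document and routes queries to the most relevant documents before fetching chunks:

```python
from llama_index.core import DocumentSummaryIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

# Builds one LLM-written summary per document; queries match summaries
# first, then retrieve from the selected documents
summary_index = DocumentSummaryIndex.from_documents(documents)

query_engine = summary_index.as_query_engine()
response = query_engine.query("Your question here")
```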

Quick Start Patterns

Pattern 1: Standard RAG Setup (Vector Search)

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure chunking (applied by the default SentenceSplitter)
Settings.chunk_size = 1024
Settings.chunk_overlap = 20

# Configure embeddings
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100
)

# Load and index documents
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create query engine
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Your question here")
```

Pattern 2: Multi-Language Setup (Thai Support)

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.cohere import CohereEmbedding

# Use multilingual embeddings
embed_model = CohereEmbedding(
    model_name="embed-multilingual-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# Configure chunking for Thai text
node_parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=50
)

# Load Thai documents
documents = SimpleDirectoryReader("./data").load_data()

# Create index with multilingual support
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embed_model,
    transformations=[node_parser]
)

query_engine = index.as_query_engine(similarity_top_k=5)
```

Pattern 3: Hybrid Search (Production-Ready)

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
import Stemmer  # Requires: pip install PyStemmer

# Create vector index (`documents` loaded as in Pattern 1)
index = VectorStoreIndex.from_documents(documents)
vector_retriever = index.as_retriever(similarity_top_k=10)

# Create BM25 retriever over the same nodes
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=10,
    stemmer=Stemmer.Stemmer("english")
)

# Combine with reciprocal-rank fusion
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    mode="reciprocal_rerank",
    use_async=True
)

# Use in a query engine
query_engine = RetrieverQueryEngine.from_args(hybrid_retriever)
```

Pattern 4: Cost-Optimized Local Embeddings

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use a local model with ONNX acceleration (3-7x faster)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    backend="onnx"  # Requires: pip install optimum[onnxruntime]
)

# The rest of the pipeline is unchanged (`documents` loaded as in Pattern 1)
index = VectorStoreIndex.from_documents(documents)
```

Your Codebase Integration

For src/ Pipeline

Document Preprocessing (src/02_prep_doc_for_embedding.py):

  • Add configurable chunking strategy
  • Support multiple node parser types

Batch Embeddings (src/09_enhanced_batch_embeddings.py):

  • Increase embed_batch_size to 100 for faster processing
  • Consider local models to eliminate API costs

Basic Retrieval (src/10_basic_query_engine.py):

  • Baseline vector search implementation
  • Adjust similarity_top_k based on chunk size

For src-iLand/ Pipeline (Thai Land Deeds)

Data Processing (src-iLand/data_processing/):

  • Current chunk size: 1024 tokens (good for legal documents)
  • Consider testing 512 with increased top_k for better precision

Embeddings (src-iLand/docs_embedding/):

  • Current: OpenAI text-embedding-3-small
  • Test: Cohere embed-multilingual-v3.0 for better Thai understanding

Retrieval (src-iLand/retrieval/retrievers/):

  • Vector strategy: Baseline semantic search
  • Hybrid strategy: Combines vector + BM25 for Thai queries
  • Metadata strategy: Fast filtering (<50ms) on Thai geographic/legal metadata

Detailed References

Load these reference files when you need comprehensive details:

  • reference-chunking.md: Complete chunking strategies guide

    • Node parser types and configurations
    • Chunk size optimization for different domains
    • Sentence-window and hierarchical patterns
    • Metadata preservation
  • reference-embeddings.md: Embedding model selection and optimization

    • All supported models (OpenAI, Cohere, HuggingFace, JinaAI, Voyage)
    • Performance benchmarks (hit rate, MRR)
    • ONNX/OpenVINO optimization details
    • Multi-language support
    • Cost optimization strategies
  • reference-retrieval-basics.md: Core retrieval patterns

    • Vector retrieval implementation
    • BM25 keyword search
    • Hybrid search (vector + BM25)
    • Metadata filtering
    • Configuration parameters and tuning

Common Workflows

Workflow 1: New RAG Pipeline Setup

  • Step 1: Choose chunking strategy based on document type

    • Load reference-chunking.md for detailed guidance
    • Test with sample documents
  • Step 2: Select embedding model

    • Load reference-embeddings.md for model comparison
    • Consider: cost, latency, language support
    • For Thai: use multilingual models
  • Step 3: Implement basic vector retrieval

    • Start with simple VectorStoreIndex
    • Test with sample queries
    • Measure baseline performance
  • Step 4: Optimize chunk size and top_k

    • Adjust based on query results
    • Balance precision vs recall
  • Step 5: (Optional) Upgrade to hybrid search

    • Combine vector + BM25 for production quality

Workflow 2: Migrating from API to Local Embeddings

  • Step 1: Choose local model (see reference-embeddings.md)

    • Recommended: BAAI/bge-base-en-v1.5
  • Step 2: Install optimization backend

    ```bash
    pip install optimum[onnxruntime]
    ```
  • Step 3: Update embedding configuration

    ```python
    Settings.embed_model = HuggingFaceEmbedding(
        model_name="BAAI/bge-base-en-v1.5",
        backend="onnx"
    )
    ```
    
  • Step 4: Re-index all documents

    • CRITICAL: you must re-embed with the new model
    • Use ingestion caching to avoid redundant work (see the sketch after this list)
  • Step 5: Validate retrieval quality

    • Compare results with API embeddings
    • Measure latency improvement (3-7x faster expected)
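
A minimal sketch of that caching step, assuming LlamaIndex's IngestionPipeline; the transformations and persist path are illustrative:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Cache transformation outputs so unchanged documents are not re-embedded
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5", backend="onnx"),
    ],
    cache=IngestionCache(),
)

documents = SimpleDirectoryReader("./data").load_data()
nodes = pipeline.run(documents=documents)

# Persist the pipeline cache so re-runs skip already-processed documents
pipeline.persist("./pipeline_storage")  # path is illustrative
```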

Workflow 3: Thai Language RAG Setup

  • Step 1: Use multilingual embedding model

    ```python
    embed_model = CohereEmbedding(
        model_name="embed-multilingual-v3.0"
    )
    ```
    
  • Step 2: Configure appropriate chunking

    • Test 512-1024 tokens for Thai text
    • Preserve Thai metadata
  • Step 3: Implement hybrid search

    • Vector for semantic understanding
    • BM25 for exact Thai keyword matching
  • Step 4: Add metadata filtering

    • จังหวัด (province), อำเภอ (district), ประเภท (type)
    • Enable fast filtering (<50ms); see the sketch after this list
  • Step 5: Test with Thai queries

    • Validate Unicode handling
    • Check retrieval relevance
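
A minimal sketch of Step 4, assuming nodes were ingested with a `province` metadata key (the key and value are illustrative, not the actual src-iLand/ schema) and `index` was built as in Pattern 2:

```python
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Restrict retrieval to one province before semantic search runs
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="province", value="เชียงใหม่")]  # hypothetical key
)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    filters=filters,
)
response = query_engine.query("Your question here")
```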

Key Reminders

Critical Requirements:

  • Re-indexing: When changing embedding models, MUST re-index all data
  • Model consistency: use identical embedding models for indexing and querying (see the sketch below)
  • Chunk size + top_k: Halving chunk size → double similarity_top_k
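
A minimal sketch of the model-consistency rule when reloading a persisted index; the persist directory is illustrative:

```python
from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.embeddings.openai import OpenAIEmbedding

# The query-time embed model must match the model used at indexing time
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Reload a previously persisted index (path is illustrative)
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(similarity_top_k=5)
```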

Best Practices:

  • Start simple (vector search) → Add complexity as needed (hybrid, metadata)
  • Test with representative queries before full indexing
  • Use local models for cost optimization at scale
  • Enable parallel loading (num_workers=10) for 13x speedup

Performance Tips:

  • Batch size: Set embed_batch_size=100 for API calls (combined with parallel loading in the sketch below)
  • Parallel loading: Use SimpleDirectoryReader(...).load_data(num_workers=10)
  • Caching: Enable ingestion pipeline caching for faster re-runs (see the Workflow 2 sketch above)
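
The first two tips combined, as a hedged sketch (./data and the parameter values follow the patterns above):

```python
from llama_index.core import Settings, SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

# Batch embedding API calls (100 texts per request)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100,
)

# Parallel document loading across worker processes
documents = SimpleDirectoryReader("./data").load_data(num_workers=10)
```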

Next Steps

After implementing basic RAG:

  • Optimize: Use optimizing-rag skill for reranking, caching, production deployment
  • Evaluate: Use evaluating-rag skill to measure hit rate, MRR, and compare strategies
