Agent skill
eval-rag
Evaluate RAG pipeline retrieval and generation quality separately. Measure Recall@k, Precision@k, MRR, NDCG@k for retrieval. Assess faithfulness and relevance for generation. Use when the AI feature uses retrieval (search, knowledge base, document QA). Do NOT use for non-RAG AI features.
Install this agent skill to your Project
npx add-skill https://github.com/breethomas/bette-think/tree/main/plugins/pm-thought-partner/skills/eval-rag
SKILL.md
Eval RAG
Evaluate RAG pipelines by separating retrieval quality from generation quality. Fix retrieval first.
Entry Point
When this skill is invoked, start with:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EVAL RAG
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Retrieval and generation fail differently. Measure them separately.
What RAG feature are we evaluating?
What does it retrieve from? (knowledge base, docs, database)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Prerequisites
Complete error analysis (/upgrade-evals) on RAG pipeline traces before selecting metrics. Inspect what was retrieved vs. what the model needed. Determine whether the problem is retrieval, generation, or both. Fix retrieval first — the LLM can ignore irrelevant context but cannot generate from missing context.
Core Process
Step 1: Separate Retrieval from Generation
Ask the PM: "When the AI gives a wrong answer, is it because it found the wrong documents, or because it had the right documents but said the wrong thing?"
This determines where to focus:
| Symptom | Likely Cause | Fix |
|---|---|---|
| AI says "I don't know" or gives generic answer | Retrieval: relevant docs not found | Improve chunking, embeddings, or query |
| AI confidently states something not in any doc | Generation: hallucination | Improve prompt, add grounding |
| AI answers a different question than asked | Could be either | Check retrieved docs first |
Step 2: Build a Retrieval Evaluation Dataset
The dataset consists of queries, each paired with the ground-truth document chunks that answer it.
Manual curation (highest quality, PM-driven): Ask the PM: "For each of these test queries, which document(s) actually contain the answer?"
The PM maps queries to chunks. This is domain knowledge work — the PM knows the content better than any automated system.
Synthetic QA generation (scalable):
Claude Code executes: For each document chunk, extract a fact and generate a question answerable only from that fact.
Given a chunk of text, extract a specific, self-contained fact from it.
Then write a question that is directly and unambiguously answered
by that fact alone.
Return output in JSON format:
{ "fact": "...", "question": "..." }
Chunk: "{text_chunk}"
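The synthetic QA step above can be sketched as two helpers: one that fills the prompt template for a chunk, and one that validates the model's JSON reply and turns it into a dataset row. This is a minimal sketch. The names `build_qa_prompt` and `parse_qa_response` and the row layout (`query`, `fact`, `relevant_chunk_ids`) are assumptions for illustration, and the model call itself is left out because the skill does not prescribe a client.

```python
import json
from typing import Optional

# Prompt text taken from the skill. {text_chunk} is filled in per chunk;
# the doubled braces keep the literal JSON example intact under str.format.
QA_PROMPT = (
    "Given a chunk of text, extract a specific, self-contained fact from it.\n"
    "Then write a question that is directly and unambiguously answered\n"
    "by that fact alone.\n"
    "Return output in JSON format:\n"
    '{{ "fact": "...", "question": "..." }}\n'
    'Chunk: "{text_chunk}"'
)


def build_qa_prompt(text_chunk: str) -> str:
    """Fill the template for one chunk (hypothetical helper)."""
    return QA_PROMPT.format(text_chunk=text_chunk)


def parse_qa_response(raw: str, chunk_id: str) -> Optional[dict]:
    """Turn a model reply into a dataset row, or None if it is malformed.

    The row layout is an assumption, not part of the skill.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    fact, question = data.get("fact"), data.get("question")
    if not (isinstance(fact, str) and fact.strip()):
        return None
    if not (isinstance(question, str) and question.strip()):
        return None
    # The source chunk is the only ground-truth chunk for this query.
    return {
        "query": question.strip(),
        "fact": fact.strip(),
        "relevant_chunk_ids": [chunk_id],
    }
```

Dropping malformed replies instead of repairing them keeps the dataset clean; rejected chunks can simply be retried.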
Adversarial questions (stress-testing):
Claude Code executes: Create queries that resemble content in multiple chunks but are only answered by one. Find similar chunks via embedding search, then generate a question that only the target chunk answers.
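The "find similar chunks via embedding search" step could look like the sketch below. It assumes chunk embeddings are already computed and held in a plain dict mapping chunk ID to vector. `nearest_distractors` is a hypothetical helper name, and a real pipeline would more likely query its vector store than loop in Python.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_distractors(target_id, embeddings, top_n=3):
    """Return the IDs of the top_n chunks most similar to the target chunk.

    embeddings: dict of chunk_id -> vector (assumed precomputed).
    The target itself is excluded from the results.
    """
    target = embeddings[target_id]
    scored = [
        (cosine(target, vec), chunk_id)
        for chunk_id, vec in embeddings.items()
        if chunk_id != target_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]
```

The returned chunks are the distractors: the generated question should be answerable from the target chunk and from none of them.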
Step 3: Measure Retrieval Quality
Claude Code executes all metric computations. The PM interprets results.
Recall@k — Did we find the right documents?
Recall@k = (relevant docs in top k) / (total relevant docs for query)
Prioritize for first-pass retrieval. High recall = the right docs are in the candidate set.
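A minimal sketch of the Recall@k formula for a single query, assuming `retrieved` is a ranked list of document IDs and `relevant` is the set of ground-truth IDs (both names are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("query has no ground-truth relevant docs")
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(recall_at_k(retrieved, relevant, 5))  # 2 of 3 relevant docs found
```

Averaging this value over all queries in the evaluation dataset gives the dataset-level score.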
Precision@k — How much noise in the results?
Precision@k = (relevant docs in top k) / k
Prioritize for reranking evaluation.
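A minimal sketch of the Precision@k formula for a single query. It assumes `retrieved` is a ranked list of unique document IDs and, following the formula literally, divides by k even when fewer than k results come back.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant_set)
    return hits / k


retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(precision_at_k(retrieved, relevant, 5))  # 2 relevant out of 5 returned
```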
MRR (Mean Reciprocal Rank) — How quickly do we find the first right document?
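MRR = (1/N) * sum over queries of (1 / rank_of_first_relevant_doc)
N is the number of queries; a query with no relevant document in its results contributes 0. Best for single-fact lookups.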
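A minimal sketch of MRR over a set of queries. The input shape, a list of `(retrieved, relevant)` pairs, is an assumption for illustration. Queries where no relevant document is retrieved add 0 to the sum.

```python
def mean_reciprocal_rank(runs):
    """runs: list of (ranked retrieved IDs, set of relevant IDs), one per query."""
    if not runs:
        raise ValueError("no queries to evaluate")
    total = 0.0
    for retrieved, relevant in runs:
        relevant_set = set(relevant)
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant_set:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(runs)
```

For example, three queries whose first relevant document sits at rank 1, rank 3, and nowhere give (1 + 1/3 + 0) / 3.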
NDCG@k — Are the most relevant results ranked highest?
DCG@k = sum over i=1..k of: rel_i / log2(i+1)
IDCG@k = DCG@k with documents sorted by decreasing relevance
NDCG@k = DCG@k / IDCG@k
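Use when documents have varying utility, that is, when relevance is graded rather than binary. Here rel_i is the relevance grade of the document at rank i.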
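A minimal sketch of the three NDCG formulas for a single query. It assumes `relevance` maps each judged document ID to a numeric grade, that unjudged documents count as 0, and that the ideal ranking is built from all judged documents for the query.

```python
import math


def dcg_at_k(gains, k):
    """DCG@k = sum over i=1..k of rel_i / log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(retrieved, relevance, k):
    """NDCG@k for one query.

    retrieved: ranked list of doc IDs.
    relevance: dict of doc ID -> graded relevance (assumed input shape).
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in retrieved]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0  # no relevant docs judged for this query
    return dcg_at_k(gains, k) / idcg
```

A perfectly ordered result list scores 1.0; the same documents in reverse order score lower because the high-grade documents are discounted by their rank.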
Which metric to use:
| Your Query Type | Primary Metric | Why |
|---|---|---|
| Single-fact lookups ("What's our refund policy?") | MRR | One key chunk needed, want it ranked first |
| Broad coverage ("Summarize market trends") | Recall@k | Need all relevant docs, even at cost of noise |
| Ranked results matter | NDCG@k or Precision@k | Quality of ranking matters for the answer |
Choosing k: Factual lookups typically use k=1-2; synthesis queries typically use k=5-10. Match k to what the pipeline actually does. Ask the PM: "How many documents does your pipeline pass to the LLM?"
Step 4: Optimize Chunking (if retrieval is the bottleneck)
Treat chunking as a tunable hyperparameter. Claude Code executes a grid search.
Claude Code executes:
Test combinations of chunk size and overlap.
Re-index the corpus for each configuration.
Measure retrieval metrics on the evaluation dataset.
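The three steps above can be sketched as a small grid search. The chunker works on a pre-tokenized list, and `evaluate` is a hypothetical callback standing in for "re-index the corpus and measure retrieval metrics"; it is assumed to return a dict that includes a `recall@5` key.

```python
from itertools import product


def chunk_tokens(tokens, size, overlap):
    """Split a token list into fixed-size chunks with the given overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


def grid_search(tokens, sizes, overlaps, evaluate):
    """Try every valid (size, overlap) pair and rank them by Recall@5.

    evaluate(chunks) -> dict of metrics is a stand-in for re-indexing
    and scoring against the evaluation dataset (assumption).
    """
    results = []
    for size, overlap in product(sizes, overlaps):
        if overlap >= size:
            continue  # skip invalid combinations
        metrics = evaluate(chunk_tokens(tokens, size, overlap))
        results.append({"chunk_size": size, "overlap": overlap, **metrics})
    return sorted(results, key=lambda r: r["recall@5"], reverse=True)
```

Sorting by a single metric is a simplification; the ranking key can be swapped for whichever metric matters most for the query type.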
Example results:
| Chunk Size | Overlap | Recall@5 | NDCG@5 |
|---|---|---|---|
| 128 tokens | 0 | 0.82 | 0.69 |
| 128 tokens | 64 | 0.88 | 0.75 |
| 256 tokens | 0 | 0.86 | 0.74 |
| 256 tokens | 128 | 0.89 | 0.77 |
| 512 tokens | 0 | 0.80 | 0.72 |
Present to PM: "256 tokens with 128 overlap gives the best Recall@5 (0.89) and NDCG@5 (0.77) of the configurations tested. Want to go with that?"
Content-aware chunking: When fixed-size chunks split related information, use natural document boundaries (sections, paragraphs). Augment chunks with context: prepend document title and section headings before embedding.
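One way to sketch content-aware chunking with context augmentation: split each section on paragraph breaks, then prepend the document title and section heading to the text that gets embedded. The input shape (a list of `(heading, body)` pairs), the `Title > Heading` prefix format, and the field names are assumptions for illustration.

```python
def chunk_by_section(title, sections):
    """Split sections into paragraph chunks and add title/heading context.

    sections: list of (heading, body) pairs; paragraphs are separated
    by blank lines. Returns one dict per paragraph.
    """
    chunks = []
    for heading, body in sections:
        for paragraph in body.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            chunks.append({
                "heading": heading,
                "text": paragraph,  # what gets shown to the LLM
                # What gets embedded: the paragraph plus its context.
                "embed_text": f"{title} > {heading}\n{paragraph}",
            })
    return chunks
```

Keeping `text` and `embed_text` separate lets the retriever benefit from the added context without repeating the title in every passage passed to the generator.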
Step 5: Evaluate Generation Quality
After confirming retrieval works, evaluate what the LLM does with the retrieved context.
Answer faithfulness — Does the output accurately reflect the retrieved context?
- Hallucinations: Information absent from source documents. In RAG, even correct facts from the LLM's own knowledge count as hallucinations, because they are not grounded in the retrieved context.
- Omissions: Relevant information from the context ignored in the output.
- Misinterpretations: Context information represented inaccurately.
Answer relevance — Does the output address the original query? An answer can be faithful to the context but fail to answer what the user asked.
Use /upgrade-evals to discover specific manifestations in your pipeline. Use /build-judge for failure modes that require subjective evaluation.
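Once each trace has a binary pass/fail label per dimension, the pass rate and top failure mode can be tallied as in the sketch below. The label format, a list of `(passed, failure_mode)` pairs, and the function name are assumptions; how the labels are produced (human review or a judge) is outside this sketch.

```python
from collections import Counter


def summarize_dimension(labels):
    """Summarize binary labels for one dimension, e.g. faithfulness.

    labels: list of (passed: bool, failure_mode: str or None).
    """
    if not labels:
        raise ValueError("no labels to summarize")
    passed = sum(1 for ok, _ in labels if ok)
    modes = Counter(mode for ok, mode in labels if not ok and mode)
    top_mode = modes.most_common(1)[0][0] if modes else None
    return {"pass_rate": passed / len(labels), "top_failure_mode": top_mode}
```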
Step 6: Diagnose with the Metric Pattern Table
Present this to the PM for root cause identification:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RAG DIAGNOSTIC TABLE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
| Context Quality | Faithfulness | Relevance | Diagnosis |
|-----------------|-------------|-----------|-----------|
| Good | Good | Bad | Generator attended to wrong section |
| Good | Bad | -- | Hallucination or misinterpretation |
| Bad | -- | -- | Retrieval problem. Fix first. |
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
What to tell your engineer:
- "Context quality is low" → Fix chunking, embeddings, or query preprocessing.
- "Faithful but not relevant" → Generator needs better instruction to address the actual question.
- "Not faithful" → Add grounding instructions, reduce temperature, or use citation enforcement.
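The diagnostic table can be read as a decision order: check context quality first, then faithfulness, then relevance. A minimal sketch, assuming each dimension has already been reduced to a boolean; the final "no failure" branch is an addition for completeness and is not a row in the table.

```python
def diagnose(context_good, faithful, relevant):
    """Map the three checks to a diagnosis, in the table's priority order."""
    if not context_good:
        return "Retrieval problem. Fix first."
    if not faithful:
        return "Hallucination or misinterpretation"
    if not relevant:
        return "Generator attended to wrong section"
    # Not a row in the diagnostic table; added so every input has an answer.
    return "No failure detected"
```

Checking context quality first mirrors the skill's rule that retrieval problems take priority over generation problems.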
Multi-Hop Retrieval
For queries requiring information from multiple chunks:
Two-hop Recall@k: Fraction of 2-hop queries where both ground-truth chunks appear in top k results.
TwoHopRecall@k = (1/N) * sum(1 if {Chunk1, Chunk2} ⊆ top_k_results else 0)
Diagnose: classify failures as hop 1 miss, hop 2 miss, or rank-out-of-top-k.
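A minimal sketch of two-hop Recall@k and the failure classification. The skill does not define the three failure labels precisely, so this sketch assumes one reading: a "miss" means the chunk is absent from the full ranked list, and "rank-out-of-top-k" means both chunks were retrieved but at least one ranked below k.

```python
def two_hop_recall_at_k(runs, k):
    """runs: list of (ranked retrieved IDs, (chunk1_id, chunk2_id)) per query."""
    if not runs:
        raise ValueError("no queries to evaluate")
    hits = sum(
        1 for retrieved, (c1, c2) in runs if {c1, c2} <= set(retrieved[:k])
    )
    return hits / len(runs)


def classify_two_hop(retrieved, chunk1, chunk2, k):
    """Label one query's outcome (label definitions are an assumption)."""
    if {chunk1, chunk2} <= set(retrieved[:k]):
        return "hit"
    everything = set(retrieved)
    if chunk1 not in everything:
        return "hop 1 miss"
    if chunk2 not in everything:
        return "hop 2 miss"
    # Both chunks were retrieved, but at least one fell below rank k.
    return "rank-out-of-top-k"
```

Counting the labels across failed queries shows whether the fix belongs in the first-hop query, the second-hop query, or the ranking stage.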
Output
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RAG EVALUATION RESULTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Eval dataset: [N] queries with ground-truth mappings
RETRIEVAL:
| Metric | Score | Interpretation |
|--------|-------|---------------|
| Recall@5 | [X] | [good/needs work] |
| Precision@5 | [X] | [good/needs work] |
| MRR | [X] | [good/needs work] |
GENERATION:
| Dimension | Pass Rate | Top Failure Mode |
|-----------|-----------|-----------------|
| Faithfulness | [%] | [description] |
| Relevance | [%] | [description] |
DIAGNOSIS: [Retrieval problem / Generation problem / Both]
RECOMMENDED ACTIONS:
1. [highest-impact fix]
2. [second fix]
3. [third fix]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Anti-Patterns
- Using a single end-to-end correctness metric without separating retrieval and generation.
- Jumping directly to metrics without reading traces first.
- Overfitting to synthetic evaluation data. Validate against real user queries regularly.
- Using similarity metrics (ROUGE, BERTScore, cosine similarity) as primary generation evaluation. Use binary evaluators driven by error analysis.
- Evaluating generation without checking context grounding.
- Optimizing generation when retrieval is the bottleneck.
Methodology: Adapted from Hamel Husain's evaluate-rag skill (evals-skills, MIT license).
PM adaptation: Each metric gets a plain-language explanation, Claude Code computes the metrics, and the PM interprets the diagnostic table.