evaluation-harness

Builds repeatable evaluation systems with golden datasets, scoring rubrics, pass/fail thresholds, and regression reports. Use for "LLM evaluation", "testing AI systems", "quality assurance", or "model benchmarking".

Install this agent skill to your Project

npx add-skill https://github.com/patricio0312rev/skills/tree/main/ai-engineering/evaluation-harness

SKILL.md

Evaluation Harness

Build systematic evaluation frameworks for LLM applications.

Golden Dataset Format

json

[
  {
    "id": "test_001",
    "category": "code_generation",
    "input": "Write a Python function to reverse a string",
    "expected_output": "def reverse_string(s: str) -> str:\n    return s[::-1]",
    "rubric": {
      "correctness": 1.0,
      "style": 0.8,
      "documentation": 0.5
    },
    "metadata": {
      "difficulty": "easy",
      "tags": ["python", "strings"]
    }
  }
]
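A dataset in this format is easier to trust if it is checked before any evaluation runs. The following is a minimal sketch, not part of the skill itself: `validate_dataset`, `load_golden_dataset`, and `REQUIRED_FIELDS` are hypothetical names, and treating every field except `metadata` as required is an assumption.

```python
import json
from typing import Any, Dict, List

# Field names mirror the example above; treating them all as required is an assumption.
REQUIRED_FIELDS = ("id", "category", "input", "expected_output", "rubric")


def validate_dataset(cases: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found; an empty list means no problems were detected."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<index {index}>")
        # Every test case needs the core fields
        for field in REQUIRED_FIELDS:
            if field not in case:
                problems.append(f"{label}: missing field '{field}'")
        # Ids must be unique so results can be matched across runs
        if "id" in case:
            if case["id"] in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(case["id"])
        # Rubric values are assumed to be numbers in [0, 1]
        for criterion, value in case.get("rubric", {}).items():
            if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
                problems.append(f"{label}: rubric '{criterion}' must be a number in [0, 1]")
    return problems


def load_golden_dataset(path: str) -> List[Dict[str, Any]]:
    """Load the JSON file and raise ValueError if validation finds problems."""
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)
    problems = validate_dataset(cases)
    if problems:
        raise ValueError("Invalid golden dataset:\n" + "\n".join(problems))
    return cases
```

Failing fast here keeps a malformed test case from surfacing later as a confusing scoring error.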

Scoring Rubrics

python

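import json
from typing import Dict, List

# get_embedding, cosine_similarity, and llm are not defined in this skill;
# they are assumed to come from your own embedding and LLM client code.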

def score_exact_match(actual: str, expected: str) -> float:
    """Binary score: 1.0 if exact match, 0.0 otherwise"""
    return 1.0 if actual.strip() == expected.strip() else 0.0

def score_semantic_similarity(actual: str, expected: str) -> float:
    """Cosine similarity of embeddings"""
    actual_emb = get_embedding(actual)
    expected_emb = get_embedding(expected)
    return cosine_similarity(actual_emb, expected_emb)

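def score_contains_keywords(actual: str, keywords: List[str]) -> float:
    """Fraction of required keywords present (0.0 to 1.0)"""
    if not keywords:
        return 1.0  # nothing is required, so nothing is missing
    actual_lower = actual.lower()
    found = sum(1 for kw in keywords if kw.lower() in actual_lower)
    return found / len(keywords)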

def score_with_llm(actual: str, expected: str, rubric: Dict[str, float]) -> Dict[str, float]:
    """Use LLM as judge"""
    prompt = f"""
    Grade this output on a scale of 0-1 for each criterion:

    Expected: {expected}
    Actual: {actual}

    Criteria: {', '.join(rubric.keys())}

    Return JSON with scores.
    """
    return json.loads(llm(prompt))
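The semantic similarity scorer above relies on a `cosine_similarity` helper that the snippet does not define. A minimal pure-Python sketch is shown here, assuming embeddings arrive as plain sequences of floats; returning 0.0 for a zero vector is a convention chosen for this sketch, not something the skill specifies.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine is undefined for a zero vector; this sketch reports no similarity
        return 0.0
    return dot / (norm_a * norm_b)
```

For large embedding batches a vectorized library would likely be faster; the loop version is only meant to make the computation explicit.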

Test Runner

python

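import json

class EvaluationHarness:
    def __init__(self, dataset_path: str):
        self.dataset = self.load_dataset(dataset_path)
        self.results = []

    def load_dataset(self, path: str):
        # Read the golden dataset (a JSON array of test cases) from disk
        with open(path, encoding="utf-8") as f:
            return json.load(f)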

    def run_evaluation(self, model_fn):
        for test_case in self.dataset:
            # Generate output
            actual = model_fn(test_case["input"])

            # Score
            scores = self.score_output(
                actual,
                test_case["expected_output"],
                test_case["rubric"]
            )

            # Record result
            self.results.append({
                "test_id": test_case["id"],
                "category": test_case["category"],
                "scores": scores,
                "passed": self.check_threshold(scores, test_case),
                "actual_output": actual,
            })

        return self.generate_report()

    def score_output(self, actual, expected, rubric):
        return {
            "exact_match": score_exact_match(actual, expected),
            "semantic_similarity": score_semantic_similarity(actual, expected),
            **score_with_llm(actual, expected, rubric)
        }

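    def check_threshold(self, scores, test_case):
        # "min_scores" is an optional per-test key (metric -> minimum score).
        # A test case without it passes by default; a missing metric counts as 0.
        min_scores = test_case.get("min_scores", {})
        for metric, threshold in min_scores.items():
            if scores.get(metric, 0) < threshold:
                return False
        return True

    def generate_report(self):
        # Return the per-test results in dataset order, the shape that
        # generate_regression_report expects for its comparison
        return self.results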
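The per-test results recorded by `run_evaluation` can also be rolled up into pass rates per category. This is a minimal sketch under the assumption that each result has the `category` and `passed` keys shown above; `summarize_results` is a hypothetical helper, not part of the skill.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize_results(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate pass counts and pass rates overall and per category."""
    by_category = defaultdict(lambda: {"total": 0, "passed": 0})
    for result in results:
        bucket = by_category[result["category"]]
        bucket["total"] += 1
        bucket["passed"] += int(result["passed"])

    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "total": total,
        "passed": passed,
        # Guard against an empty result list
        "pass_rate": passed / total if total else 0.0,
        "by_category": {
            name: {**counts, "pass_rate": counts["passed"] / counts["total"]}
            for name, counts in by_category.items()
        },
    }
```

A per-category breakdown makes it easier to see whether a drop in the overall pass rate comes from one kind of task or from all of them.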

Thresholds & Pass Criteria

python

from typing import Dict

# Define thresholds per category
THRESHOLDS = {
    "code_generation": {
        "correctness": 0.9,
        "style": 0.7,
    },
    "summarization": {
        "semantic_similarity": 0.8,
        "brevity": 0.7,
    },
    "classification": {
        "exact_match": 1.0,
    }
}

def check_test_passed(result: Dict) -> bool:
    category = result["category"]
    thresholds = THRESHOLDS.get(category, {})

    for metric, threshold in thresholds.items():
        if result["scores"].get(metric, 0) < threshold:
            return False

    return True

Regression Report

python

def generate_regression_report(baseline_results, current_results):
    report = {
        "summary": {},
        "regressions": [],
        "improvements": [],
        "unchanged": 0
    }

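    # zip() silently stops at the shorter list, so check the lengths first
    if len(baseline_results) != len(current_results):
        raise ValueError("baseline and current results differ in length")

    for baseline, current in zip(baseline_results, current_results):
        # Both runs must list tests in the same order; raise instead of
        # asserting, because assert statements are skipped under `python -O`
        if baseline["test_id"] != current["test_id"]:
            raise ValueError(
                f"test_id mismatch: {baseline['test_id']} != {current['test_id']}"
            )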

        baseline_passed = baseline["passed"]
        current_passed = current["passed"]

        if baseline_passed and not current_passed:
            report["regressions"].append({
                "test_id": baseline["test_id"],
                "category": baseline["category"],
                "baseline_scores": baseline["scores"],
                "current_scores": current["scores"],
            })
        elif not baseline_passed and current_passed:
            report["improvements"].append(baseline["test_id"])
        else:
            report["unchanged"] += 1

    report["summary"] = {
        "total_tests": len(baseline_results),
        "regressions": len(report["regressions"]),
        "improvements": len(report["improvements"]),
        "unchanged": report["unchanged"],
    }

    return report
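The report above compares pass/fail status only, so a score can fall noticeably without being flagged as long as it stays above its threshold. As a complement, here is a minimal sketch that lists per-metric score drops. `find_score_drops` and its `tolerance` default of 0.05 are assumptions made for illustration, not part of the skill.

```python
from typing import Any, Dict, List


def find_score_drops(
    baseline_results: List[Dict[str, Any]],
    current_results: List[Dict[str, Any]],
    tolerance: float = 0.05,
) -> List[Dict[str, Any]]:
    """List metrics whose score fell by more than `tolerance` since the baseline."""
    # Match results by test_id rather than by position
    current_by_id = {r["test_id"]: r for r in current_results}
    drops = []
    for baseline in baseline_results:
        current = current_by_id.get(baseline["test_id"])
        if current is None:
            continue  # tests missing from the current run are out of scope here
        for metric, old_score in baseline["scores"].items():
            new_score = current["scores"].get(metric)
            if new_score is not None and old_score - new_score > tolerance:
                drops.append({
                    "test_id": baseline["test_id"],
                    "metric": metric,
                    "baseline": old_score,
                    "current": new_score,
                })
    return drops
```

The tolerance exists because LLM-judged scores tend to vary between runs; a suitable value depends on how noisy the judge is.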

Continuous Evaluation

python

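import sys

# Run evaluation on every commit.
# production_model (the callable under test) and load_baseline are assumed
# to be defined elsewhere in your project.
def ci_evaluation():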
    harness = EvaluationHarness("golden_dataset.json")
    results = harness.run_evaluation(production_model)

    # Check for regressions
    baseline = load_baseline("baseline_results.json")
    report = generate_regression_report(baseline, results)

    # Fail CI if regressions
    if report["summary"]["regressions"] > 0:
        print(f"❌ {report['summary']['regressions']} regressions detected!")
        sys.exit(1)

    print("✅ All tests passed!")
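The CI function above calls `load_baseline`, which the snippet leaves undefined. One way to fill that gap is sketched below, assuming the baseline is simply the list of per-test results from an earlier run stored as JSON; `save_baseline` is a hypothetical companion for writing that file.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def save_baseline(results: List[Dict[str, Any]], path: str) -> None:
    """Write per-test results to disk so later runs can be compared against them."""
    Path(path).write_text(json.dumps(results, indent=2), encoding="utf-8")


def load_baseline(path: str) -> List[Dict[str, Any]]:
    """Read back results previously written by save_baseline."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Committing the baseline file alongside the golden dataset keeps both under version control, so a change to either shows up in review.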

Best Practices

Representative dataset: Cover edge cases
Multiple metrics: Don't rely on one score
Human validation: Review LLM judge scores
Version datasets: Track changes over time
Automate in CI: Catch regressions early
Regular updates: Add new test cases
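
Reviewing LLM judge scores is easier with a number to track. As a minimal sketch, the mean absolute difference between human and judge scores on a shared sample gives a rough agreement measure. `judge_agreement` is a hypothetical helper, and the input shape (test id mapped to a single score in the 0-1 range) is an assumption.

```python
from typing import Dict


def judge_agreement(human: Dict[str, float], judge: Dict[str, float]) -> float:
    """Mean absolute difference over test ids scored by both; lower means closer agreement."""
    shared = sorted(set(human) & set(judge))
    if not shared:
        raise ValueError("no overlapping test ids to compare")
    return sum(abs(human[test_id] - judge[test_id]) for test_id in shared) / len(shared)
```

If the value grows over time, that suggests the judge prompt or rubric needs another look before its scores are trusted in CI.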

Output Checklist

Golden dataset created (50+ examples)
Multiple scoring functions
Pass/fail thresholds defined
Test runner implemented
Regression comparison
Report generation
CI integration
Baseline established

Maintainer

patricio0312rev Core maintainer

Source details

Full Name: patricio0312rev/skills
Branch: main
Path in repo: ai-engineering/evaluation-harness
License: MIT License
Topics: ai claude-code claude cursor skills copilot-coding-agent cursor-ai
