Agent skill

add-benchmark

Add a new SWE benchmark task from a real GitHub bug-fix. Use when the user provides a GitHub issue or PR URL and wants to add it to the bench-swe pipeline.

Install this agent skill in your project

npx add-skill https://github.com/ory/lumen/tree/main/.claude/skills/add-benchmark

SKILL.md

Add SWE Benchmark

Add a new benchmark task to the bench-swe pipeline from a real GitHub bug-fix. The human provides the GitHub issue or PR URL; the agent handles extraction, validation, and file creation.

Arguments

  • url (required): GitHub issue or PR URL (e.g. https://github.com/gorilla/mux/issues/534 or https://github.com/gorilla/mux/pull/585)
  • language (required): One of: go, python, typescript, javascript, rust, ruby, java, c, cpp, php, csharp
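
As an illustration of the input validation these two arguments imply, here is a minimal sketch. The helper `parse_arguments`, the `TaskRef` type, and the URL pattern are hypothetical names introduced for this example; the skill itself may validate inputs differently.

```python
import re
from typing import NamedTuple

# Languages listed under "Arguments" above.
SUPPORTED_LANGUAGES = {
    "go", "python", "typescript", "javascript", "rust", "ruby",
    "java", "c", "cpp", "php", "csharp",
}

# Assumed URL shape: https://github.com/<owner>/<repo>/issues/<n> or .../pull/<n>
_GITHUB_URL = re.compile(
    r"^https://github\.com/(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+)"
    r"/(?P<kind>issues|pull)/(?P<number>\d+)/?$"
)


class TaskRef(NamedTuple):
    """Hypothetical container for the parsed arguments."""
    owner: str
    repo: str
    kind: str      # "issues" or "pull"
    number: int
    language: str


def parse_arguments(url: str, language: str) -> TaskRef:
    """Validate the url and language arguments, raising ValueError on bad input."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    match = _GITHUB_URL.match(url.strip())
    if match is None:
        raise ValueError(f"not a GitHub issue or PR URL: {url!r}")
    return TaskRef(
        match["owner"], match["repo"], match["kind"],
        int(match["number"]), language,
    )
```

Calling `parse_arguments("https://github.com/gorilla/mux/pull/585", "go")` returns a `TaskRef` whose `kind` is `"pull"`, which tells the caller whether the fix PR still needs to be resolved from an issue.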

Repository selection criteria

Good benchmark repos are focused libraries with a clear bug, not large applications. Before submitting a URL, check that the repo meets these criteria:

  • Size: < 50 MB and < 800 source files (excluding vendor/ and node_modules/)
  • Dependencies: < 50 direct dependencies (go.mod, package.json, etc.)
  • Scope: a library or small service, not a monorepo or full application

The agent will reject repos that exceed these limits.
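
The size limits above could be checked on a local clone roughly as follows. This is a sketch under stated assumptions: the set of source-file extensions, the exclusion of `.git`, and the reading of "50 MB" as 50 × 1024 × 1024 bytes are all guesses, and `measure_repo` and `within_limits` are hypothetical helper names rather than part of the skill.

```python
import os
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024       # "< 50 MB" (assumed to mean MiB on disk)
MAX_SOURCE_FILES = 800             # "< 800 source files"

# vendor/ and node_modules/ come from the criteria; .git is an assumption.
EXCLUDED_DIRS = {"vendor", "node_modules", ".git"}

# Assumed definition of a "source file" for the supported languages.
SOURCE_SUFFIXES = {
    ".go", ".py", ".ts", ".tsx", ".js", ".jsx", ".rs", ".rb",
    ".java", ".c", ".h", ".cpp", ".hpp", ".php", ".cs",
}


def measure_repo(root: str) -> tuple[int, int]:
    """Return (total_bytes, source_file_count), skipping excluded directories."""
    total_bytes = 0
    source_files = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue
            total_bytes += path.stat().st_size
            if path.suffix in SOURCE_SUFFIXES:
                source_files += 1
    return total_bytes, source_files


def within_limits(root: str) -> bool:
    """True when the clone is under both the size and the file-count limit."""
    total_bytes, source_files = measure_repo(root)
    return total_bytes < MAX_BYTES and source_files < MAX_SOURCE_FILES
```

The dependency limit is left out here because counting direct dependencies requires parsing each ecosystem's manifest (go.mod, package.json, and so on).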

Steps

  1. Dispatch the task-curator agent with the provided arguments. The agent will:

    • Validate inputs (URL, language)
    • Check repository size and dependency count (rejects oversized repos)
    • Resolve the fix PR (from issue or directly)
    • Clone the repo, extract base/fix commits, and generate the gold patch
    • Determine the test command from repo conventions
    • Write task JSON to bench-swe/tasks/{language}/ and patch to bench-swe/patches/
    • Run 5 inline verification checks (patch applies, files match, no leaks, schema completeness, no test files in patch)
    • Fix any issues found during verification
  2. Report the result including:

    • Task ID, repo, issue URL
    • Files and lines changed
    • Verification table
