Agent skill
skill-creator
Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, update or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
Install this agent skill to your Project
npx add-skill https://github.com/EvoScientist/EvoScientist/tree/main/EvoScientist/skills/skill-creator
SKILL.md
Skill Creator
A skill for creating new skills and iteratively improving them.
At a high level, the process of creating a skill goes like this:
- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run the agent with access to the skill on them
- Help the user evaluate the results both qualitatively and quantitatively
- While the runs happen in the background, draft quantitative evals if there aren't any; if some already exist, use them as is or modify them where something needs to change. Then explain the evals to the user, whether you drafted them or they already existed
- Use the eval-viewer/generate_review.py script to show the user the results, and also let them look at the quantitative metrics
- Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
- Repeat until you're satisfied
- Expand the test set and try again at larger scale
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages.
Communicating with the user
The skill creator is likely to be used by people with widely varying familiarity with coding jargon. Pay attention to context cues to understand how to phrase your communication! In the default case:
- "evaluation" and "benchmark" are borderline, but OK
- for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
It's OK to briefly explain terms if you're in doubt.
Skill Design Fundamentals
About Skills
Skills are modular, self-contained packages that extend agent capabilities by providing specialized knowledge, workflows, and tools. Think of them as "onboarding guides" for specific domains or tasks -- they transform a general-purpose agent into a specialized agent equipped with procedural knowledge and domain expertise.
What Skills Provide
- Specialized workflows - Multi-step procedures for specific domains
- Tool integrations - Instructions for working with specific file formats or APIs
- Domain expertise - Company-specific knowledge, schemas, business logic
- Bundled resources - Scripts, references, and assets for complex and repetitive tasks
Concise is Key
The context window is a public good. Skills share the context window with everything else the agent needs: system prompt, conversation history, other Skills' metadata, and the actual user request.
Default assumption: The agent is already very capable. Only add context the agent doesn't already have. Challenge each piece of information: "Does the agent really need this explanation?" and "Does this paragraph justify its token cost?"
Prefer concise examples over verbose explanations.
Set Appropriate Degrees of Freedom
Match the level of specificity to the task's fragility and variability:
High freedom (text-based instructions): Use when multiple approaches are valid, decisions depend on context, or heuristics guide the approach.
Medium freedom (pseudocode or scripts with parameters): Use when a preferred pattern exists, some variation is acceptable, or configuration affects behavior.
Low freedom (specific scripts, few parameters): Use when operations are fragile and error-prone, consistency is critical, or a specific sequence must be followed.
Think of the agent as exploring a path: a narrow bridge with cliffs needs specific guardrails (low freedom), while an open field allows many routes (high freedom).
Anatomy of a Skill
Every skill consists of a required SKILL.md file and optional bundled resources:
skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter metadata (required)
│   │   ├── name: (required)
│   │   └── description: (required)
│   └── Markdown instructions (required)
└── Bundled Resources (optional)
    ├── scripts/ - Executable code (Python/Bash/etc.)
    ├── references/ - Documentation intended to be loaded into context as needed
    └── assets/ - Files used in output (templates, icons, fonts, etc.)
SKILL.md (required)
- Frontmatter (YAML): Contains name and description fields. These are the only fields that the agent reads to determine when the skill gets used, so it is very important to be clear and comprehensive in describing what the skill is and when it should be used.
- Body (Markdown): Instructions and guidance for using the skill. Only loaded AFTER the skill triggers (if at all).
Scripts (scripts/)
Executable code (Python/Bash/etc.) for tasks that require deterministic reliability or are repeatedly rewritten.
- When to include: When the same code is being rewritten repeatedly or deterministic reliability is needed
- Benefits: Token efficient, deterministic, may be executed without loading into context
References (references/)
Documentation and reference material intended to be loaded as needed into context.
- When to include: For documentation that the agent should reference while working
- Best practice: If files are large (>10k words), include grep search patterns in SKILL.md
- Avoid duplication: Information should live in either SKILL.md or references files, not both
Assets (assets/)
Files not intended to be loaded into context, but rather used within the output the agent produces.
- When to include: When the skill needs files that will be used in the final output (templates, images, fonts, etc.)
What Not to Include
Do NOT create extraneous documentation or auxiliary files (README.md, CHANGELOG.md, INSTALLATION_GUIDE.md, etc.). The skill should only contain information needed for an AI agent to do the job at hand.
Progressive Disclosure
Skills use a three-level loading system to manage context efficiently:
- Metadata (name + description) - Always in context (~100 words)
- SKILL.md body - When skill triggers (<500 lines ideal)
- Bundled resources - As needed (unlimited, scripts can execute without loading)
Key patterns:
- Keep SKILL.md under 500 lines; split into reference files when approaching this limit
- Reference files clearly from SKILL.md with guidance on when to read them
- For large reference files (>300 lines), include a table of contents
Domain organization: When a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
The agent reads only the relevant reference file.
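The size guidance above (SKILL.md under 500 lines, a table of contents for reference files over 300 lines) can be checked mechanically. The following is a minimal sketch, not part of the skill's bundled scripts: the function name `size_warnings` is hypothetical, and detecting a table of contents by looking for the word "contents" is an assumption made for illustration.

```python
from pathlib import Path

# Thresholds taken from the guidance above.
MAX_SKILL_LINES = 500
TOC_THRESHOLD_LINES = 300


def size_warnings(skill_dir: Path) -> list[str]:
    """Return human-readable warnings for files that exceed the size guidance."""
    warnings = []

    skill_md = skill_dir / "SKILL.md"
    n_lines = len(skill_md.read_text(encoding="utf-8").splitlines())
    if n_lines >= MAX_SKILL_LINES:
        warnings.append(
            f"SKILL.md has {n_lines} lines; consider moving detail into references/"
        )

    refs = skill_dir / "references"
    if refs.is_dir():
        for ref in sorted(refs.glob("*.md")):
            text = ref.read_text(encoding="utf-8")
            too_long = len(text.splitlines()) > TOC_THRESHOLD_LINES
            # Assumption: a table of contents mentions the word "contents".
            has_toc = "contents" in text.lower()
            if too_long and not has_toc:
                warnings.append(
                    f"{ref.name} is over {TOC_THRESHOLD_LINES} lines "
                    "and has no table of contents"
                )
    return warnings
```

An empty list means the skill is within the suggested limits; each string otherwise names one file to split or annotate.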
Creating a Skill
Capture Intent
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first.
- What should this skill enable the agent to do?
- When should this skill trigger? (what user phrases/contexts)
- What's the expected output format?
- Should we set up test cases to verify the skill works?
Interview and Research
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out.
Write the SKILL.md
Based on the user interview, fill in these components:
- name: Skill identifier (kebab-case)
- description: When to trigger, what it does. This is the primary triggering mechanism -- include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Make descriptions a little "pushy" to combat undertriggering.
- compatibility: Required tools, dependencies (optional)
- the rest of the skill
Skill Writing Guide
Writing style: Use imperative form. Explain the why behind instructions rather than heavy-handed MUSTs. LLMs have good theory of mind -- transmit understanding, not just rules.
Output format patterns: See references/output-patterns.md for template and example patterns.
Multi-step processes: See references/workflows.md for sequential workflows and conditional logic.
Implementing Resources
Start with the reusable resources identified during planning: scripts/, references/, and assets/ files. Added scripts must be tested by actually running them.
Test Cases
After writing the skill draft, come up with 2-3 realistic test prompts. Share them with the user for confirmation. Save test cases to evals/evals.json:
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
See references/schemas.md for the full schema (including the assertions field).
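If you prefer to generate the file rather than write it by hand, a small helper can produce the structure shown above. This is a sketch under stated assumptions: `save_evals` is a hypothetical name, ids are numbered from 1 to match the example, and only the fields shown above are written (the full schema lives in references/schemas.md).

```python
import json
from pathlib import Path


def save_evals(skill_dir: Path, skill_name: str, cases: list[dict]) -> Path:
    """Write test cases to <skill_dir>/evals/evals.json using the fields shown above."""
    evals = [
        {
            "id": i,  # assumption: ids start at 1, as in the example
            "prompt": case["prompt"],
            "expected_output": case["expected_output"],
            "files": case.get("files", []),
        }
        for i, case in enumerate(cases, start=1)
    ]
    path = skill_dir / "evals" / "evals.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"skill_name": skill_name, "evals": evals}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```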
Running and Evaluating Test Cases
This section is one continuous sequence. Put results in <skill-name>-workspace/ as a sibling to the skill directory. Organize results by iteration (iteration-1/, iteration-2/, etc.) and within that, each test case gets a directory (eval-0/, eval-1/, etc.).
Step 1: Spawn all runs (with-skill AND baseline) using task tool
For each test case, use the task tool to spawn two sub-agents -- one with the skill, one without. Launch everything at once so it all finishes around the same time.
With-skill run:
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the user cares about>
Baseline run (same prompt, but the baseline depends on context):
- Creating a new skill: no skill at all. Same prompt, no skill path, save to without_skill/outputs/.
- Improving an existing skill: the old version. Snapshot the skill first, then point the baseline at the snapshot. Save to old_skill/outputs/.
Write an eval_metadata.json for each test case (assertions can be empty for now):
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
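The workspace layout and metadata file described in this step can be set up before the runs are spawned. The sketch below is illustrative only: `prepare_eval_dir` is a hypothetical helper, and placing eval_metadata.json directly inside eval-<ID>/ is an assumption, since the text above does not say where the file lives.

```python
import json
from pathlib import Path


def prepare_eval_dir(
    workspace: Path,
    iteration: int,
    eval_id: int,
    eval_name: str,
    prompt: str,
    baseline: str = "without_skill",  # use "old_skill" when improving an existing skill
) -> Path:
    """Create the output directories for one test case and write its eval_metadata.json."""
    eval_dir = workspace / f"iteration-{iteration}" / f"eval-{eval_id}"
    # One outputs/ directory per configuration: the skill run and its baseline.
    for config in ("with_skill", baseline):
        (eval_dir / config / "outputs").mkdir(parents=True, exist_ok=True)

    metadata = {
        "eval_id": eval_id,
        "eval_name": eval_name,
        "prompt": prompt,
        "assertions": [],  # filled in later, once assertions are drafted
    }
    # Assumption: the metadata file sits directly inside eval-<ID>/.
    (eval_dir / "eval_metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    return eval_dir
```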
Step 2: While runs are in progress, draft assertions
Draft quantitative assertions for each test case and explain them to the user. Good assertions are objectively verifiable and have descriptive names. Subjective skills are better evaluated qualitatively.
Update the eval_metadata.json files and evals/evals.json with the assertions once drafted.
Step 3: As runs complete, capture timing data
When each sub-agent task completes, save timing data to timing.json:
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
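Note that total_duration_seconds is simply duration_ms converted to seconds and rounded to one decimal place. A minimal sketch of saving the file follows; `save_timing` is a hypothetical helper, and storing timing.json in the run's own directory is an assumption.

```python
import json
from pathlib import Path


def save_timing(run_dir: Path, total_tokens: int, duration_ms: int) -> dict:
    """Write timing.json for one run, deriving seconds from milliseconds."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: timing.json is stored in the run's own directory.
    (run_dir / "timing.json").write_text(json.dumps(timing, indent=2), encoding="utf-8")
    return timing
```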
Step 4: Grade, aggregate, and launch the viewer
Once all runs are done:
- Grade each run -- spawn a grader sub-agent that reads agents/grader.md and evaluates each assertion. Save results to grading.json. The grading.json expectations array must use fields text, passed, and evidence. For programmatic assertions, write and run a script.
- Aggregate into benchmark:
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
This produces benchmark.json and benchmark.md.
- Do an analyst pass -- read the benchmark data and surface patterns. See agents/analyzer.md for what to look for.
- Launch the viewer:
nohup python <skill-creator-path>/eval-viewer/generate_review.py \
<workspace>/iteration-N \
--skill-name "my-skill" \
--benchmark <workspace>/iteration-N/benchmark.json \
> /dev/null 2>&1 &
VIEWER_PID=$!
For iteration 2+, also pass --previous-workspace <workspace>/iteration-<N-1>.
Headless environments: Use --static <output_path> to write a single-file HTML instead of starting a server. Note: the file still loads Google Fonts and SheetJS from CDN, so an internet connection is needed for full rendering.
- Tell the user the results are ready in their browser.
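Because the grading step requires each expectation to carry text, passed, and evidence, it can help to sanity-check a grading result before aggregating. The sketch below assumes the array sits under a top-level "expectations" key; `pass_rate` is a hypothetical helper and the authoritative structure is in references/schemas.md.

```python
REQUIRED_FIELDS = {"text", "passed", "evidence"}


def pass_rate(grading: dict) -> float:
    """Validate the expectations array and return the fraction of assertions that passed."""
    # Assumption: the array lives under a top-level "expectations" key.
    expectations = grading["expectations"]
    for i, item in enumerate(expectations):
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            raise ValueError(f"expectation {i} is missing fields: {sorted(missing)}")
    if not expectations:
        return 0.0
    passed = sum(1 for item in expectations if item["passed"])
    return passed / len(expectations)
```

A ValueError here usually means the grader used different field names, which is worth fixing before the results are aggregated.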
What the user sees in the viewer
The "Outputs" tab shows one test case at a time with prompt, output, previous output (iteration 2+), formal grades, and feedback textbox. The "Benchmark" tab shows stats summary. Navigation via prev/next or arrow keys. "Submit All Reviews" saves to feedback.json.
Step 5: Read the feedback
When the user tells you they're done, read feedback.json. Empty feedback means the user thought it was fine. Focus improvements on test cases with specific complaints.
Improving the Skill
How to think about improvements
- Generalize from the feedback. Don't overfit to specific examples. Rather than fiddly changes or oppressive MUSTs, try different metaphors or patterns.
- Keep the prompt lean. Remove things that aren't pulling their weight. Read transcripts, not just outputs.
- Explain the why. LLMs are smart. Transmit understanding into instructions rather than rigid rules.
- Look for repeated work across test cases. If all test cases independently wrote similar helper scripts, bundle that script in scripts/.
The iteration loop
After improving:
- Apply improvements to the skill
- Rerun all test cases into a new iteration-<N+1>/ directory, including baselines
- Launch the reviewer with --previous-workspace
- Wait for user review
- Read feedback, improve again, repeat
Keep going until the user is happy, feedback is empty, or you're not making progress.
Advanced: Blind Comparison
For rigorous comparison between two versions, read agents/comparator.md and agents/analyzer.md. The idea: give two outputs to an independent agent without telling it which is which, and let it judge quality.
This is optional and most users won't need it.
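The core of a blind comparison is hiding which version produced which output. The sketch below illustrates only that shuffling idea; it is not the procedure from agents/comparator.md, and the name `blind_pair` and the "Output 1" / "Output 2" labels are hypothetical.

```python
import random


def blind_pair(output_a: str, output_b: str, rng: random.Random) -> tuple[dict, dict]:
    """Shuffle two outputs behind neutral labels.

    Returns (shown, key): `shown` is what the judge sees, `key` maps each
    neutral label back to the version ("a" or "b") and stays hidden until
    the judgment has been made.
    """
    versions = [("a", output_a), ("b", output_b)]
    rng.shuffle(versions)
    shown = {"Output 1": versions[0][1], "Output 2": versions[1][1]}
    key = {"Output 1": versions[0][0], "Output 2": versions[1][0]}
    return shown, key
```

Pass only `shown` to the judging agent, then use `key` afterwards to map its preference back to the old or new version.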
Description Optimization
The description field in SKILL.md frontmatter determines whether the agent invokes a skill. After creating or improving a skill, offer to optimize the description.
Step 1: Generate trigger eval queries
Create 20 eval queries -- a mix of should-trigger and should-not-trigger. Save as JSON:
[
{"query": "the user prompt", "should_trigger": true},
{"query": "another prompt", "should_trigger": false}
]
Queries must be realistic and specific. Focus on edge cases rather than clear-cut ones.
For should-trigger queries (8-10): different phrasings of the same intent, cases where the user doesn't explicitly name the skill but clearly needs it.
For should-not-trigger queries (8-10): near-misses that share keywords but need something different. Don't make them obviously irrelevant.
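Before showing the set to the user, it can be worth confirming that the mix matches the targets above. A minimal sketch, assuming the JSON has already been loaded into a list of dicts; `summarize_trigger_evals` is a hypothetical helper.

```python
def summarize_trigger_evals(queries: list[dict]) -> dict:
    """Count should-trigger and should-not-trigger queries and check the 8-10 targets."""
    positives = sum(1 for q in queries if q["should_trigger"])
    negatives = len(queries) - positives
    return {
        "total": len(queries),
        "should_trigger": positives,
        "should_not_trigger": negatives,
        # Targets from the guidance above: 8-10 queries of each kind.
        "balanced": 8 <= positives <= 10 and 8 <= negatives <= 10,
    }
```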
Step 2: Review with user
Present the eval set using the HTML template:
- Read assets/eval_review.html
- Replace __EVAL_DATA_PLACEHOLDER__, __SKILL_NAME_PLACEHOLDER__, __SKILL_DESCRIPTION_PLACEHOLDER__
- Write to a temp file and open it
- User edits queries and exports the eval set
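The placeholder substitution in the steps above can be sketched as plain string replacement. How the template expects each value to be encoded is not stated here, so the following assumes the eval data is inserted as a JSON array and the name and description as plain text; both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def render_eval_review(template: str, skill_name: str, description: str,
                       queries: list[dict]) -> str:
    """Fill the three placeholders in the eval review template."""
    return (
        template
        # Assumption: the template expects a JSON array for the eval data.
        .replace("__EVAL_DATA_PLACEHOLDER__", json.dumps(queries))
        # Assumption: name and description are inserted as plain, unescaped text.
        .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
        .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", description)
    )


def write_temp_html(html: str) -> Path:
    """Write the rendered page to a temporary .html file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as f:
        f.write(html)
    return Path(f.name)
```

The returned path can then be opened in a browser, for example with the standard library's `webbrowser.open(path.as_uri())`.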
Step 3: Run the optimization loop
Tell the user this will take some time. Save the eval set to the workspace, then run:
python -m scripts.run_loop \
--eval-set <path-to-trigger-eval.json> \
--skill-path <path-to-skill> \
--model <model> --provider <provider> \
--max-iterations 5 \
--verbose
Use the model and provider from the user's EvoSci configuration so the triggering test matches their actual experience. Supports all 7+ providers (Anthropic, OpenAI, Google, NVIDIA, SiliconFlow, OpenRouter, Ollama, and custom endpoints).
This handles the full optimization loop automatically: splits eval set 60/40 train/test, evaluates current description (3 runs per query), proposes improvements, re-evaluates, iterates up to 5 times. Returns JSON with best_description selected by test score.
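To make the loop's moving parts concrete, here is a rough sketch of a 60/40 split, a per-description score, and selection by test score. It is not the implementation of scripts.run_loop: all function names are hypothetical, the shuffle is unstratified, and treating a query as triggered when the majority of its runs trigger is an assumption about how the 3 runs per query are combined.

```python
import random


def split_train_test(queries: list[dict], train_fraction: float = 0.6,
                     seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Shuffle the eval set and split it into train and test portions."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority_accuracy(queries: list[dict], runs: dict[str, list[bool]]) -> float:
    """Score a description: a query counts as triggered if most of its runs triggered."""
    if not queries:
        return 0.0
    correct = 0
    for q in queries:
        results = runs[q["query"]]  # e.g. three booleans, one per run
        triggered = sum(results) * 2 > len(results)
        if triggered == q["should_trigger"]:
            correct += 1
    return correct / len(queries)


def pick_best(candidates: list[dict]) -> dict:
    """Select the candidate description with the highest held-out test score."""
    return max(candidates, key=lambda c: c["test_score"])
```

Selecting on the test portion rather than the train portion is what keeps the chosen description from simply memorizing the queries it was tuned against.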
How skill triggering works
Skills appear in the agent's system prompt with name + description. The agent decides whether to consult a skill based on that description. The agent only consults skills for tasks it can't easily handle on its own -- complex, multi-step, or specialized queries reliably trigger skills when the description matches.
Step 4: Apply the result
Take best_description from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report scores.
Installing and Packaging
Install
Once the skill is ready, install it:
skill_manager(action="install", source="/<skill-name>")
Verify
skill_manager(action="list")
Validation checklist
- SKILL.md has valid YAML frontmatter with name and description
- Skill name is kebab-case (lowercase, digits, hyphens)
- Description clearly explains what the skill does and when to use it
- All referenced resource files exist
- No extraneous files (README.md, CHANGELOG.md, etc.)
You can also validate programmatically:
python -m scripts.quick_validate <path/to/skill>
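For a sense of what such checks involve, the sketch below covers three items from the checklist using only the standard library. It is not the code of scripts.quick_validate: the function names are hypothetical, and the frontmatter reader is a deliberate simplification that only understands single-line `key: value` pairs rather than full YAML.

```python
import re
from pathlib import Path

KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
EXTRANEOUS = {"README.md", "CHANGELOG.md", "INSTALLATION_GUIDE.md"}


def read_frontmatter(text: str) -> dict:
    """Parse top-level `key: value` lines between the opening and closing '---'."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        # Simplification: skip indented (nested) lines and comments.
        if sep and not key.startswith((" ", "\t", "#")):
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # closing delimiter never found


def validate_skill(skill_dir: Path) -> list[str]:
    """Return a list of checklist violations; an empty list means all checks passed."""
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        return ["SKILL.md is missing"]
    problems = []
    fields = read_frontmatter(skill_md.read_text(encoding="utf-8"))
    if not KEBAB_CASE.match(fields.get("name", "")):
        problems.append("name is missing or not kebab-case")
    if not fields.get("description"):
        problems.append("description is missing or empty")
    for extra in sorted(EXTRANEOUS):
        if (skill_dir / extra).exists():
            problems.append(f"extraneous file: {extra}")
    return problems
```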
Initialize a new skill
python -m scripts.init_skill <skill-name> --path <path>
Package for distribution
python -m scripts.package_skill <path/to/skill-folder>
Reference Files
The agents/ directory contains instructions for specialized sub-agents:
- agents/grader.md -- How to evaluate assertions against outputs
- agents/comparator.md -- How to do blind A/B comparison
- agents/analyzer.md -- How to analyze benchmark results
The references/ directory has additional documentation:
- references/schemas.md -- JSON structures for evals.json, grading.json, benchmark.json
- references/output-patterns.md -- Output format template patterns
- references/workflows.md -- Sequential workflow and conditional logic patterns
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
find-skills
Helps users discover agent skills from the open ecosystem. Searches skills.sh and presents options for installation via the built-in skill_manager tool.
paper-writing
Guides writing academic papers section by section using an 11-step workflow with LaTeX templates and counterintuitive writing tactics. Covers Abstract, Introduction, Method, Experiments, Related Work, Conclusion, and Supplementary. Use when: user asks to write or draft a paper section, needs LaTeX templates, wants to improve academic writing quality, optimize novelty framing, or mentions 'write introduction', 'draft method', 'paper writing'. Do NOT use for pre-submission review (use paper-review), experiment execution (use experiment-pipeline), or paper planning/story design (use paper-planning).
evo-memory
Manages persistent research memory across ideation and experimentation cycles. Maintains two stores: Ideation Memory M_I (feasible/unsuccessful directions) and Experimentation Memory M_E (reusable strategies for data processing, model training, architecture, debugging). Three evolution mechanisms: IDE (after idea-tournament), IVE (after experiment failure — classifies failures as implementation vs fundamental), ESE (after experiment success — extracts reusable strategies). Use when: updating memory after completing idea tournaments or experiment pipelines, classifying why a method failed (implementation vs fundamental failure), starting a new research cycle needing prior knowledge, user mentions 'update memory', 'classify failure', 'what worked before', 'research history', 'evolution'. Do NOT use for running experiments (use experiment-pipeline), debugging experiment code (use experiment-craft), or generating ideas (use idea-tournament).
paper-navigator
End-to-end academic paper workflow: disambiguate queries, discover papers (search, citation traversal, recommendations, arXiv monitoring, trending, GitHub search), evaluate (TLDR, citations, code, SOTA), read with structured analysis (3-level strategy), and organize into literature maps or reports. Use when: finding papers, reading a paper, related work, literature survey, citation analysis, research trends, SOTA results, datasets, or literature reports. Do NOT use for writing a literature review section (use paper-writing), comparing research ideas (use idea-tournament), or planning paper structure (use paper-planning).
paper-review
Guides self-review of YOUR OWN academic paper before submission with adversarial stress-testing. Core method: 5-aspect checklist (contribution sufficiency, writing clarity, results quality, testing completeness, method design), counterintuitive protocol (reject-first simulation, delete unsupported claims, score trust, promote limitations, attack novelty), reverse-outlining, and figure/table quality checks. Use when: user wants to self-review or self-check their own paper draft before submission, stress-test their claims, prepare for reviewer criticism, or mentions 'self-review', 'check my draft', 'is my paper ready'. Do NOT use for writing a peer review of someone else's paper, and do NOT use after receiving actual reviews (use paper-rebuttal instead).
experiment-craft
Use this skill when the user wants to debug, diagnose, or systematically iterate on an experiment that already exists, or when they need a structured experiment log for tracking runs, hypotheses, failures, results, and next steps during active research. Apply it to underperforming methods, training that will not converge, regressions after a change, inconsistent results across datasets, aimless experimentation without progress, and questions like 'why doesn't this work?', 'no progress after many attempts', or 'how should I investigate this failure?'. Also use it for setting up practical experiment logging/record-keeping that supports debugging and iteration. Do not use it for designing a brand-new experiment pipeline or full experiment program (use experiment-pipeline), generating research ideas, fixing isolated coding/syntax errors, or writing retrospective summaries into research memory/notes/knowledge bases.