Agent skill
golden-dataset
Golden dataset lifecycle patterns for curation, versioning, quality validation, and CI integration. Use when building evaluation datasets, managing dataset versions, validating quality scores, or integrating golden tests into pipelines.
Install this agent skill to your Project
npx add-skill https://github.com/yonatangross/orchestkit/tree/main/plugins/ork/skills/golden-dataset
Metadata
- category: document-asset-creation
SKILL.md
Golden Dataset
Comprehensive patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has individual rule files in rules/ loaded on-demand.
Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Curation | 3 | HIGH | Content collection, annotation pipelines, diversity analysis |
| Management | 3 | HIGH | Versioning, backup/restore, CI/CD automation |
| Validation | 3 | CRITICAL | Quality scoring, drift detection, regression testing |
| Add Workflow | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold |
Total: 10 rules across 4 categories
Curation
Content collection, multi-agent annotation, and diversity analysis for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Collection | rules/curation-collection.md | Content type classification, quality thresholds, duplicate prevention |
| Annotation | rules/curation-annotation.md | Multi-agent pipeline, consensus aggregation, Langfuse tracing |
| Diversity | rules/curation-diversity.md | Difficulty stratification, domain coverage, balance guidelines |
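As a rough illustration of difficulty stratification, the sketch below reports which difficulty levels still fall short of the minimums listed under Key Decisions in this document. The `difficulty` field name and the lowercase level labels are assumptions, not the schema defined in the rule files.

```python
from collections import Counter

# Minimum query counts per difficulty level, copied from the Key Decisions table.
MIN_PER_DIFFICULTY = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}


def difficulty_gaps(queries: list[dict]) -> dict[str, int]:
    """Return how many more queries each under-filled difficulty level needs."""
    # Assumption: each query dict carries a lowercase "difficulty" string.
    counts = Counter(q.get("difficulty", "unknown") for q in queries)
    return {
        level: minimum - counts[level]
        for level, minimum in MIN_PER_DIFFICULTY.items()
        if counts[level] < minimum
    }


queries = (
    [{"difficulty": "trivial"}] * 3
    + [{"difficulty": "easy"}] * 3
    + [{"difficulty": "medium"}] * 4
    + [{"difficulty": "hard"}] * 1
)
print(difficulty_gaps(queries))  # {'medium': 1, 'hard': 2}
```

An empty result means every level meets its minimum.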
Management
Versioning, storage, and CI/CD automation for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Versioning | rules/management-versioning.md | JSON backup format, embedding regeneration, disaster recovery |
| Storage | rules/management-storage.md | Backup strategies, URL contract, data integrity checks |
| CI Integration | rules/management-ci.md | GitHub Actions automation, pre-deployment validation, weekly backups |
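One way the JSON backup and embedding regeneration could fit together is sketched below. The `embedding` and `title` field names, the top-level `documents` key, and the synchronous `embed` callable are all hypothetical; the actual backup format is defined in rules/management-versioning.md.

```python
import json


def to_backup_json(documents: list[dict]) -> str:
    """Serialize documents for backup, leaving embeddings out."""
    # Assumption: vectors live under an "embedding" key on each document.
    stripped = [
        {key: value for key, value in doc.items() if key != "embedding"}
        for doc in documents
    ]
    # Sorted keys and fixed indentation keep diffs small under version control.
    return json.dumps({"documents": stripped}, indent=2, sort_keys=True)


def from_backup_json(payload: str, embed) -> list[dict]:
    """Restore documents and regenerate each embedding with `embed`."""
    documents = json.loads(payload)["documents"]
    for doc in documents:
        # Assumption: embeddings are derived from the "title" field.
        doc["embedding"] = embed(doc["title"])
    return documents


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding call, used only for this example."""
    return [float(len(text))]


original = [{"id": "d1", "title": "Golden datasets", "embedding": [0.1, 0.2]}]
restored = from_backup_json(to_backup_json(original), fake_embed)
```

Passing the embedding function as a parameter keeps the restore step testable without a live embedding service.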
Validation
Quality scoring, drift detection, and regression testing for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Quality | rules/validation-quality.md | Schema validation, content quality, referential integrity |
| Drift | rules/validation-drift.md | Duplicate detection, semantic similarity, coverage gap analysis |
| Regression | rules/validation-regression.md | Difficulty distribution, pre-commit hooks, full dataset validation |
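A minimal sketch of duplicate detection by semantic similarity follows, using the block and warn thresholds from the Key Decisions table in this document. Cosine similarity over plain Python lists is an assumption made for illustration; the rule files may use a different metric or a vector store.

```python
import math

# Thresholds copied from the Key Decisions table.
BLOCK_THRESHOLD = 0.90
WARN_THRESHOLD = 0.85


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def duplicate_status(candidate: list[float], existing: list[list[float]]) -> str:
    """Classify a candidate embedding as 'block', 'warn', or 'ok'."""
    # Compare against the closest existing entry only.
    best = max((cosine_similarity(candidate, vec) for vec in existing), default=0.0)
    if best >= BLOCK_THRESHOLD:
        return "block"
    if best >= WARN_THRESHOLD:
        return "warn"
    return "ok"
```

For example, `duplicate_status([1.0, 0.0], [[7.0, 4.0]])` lands in the warn band, because the similarity is about 0.87.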
Add Workflow
Structured workflow for adding new documents to the golden dataset.
| Rule | File | Key Pattern |
|---|---|---|
| Add Document | rules/curation-add-workflow.md | 9-phase curation, parallel quality analysis, bias detection |
Quick Start Example
```python
from app.shared.services.embeddings import embed_text


async def validate_before_add(document: dict, source_url_map: dict) -> dict:
    """Pre-addition validation for golden dataset entries."""
    errors = []

    # 1. URL contract check
    if "placeholder" in document.get("source_url", ""):
        errors.append("URL must be canonical, not a placeholder")

    # 2. Content quality
    if len(document.get("title", "")) < 10:
        errors.append("Title too short (min 10 chars)")

    # 3. Tag requirements
    if len(document.get("tags", [])) < 2:
        errors.append("At least 2 domain tags required")

    return {"valid": len(errors) == 0, "errors": errors}
```
Key Decisions
| Decision | Recommendation |
|---|---|
| Backup format | JSON (version controlled, portable) |
| Embedding storage | Exclude from backup (regenerate on restore) |
| Quality threshold | >= 0.70 quality score for inclusion |
| Confidence threshold | >= 0.65 for auto-include |
| Duplicate threshold | Block at >= 0.90 similarity; warn at >= 0.85 |
| Min tags per entry | 2 domain tags |
| Min test queries | 3 per document |
| Difficulty balance | Minimum per level: Trivial 3, Easy 3, Medium 5, Hard 3 |
| CI frequency | Weekly automated backup (Sunday 2am UTC) |
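The quality and confidence thresholds above could be combined into a single gate along the following lines. How the two scores interact is not stated in this document, so the ordering of the checks and the `manual-review` outcome are assumptions for illustration only.

```python
# Thresholds copied from the Key Decisions table.
QUALITY_THRESHOLD = 0.70
CONFIDENCE_THRESHOLD = 0.65


def inclusion_decision(quality_score: float, confidence: float) -> str:
    """Return 'exclude', 'auto-include', or 'manual-review' for a candidate."""
    # Entries below the quality bar are never included.
    if quality_score < QUALITY_THRESHOLD:
        return "exclude"
    # High-confidence entries skip human review.
    if confidence >= CONFIDENCE_THRESHOLD:
        return "auto-include"
    # Assumption: good quality with low confidence goes to a human reviewer.
    return "manual-review"


print(inclusion_decision(0.82, 0.71))  # auto-include
print(inclusion_decision(0.82, 0.40))  # manual-review
print(inclusion_decision(0.55, 0.90))  # exclude
```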
Common Mistakes
- Using placeholder URLs instead of canonical source URLs
- Skipping embedding regeneration after restore
- Not validating referential integrity between documents and queries
- Over-indexing on articles (neglecting tutorials, research papers)
- Missing difficulty distribution balance in test queries
- Not running verification after backup/restore operations
- Testing restore procedures in production instead of staging
- Committing SQL dumps instead of JSON (not version-control friendly)
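Several of these mistakes can be caught mechanically. The sketch below checks referential integrity between documents and queries and enforces the minimum of 3 test queries per document from Key Decisions. The `id` and `expected_doc_ids` field names are hypothetical and should be replaced with whatever the dataset schema actually uses.

```python
from collections import Counter

# Minimum test queries per document, copied from the Key Decisions table.
MIN_QUERIES_PER_DOCUMENT = 3


def find_integrity_errors(documents: list[dict], queries: list[dict]) -> list[str]:
    """List dangling query references and documents with too few test queries."""
    doc_ids = {doc["id"] for doc in documents}
    coverage = Counter()
    errors = []

    for query in queries:
        # Assumption: each query lists the documents it should retrieve.
        for doc_id in query.get("expected_doc_ids", []):
            if doc_id in doc_ids:
                coverage[doc_id] += 1
            else:
                errors.append(
                    f"query {query['id']} references missing document {doc_id}"
                )

    for doc_id in sorted(doc_ids):
        if coverage[doc_id] < MIN_QUERIES_PER_DOCUMENT:
            errors.append(
                f"document {doc_id} has {coverage[doc_id]} test queries "
                f"(min {MIN_QUERIES_PER_DOCUMENT})"
            )
    return errors
```

Running a check like this after every backup or restore also covers the verification step that the list above warns about skipping.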
Evaluations
See test-cases.json for 9 test cases across all categories.
Related Skills
- ork:rag-retrieval - Retrieval evaluation using golden dataset
- langfuse-observability - Tracing patterns for curation workflows
- ork:testing-unit - Unit testing patterns and strategies
- ai-native-development - Embedding generation for restore
Capability Details
curation
Keywords: golden dataset, curation, content collection, annotation, quality criteria
Solves:
- Classify document content types for golden dataset
- Run multi-agent quality analysis pipelines
- Generate test queries for new documents
management
Keywords: golden dataset, backup, restore, versioning, disaster recovery
Solves:
- Backup and restore golden datasets with JSON
- Regenerate embeddings after restore
- Automate backups with CI/CD
validation
Keywords: golden dataset, validation, schema, duplicate detection, quality metrics
Solves:
- Validate entries against document schema
- Detect duplicate or near-duplicate entries
- Analyze dataset coverage and distribution gaps