Agent skill
funsloth-check
Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.
Install this agent skill to your Project
npx add-skill https://github.com/chrisvoncsefalvay/funsloth/tree/main/skills/funsloth-check
SKILL.md
Dataset Validation for Unsloth Fine-tuning
Validate datasets before fine-tuning with Unsloth.
Quick Start
For automated validation, use the script:
python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16
Workflow
1. Get Dataset Source
Ask for: HF dataset ID (e.g., mlabonne/FineTome-100k) or local path (e.g., ./data.jsonl)
2. Load and Detect Format
Auto-detect format from structure. See DATA_FORMATS.md for details.
| Format | Detection | Key Fields |
|---|---|---|
| Raw | text only |
text |
| Alpaca | instruction + output |
instruction, output |
| ShareGPT | conversations array |
from, value |
| ChatML | messages array |
role, content |
3. Validate Schema
Check required fields exist. Report issues with fix suggestions.
4. Show Samples
Display 2-3 examples for visual verification.
5. Token Analysis
Report statistics: total tokens, min/max/mean/median sequence length.
Flag concerns:
- Sequences > 4096 tokens
- Sequences < 10 tokens
6. Chinchilla Analysis
Ask for target model and LoRA rank, then calculate:
| Chinchilla Fraction | Interpretation |
|---|---|
| < 0.5x | Dataset may be too small |
| 0.5x - 2.0x | Good range |
| > 2.0x | Large dataset, may take longer |
7. Recommendations
Based on analysis, suggest:
standardize_sharegpt()for ShareGPT data- Sequence length adjustments
- Learning rate for small datasets
8. Optional: HF Upload
Offer to upload local datasets to Hub.
9. Handoff
Pass context to funsloth-train:
dataset_id: "mlabonne/FineTome-100k"
format_type: "sharegpt"
total_tokens: 15000000
target_model: "llama-3.1-8b"
use_lora: true
lora_rank: 16
chinchilla_fraction: 1.2
Bundled Resources
- scripts/validate_dataset.py - Automated validation script
- DATA_FORMATS.md - Dataset format reference
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
funsloth-upload
Generate comprehensive model cards and upload fine-tuned models to Hugging Face Hub with professional documentation
funsloth-train
Generate Unsloth training notebooks and scripts. Use when the user wants to create a training notebook, configure fine-tuning parameters, or set up SFT/DPO/GRPO training.
funsloth-hfjobs
Training manager for Hugging Face Jobs - launch fine-tuning on HF cloud GPUs with optional WandB monitoring
funsloth-local
Training manager for local GPU training - validate CUDA, manage GPU selection, monitor progress, handle checkpoints
funsloth-runpod
Training manager for RunPod GPU instances - configure pods, launch training, monitor progress, retrieve checkpoints
edit-article
Edit and improve articles by restructuring sections, improving clarity, and tightening prose. Use when user wants to edit, revise, or improve an article draft.
Didn't find tool you were looking for?