Agent skill
monitor-experiment
Monitor running experiments, check progress, collect results. Use when user says "check results", "is it done", "monitor", or wants experiment output.
Install this agent skill to your Project
npx add-skill https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep/tree/main/skills/monitor-experiment
SKILL.md
Monitor Experiment Results
Monitor: $ARGUMENTS
Workflow
Step 1: Check What's Running
SSH server:
ssh <server> "screen -ls"
Vast.ai instance (read ssh_host, ssh_port from vast-instances.json):
ssh -p <PORT> root@<HOST> "screen -ls"
Also check vast.ai instance status:
vastai show instances
Modal (when gpu: modal in CLAUDE.md):
modal app list # List running/recent apps
modal app logs <app> # Stream logs from a running app
Modal apps auto-terminate when done — if it's not in the list, it already finished. Check results via modal volume ls <volume> or local output.
Step 2: Collect Output from Each Screen
For each screen session, capture the last N lines:
ssh <server> "screen -S <name> -X hardcopy /tmp/screen_<name>.txt && tail -50 /tmp/screen_<name>.txt"
If hardcopy fails, check for log files or tee output.
Step 3: Check for JSON Result Files
ssh <server> "ls -lt <results_dir>/*.json 2>/dev/null | head -20"
If JSON results exist, fetch and parse them:
ssh <server> "cat <results_dir>/<latest>.json"
Step 3.5: Pull W&B Metrics (when wandb: true in CLAUDE.md)
Skip this step entirely if wandb is not set or is false in CLAUDE.md.
Pull training curves and metrics from Weights & Biases via Python API:
# List recent runs in the project
ssh <server> "python3 -c \"
import wandb
api = wandb.Api()
runs = api.runs('<entity>/<project>', per_page=10)
for r in runs:
print(f'{r.id} {r.state} {r.name} {r.summary.get(\"eval/loss\", \"N/A\")}')
\""
# Pull specific metrics from a run (last 50 steps)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
history = list(run.scan_history(keys=['train/loss', 'eval/loss', 'eval/ppl', 'train/lr'], page_size=50))
print(json.dumps(history[-10:], indent=2))
\""
# Pull run summary (final metrics)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
print(json.dumps(dict(run.summary), indent=2, default=str))
\""
What to extract:
- Training loss curve — is it converging? diverging? plateauing?
- Eval metrics — loss, PPL, accuracy at latest checkpoint
- Learning rate — is the schedule behaving as expected?
- GPU memory — any OOM risk?
- Run status — running / finished / crashed?
W&B dashboard link (include in summary for user):
https://wandb.ai/<entity>/<project>/runs/<run_id>
This gives the auto-review-loop richer signal than just screen output — training dynamics, loss curves, and metric trends over time.
Step 4: Summarize Results
Present results in a comparison table:
| Experiment | Metric | Delta vs Baseline | Status |
|-----------|--------|-------------------|--------|
| Baseline | X.XX | — | done |
| Method A | X.XX | +Y.Y | done |
Step 5: Interpret
- Compare against known baselines
- Flag unexpected results (negative delta, NaN, divergence)
- Suggest next steps based on findings
Step 6: Feishu Notification (if configured)
After results are collected, check ~/.claude/feishu.json:
- Send
experiment_donenotification: results summary table, delta vs baseline - If config absent or mode
"off": skip entirely (no-op)
Key Rules
- Always show raw numbers before interpretation
- Compare against the correct baseline (same config)
- Note if experiments are still running (check progress bars, iteration counts)
- If results look wrong, check training logs for errors before concluding
- Vast.ai cost awareness: When monitoring vast.ai instances, report the running cost (hours * $/hr from
vast-instances.json). If all experiments on an instance are done, remind the user to run/vast-gpu destroy <instance_id>to stop billing - Modal cost awareness: Modal auto-scales to zero — no idle billing. When reporting results from Modal runs, note the actual execution time and estimated cost (time * $/hr from the GPU tier used). No cleanup action needed
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
ablation-planner
Use when main results pass result-to-claim (claim_supported=yes or partial) and ablation studies are needed for paper submission. Codex designs ablations from a reviewer's perspective, CC reviews feasibility and implements.
paper-plan
Generate a structured paper outline from review conclusions and experiment results. Use when user says "写大纲", "paper outline", "plan the paper", "论文规划", or wants to create a paper plan before writing.
idea-discovery-robot
Workflow 1 adaptation for robotics and embodied AI. Orchestrates robotics-aware literature survey, idea generation, novelty check, and critical review to go from a broad robotics direction to benchmark-grounded, simulation-first ideas. Use when user says "robotics idea discovery", "机器人找idea", "embodied AI idea", "机器人方向探索", "sim2real 选题", or wants ideas for manipulation, locomotion, navigation, drones, humanoids, or general robot learning.
training-check
Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.
paper-plan
Generate a structured paper outline from review conclusions and experiment results. Use when user says "写大纲", "paper outline", "plan the paper", "论文规划", or wants to create a paper plan before writing.
idea-discovery-robot
Workflow 1 adaptation for robotics and embodied AI. Orchestrates robotics-aware literature survey, idea generation, novelty check, and critical review to go from a broad robotics direction to benchmark-grounded, simulation-first ideas. Use when user says \"robotics idea discovery\", \"机器人找idea\", \"embodied AI idea\", \"机器人方向探索\", \"sim2real 选题\", or wants ideas for manipulation, locomotion, navigation, drones, humanoids, or general robot learning.
Didn't find tool you were looking for?