monitor-experiment

Monitor running experiments, check progress, collect results. Use when user says "check results", "is it done", "monitor", or wants experiment output.

Install this agent skill in your project:

```bash
npx add-skill https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep/tree/main/skills/monitor-experiment
```

SKILL.md

Monitor Experiment Results

Monitor: $ARGUMENTS

Workflow

Step 1: Check What's Running

SSH server:

```bash
ssh <server> "screen -ls"
```
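To decide which sessions to inspect in later steps, the `screen -ls` output can be turned into structured data. The following is a minimal sketch, assuming the usual `<pid>.<name>` line layout with an `(Attached)` or `(Detached)` state; the function name `parse_screen_ls` and the sample output are hypothetical, and other states that screen can report are not handled.

```python
import re

# Matches lines such as "\t12345.train_run\t(Detached)". Some screen versions
# add a timestamp column before the state, which the lazy ".*?" skips over.
SESSION_RE = re.compile(
    r"^[ \t]+(\d+)\.(\S+)[ \t]+.*?\((Attached|Detached)\)", re.MULTILINE
)

def parse_screen_ls(output: str) -> list[dict]:
    """Return one dict per session found in `screen -ls` output."""
    return [
        {"pid": int(pid), "name": name, "state": state}
        for pid, name, state in SESSION_RE.findall(output)
    ]

# Hypothetical sample output, for illustration only.
sample = (
    "There are screens on:\n"
    "\t12345.baseline\t(Detached)\n"
    "\t12399.method_a\t(01/02/2025 10:00:00 AM)\t(Attached)\n"
    "2 Sockets in /run/screen/S-root.\n"
)
sessions = parse_screen_ls(sample)
for session in sessions:
    print(session["name"], session["state"])
```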

Vast.ai instance (read ssh_host, ssh_port from vast-instances.json):

```bash
ssh -p <PORT> root@<HOST> "screen -ls"
```
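The `<PORT>` and `<HOST>` placeholders can be filled in from vast-instances.json programmatically. This is a rough sketch only: the source names the `ssh_host` and `ssh_port` fields but not the file layout, so the list-of-objects structure, the sample values, and the helper name `screen_ls_command` are all assumptions.

```python
import json
import shlex

def screen_ls_command(instance: dict) -> list[str]:
    """Build the ssh argv that lists screen sessions on one instance."""
    return [
        "ssh", "-p", str(instance["ssh_port"]),
        f"root@{instance['ssh_host']}",
        "screen -ls",
    ]

# Hypothetical file content; the real vast-instances.json layout may differ.
raw = '[{"ssh_host": "ssh4.vast.ai", "ssh_port": 12345}]'
for instance in json.loads(raw):
    argv = screen_ls_command(instance)
    # shlex.join quotes the remote command so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Building an argv list rather than a single string keeps the remote command as one argument, which avoids quoting mistakes if the list is later passed to `subprocess.run`.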

Also check vast.ai instance status:

```bash
vastai show instances
```

Modal (when gpu: modal in CLAUDE.md):

```bash
modal app list         # List running/recent apps
modal app logs <app>   # Stream logs from a running app
```

Modal apps terminate automatically when they finish, so an app that is not listed as running has already completed. Check its results with `modal volume ls <volume>` or in the local output.

Step 2: Collect Output from Each Screen

For each screen session, capture the last N lines:

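```bash
# -h includes the scrollback buffer; without it, hardcopy dumps only the visible window
ssh <server> "screen -S <name> -X hardcopy -h /tmp/screen_<name>.txt && tail -n 50 /tmp/screen_<name>.txt"
```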

If hardcopy fails, check for log files or tee output.
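Once the tail of a session has been captured, the latest iteration count can be pulled out of it to judge whether the run is still making progress. The sketch below assumes the training script prints tqdm-style `<current>/<total>` markers; that format, the sample text, and the name `latest_progress` are assumptions, and any other `number/number` text in the output (a date, for example) would also match.

```python
import re

# Hypothetical pattern: "<current>/<total>" as printed by tqdm-style bars.
PROGRESS_RE = re.compile(r"(\d+)/(\d+)")

def latest_progress(text: str):
    """Return (current, total) from the last progress marker, or None."""
    matches = PROGRESS_RE.findall(text)
    if not matches:
        return None
    current, total = map(int, matches[-1])
    return current, total

# Hypothetical captured screen output.
captured = (
    "epoch 1:  10%|#         | 100/1000 [00:30<04:30]\n"
    "epoch 1:  45%|####      | 450/1000 [02:15<02:45]\n"
)
progress = latest_progress(captured)
if progress and progress[1]:
    current, total = progress
    print(f"{current}/{total} iterations ({100 * current / total:.0f}%)")
```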

Step 3: Check for JSON Result Files

```bash
ssh <server> "ls -lt <results_dir>/*.json 2>/dev/null | head -n 20"
```

If JSON results exist, fetch and parse them:

```bash
ssh <server> "cat <results_dir>/<latest>.json"
```
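Parsing the fetched file could look like the sketch below. The source does not specify the result schema, so this assumes a flat JSON object whose numeric top-level fields are the metrics; the field names in the example and the helper `numeric_metrics` are hypothetical.

```python
import json

def numeric_metrics(json_text: str) -> dict:
    """Parse a result file and keep only top-level numeric fields."""
    data = json.loads(json_text)
    return {
        key: value
        for key, value in data.items()
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }

# Hypothetical result file content.
example = '{"experiment": "method_a", "eval_loss": 2.31, "steps": 5000, "done": true}'
metrics = numeric_metrics(example)
print(metrics)
```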

Step 3.5: Pull W&B Metrics (when `wandb: true` in CLAUDE.md)

Skip this step entirely if wandb is not set or is false in CLAUDE.md.

Pull training curves and metrics from Weights & Biases via Python API:

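```bash
# List the 10 most recent runs in the project.
# The script is passed on stdin with a quoted heredoc, so the Python quotes
# reach the remote interpreter intact (nested \" inside ssh "..." are stripped
# by the remote shell).
ssh <server> python3 - <<'PY'
import itertools
import wandb
api = wandb.Api()
# per_page is only the page size; islice limits how many runs are printed
runs = api.runs('<entity>/<project>', order='-created_at', per_page=10)
for r in itertools.islice(runs, 10):
    print(f'{r.id}  {r.state}  {r.name}  {r.summary.get("eval/loss", "N/A")}')
PY

# Pull specific metrics from a run and print the last 10 logged rows.
# scan_history returns only rows in which every requested key is defined, so
# query train/* and eval/* keys separately if they are logged at different steps.
ssh <server> python3 - <<'PY'
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
history = list(run.scan_history(keys=['train/loss', 'eval/loss', 'eval/ppl', 'train/lr'], page_size=50))
print(json.dumps(history[-10:], indent=2))
PY

# Pull run summary (final metrics)
ssh <server> python3 - <<'PY'
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
print(json.dumps(dict(run.summary), indent=2, default=str))
PY
```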

What to extract:

- Training loss curve: is it converging, diverging, or plateauing?
- Eval metrics: loss, PPL, and accuracy at the latest checkpoint.
- Learning rate: is the schedule behaving as expected?
- GPU memory: is there any OOM risk?
- Run status: running, finished, or crashed?
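The first question in this list can be answered roughly in code once the `train/loss` values have been extracted from the history rows. The following is a heuristic sketch, not a method taken from the skill: it compares the mean of the most recent window with the mean of the window before it. The function name, the window size, and the 1% tolerance are all assumptions, and any sustained rise in loss is simply labelled "diverging".

```python
import math

def classify_trend(losses, window=5, tolerance=0.01):
    """Label a loss series by comparing the last two windows of values."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "diverging"
    if len(losses) < 2 * window:
        return "too few points"
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    # Relative change between the two window means.
    change = (recent - previous) / abs(previous) if previous else 0.0
    if change > tolerance:
        return "diverging"
    if change < -tolerance:
        return "converging"
    return "plateauing"

# Hypothetical loss values, for illustration only.
print(classify_trend([3.0, 2.8, 2.6, 2.4, 2.2, 2.0, 1.9, 1.8, 1.7, 1.6]))
print(classify_trend([1.0] * 10))
```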

W&B dashboard link (include in summary for user):

https://wandb.ai/<entity>/<project>/runs/<run_id>

This gives the auto-review-loop a richer signal than screen output alone: training dynamics, loss curves, and metric trends over time.

Step 4: Summarize Results

Present results in a comparison table:

| Experiment | Metric | Delta vs Baseline | Status |
|-----------|--------|-------------------|--------|
| Baseline  | X.XX   | —                 | done   |
| Method A  | X.XX   | +Y.Y              | done   |
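A table in this shape can be generated from collected metrics. The sketch below assumes one headline metric per experiment and a row named "Baseline"; the function name `comparison_table` and the numbers are hypothetical and for illustration only.

```python
def comparison_table(results: dict, baseline: str = "Baseline") -> str:
    """Render a markdown table of metric values and deltas vs the baseline."""
    base_metric = results[baseline][0]
    lines = [
        "| Experiment | Metric | Delta vs Baseline | Status |",
        "|-----------|--------|-------------------|--------|",
    ]
    for name, (metric, status) in results.items():
        # The baseline row has no delta; other rows show a signed difference.
        delta = "n/a" if name == baseline else f"{metric - base_metric:+.2f}"
        lines.append(f"| {name} | {metric:.2f} | {delta} | {status} |")
    return "\n".join(lines)

# Hypothetical numbers, for illustration only.
table = comparison_table({
    "Baseline": (71.20, "done"),
    "Method A": (72.85, "done"),
})
print(table)
```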

Step 5: Interpret

- Compare against known baselines.
- Flag unexpected results (negative delta, NaN, divergence).
- Suggest next steps based on the findings.
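The flagging step can be sketched as a small check on each metric. This is an illustration under assumptions: the name `flag_result` and the `higher_is_better` switch are hypothetical, and only NaN or infinite values and a delta in the wrong direction are covered. Divergence has to be judged from the loss curve, not from a single number.

```python
import math

def flag_result(metric: float, baseline: float, higher_is_better: bool = True) -> list[str]:
    """Return warnings for a metric that looks wrong relative to the baseline."""
    if math.isnan(metric) or math.isinf(metric):
        return ["non-finite metric (NaN or inf)"]
    delta = metric - baseline
    # For loss-like metrics, a positive delta is the bad direction.
    worse = delta < 0 if higher_is_better else delta > 0
    return [f"worse than baseline by {abs(delta):.2f}"] if worse else []

# Hypothetical values, for illustration only.
print(flag_result(70.0, 71.2))
print(flag_result(2.1, 2.3, higher_is_better=False))
```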

Step 6: Feishu Notification (if configured)

After results are collected, check ~/.claude/feishu.json:

- Send an `experiment_done` notification containing the results summary table and the delta vs baseline.
- If the config is absent or its mode is "off", skip this step entirely (no-op).
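The skip rule can be expressed as a small guard that runs before any notification is attempted. This sketch covers only the config check, not the Feishu call itself. It assumes the file is a JSON object with a top-level `mode` field; treating a missing `mode` key or an unparseable file as disabled is also an assumption, as is the name `feishu_enabled`.

```python
import json
import tempfile
from pathlib import Path

def feishu_enabled(config_path: Path) -> bool:
    """True only if the config file exists and its mode is not "off"."""
    if not config_path.is_file():
        return False
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # Assumption: an unparseable config counts as disabled.
    if not isinstance(config, dict):
        return False
    # Assumption: a missing "mode" key is treated like "off".
    return config.get("mode", "off") != "off"

# Demonstration with a temporary file standing in for ~/.claude/feishu.json.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feishu.json"
    print(feishu_enabled(path))  # config absent
    path.write_text('{"mode": "off"}', encoding="utf-8")
    print(feishu_enabled(path))  # mode "off"
```

In real use the call would be `feishu_enabled(Path.home() / ".claude" / "feishu.json")`.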

Key Rules

- Always show raw numbers before interpretation.
- Compare against the correct baseline (same config).
- Note whether experiments are still running (check progress bars and iteration counts).
- If results look wrong, check the training logs for errors before drawing conclusions.
- Vast.ai cost awareness: when monitoring Vast.ai instances, report the running cost (hours * $/hr from vast-instances.json). If all experiments on an instance are done, remind the user to run `/vast-gpu destroy <instance_id>` to stop billing.
- Modal cost awareness: Modal scales to zero automatically, so there is no idle billing. When reporting results from Modal runs, note the actual execution time and the estimated cost (time * $/hr for the GPU tier used). No cleanup action is needed.
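The cost arithmetic in the last two rules is elapsed hours multiplied by the hourly rate. A minimal sketch follows, assuming the start time and the rate are already known; the function name, the timestamps, and the rate below are hypothetical, and the real rate would come from vast-instances.json or from the Modal GPU tier.

```python
from datetime import datetime, timezone

def running_cost(started_at: datetime, now: datetime, dollars_per_hour: float) -> float:
    """Cost so far: elapsed hours times the hourly rate, rounded to cents."""
    hours = (now - started_at).total_seconds() / 3600
    return round(hours * dollars_per_hour, 2)

# Hypothetical start time, current time, and rate.
start = datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc)
now = datetime(2025, 1, 1, 14, 30, tzinfo=timezone.utc)
cost = running_cost(start, now, dollars_per_hour=0.40)
print(f"${cost:.2f}")
```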

Maintainer

wanshuiyin Core maintainer

Source details

Full Name: wanshuiyin/Auto-claude-code-research-in-sleep
Branch: main
Path in repo: skills/monitor-experiment
License: MIT License
Topics: claude-code claude claude-code-skills mcp mcp-server llm codex gpt openai ai-tools machine-learning ai-research autonomous-agent deep-learning paper-review research-automation paper-writing aris idea-generation ml-research
