Agent skill
system-profile
Profile a target (script, process, GPU, memory, interconnect) using external tools and code instrumentation. Produces structured performance reports with actionable recommendations. Use when user says "profile", "benchmark", "bottleneck", or wants performance analysis.
Install this agent skill to your Project
npx add-skill https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep/tree/main/skills/system-profile
SKILL.md
System Profile
Profile the specified target and summarize the results. Target: $ARGUMENTS
Instructions
You are a profiling assistant. Based on the user's target, choose appropriate profiling strategies, including writing instrumentation code when needed, then run profiling, analyze results, and produce a summary.
Step 1: Determine the profiling target
Parse $ARGUMENTS to understand what to profile. Examples:
- A Python script or module
- A running process (PID or service name)
- A specific function or code block
- An entire framework or system (e.g., "autogen", "vllm serving") — profile its end-to-end execution, identify bottlenecks across components
- "gpu" / "interconnect" / "memory" for focused profiling
If $ARGUMENTS is empty or unclear, ask the user.
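A first-pass triage of the argument string might look like the sketch below. The function name `classify_target`, the returned labels, and the heuristics (digits mean a PID, a `.py` suffix means a script) are illustrative assumptions, not part of the skill; anything the heuristics cannot place still needs to be judged from context.

```python
import os
import re

# Focus keywords taken from the examples above.
FOCUS_KEYWORDS = {"gpu", "interconnect", "memory"}

def classify_target(arguments: str) -> str:
    """Return a coarse target kind; 'unclear' means ask the user."""
    target = arguments.strip()
    if not target:
        return "unclear"
    if target.lower() in FOCUS_KEYWORDS:
        return "focus:" + target.lower()
    if re.fullmatch(r"\d+", target):
        return "pid"
    first = target.split()[0]
    if first.endswith(".py") or os.path.isfile(first):
        return "script"
    # Service names, functions, and frameworks cannot be told apart by
    # string shape alone, so leave them for contextual judgment.
    return "other"
```

For example, `classify_target("12345")` returns `"pid"` and `classify_target("vllm serving")` returns `"other"`.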
Step 2: Choose profiling methods
Select from external tools and/or code instrumentation as appropriate. Don't limit yourself to the examples below — use whatever makes sense for the target.
External tools (check availability first):
- CPU: `cProfile`, `py-spy`, `line_profiler`, `perf stat`, `/usr/bin/time -v`
- Memory: `tracemalloc`, `memory_profiler`, `memray`
- GPU: `nvidia-smi`, `nvidia-smi dmon`, `nvitop`, `torch.profiler`, `nsys`
- Interconnect: `nvidia-smi topo -m`, `nvidia-smi nvlink`, `NCCL_DEBUG=INFO`
- System: `strace -c`, `iostat`, `vmstat`
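One way to do the availability check for the command-line tools is a `PATH` lookup with `shutil.which`. This is a minimal sketch; the `CLI_TOOLS` list and the `available_tools` helper are hypothetical names chosen for illustration.

```python
import shutil

# Command-line tools named in the list above. Python modules such as
# cProfile or tracemalloc are imported rather than looked up on PATH.
CLI_TOOLS = ["py-spy", "perf", "nvidia-smi", "nvitop", "nsys",
             "strace", "iostat", "vmstat"]

def available_tools(names=CLI_TOOLS):
    """Map each tool name to True if an executable is found on PATH."""
    return {name: shutil.which(name) is not None for name in names}

if __name__ == "__main__":
    for name, found in available_tools().items():
        print(f"{name:12s} {'found' if found else 'missing'}")
```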
Code instrumentation — when external tools are insufficient, write and insert profiling code into the target. Typical scenarios:
- Timing specific code blocks (wall time vs CPU time)
- Measuring CPU-GPU or GPU-GPU transfer size, frequency, and bandwidth
- Tracking memory allocation across CPU and GPU to detect redundancy
- Wrapping NCCL collectives to measure latency and throughput
- Adding CUDA event timing around kernels
Design the instrumentation based on what you observe in the code — don't use a fixed template.
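As one illustration of the first scenario (wall time vs CPU time), a small context manager can record both clocks around a block. This is a sketch rather than a template to paste in; the `timed` helper and the record fields are assumptions made for the example.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Append wall-clock and process CPU time for the enclosed block."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        results.append({
            "label": label,
            "wall_s": time.perf_counter() - wall_start,
            "cpu_s": time.process_time() - cpu_start,
        })

results = []
with timed("sleep", results):
    time.sleep(0.05)  # wall time passes, almost no CPU time is used
```

A large gap between `wall_s` and `cpu_s` suggests the block is waiting (I/O, locks, a device) rather than computing.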
Step 3: Key dimensions to investigate
Depending on the target, focus on some or all of these:
CPU overhead
- Context switching (voluntary / involuntary)
- CPU utilization: ratio of CPU time to wall time
- Per-function execution time hotspots
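The first two items can be read from the standard library's `resource.getrusage` on Unix-like systems. The sketch below assumes the target is a callable that can be run in-process; `cpu_overhead` and its field names are hypothetical.

```python
import resource
import time

def cpu_overhead(fn, *args, **kwargs):
    """Run fn and report CPU/wall ratio and context-switch deltas (Unix only)."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    wall_start = time.perf_counter()
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    after = resource.getrusage(resource.RUSAGE_SELF)
    # User plus system CPU time consumed by this process during the call.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return result, {
        "wall_s": wall,
        "cpu_s": cpu,
        "cpu_utilization": cpu / wall if wall > 0 else 0.0,
        "voluntary_ctx_switches": after.ru_nvcsw - before.ru_nvcsw,
        "involuntary_ctx_switches": after.ru_nivcsw - before.ru_nivcsw,
    }
```

For a process that is already running, `/usr/bin/time -v` or `perf stat` report comparable counters without code changes.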
Memory overhead
- CPU and GPU memory usage (allocated vs reserved vs peak)
- Redundant replication: same data living on both CPU and GPU
- Per-device allocation balance in multi-GPU setups
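For the host side, `tracemalloc` gives current and peak Python-heap usage. The sketch below covers only that; GPU numbers would have to come from the framework's own counters (in PyTorch, for instance, `torch.cuda.memory_allocated`, `torch.cuda.memory_reserved`, and `torch.cuda.max_memory_allocated`). The `measure_peak` helper is a hypothetical name.

```python
import tracemalloc

def measure_peak(fn, *args, **kwargs):
    """Run fn under tracemalloc and return (result, current/peak bytes)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, {"current_bytes": current, "peak_bytes": peak}

data, mem = measure_peak(lambda: bytearray(1_000_000))
```

Note that `tracemalloc` only sees allocations made through Python's allocators, so memory held by native libraries may not appear.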
Interconnect & communication
- CPU-GPU transfer: frequency, per-transfer size, total volume, bandwidth achieved
- GPU-GPU transfer: P2P bandwidth, NVLink vs PCIe topology impact
- NCCL collectives: operation type, message size distribution, latency
- Communication-to-computation ratio
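Once transfers have been logged, the metrics above reduce to simple arithmetic. The sketch below assumes each logged transfer has a kind, a byte count, and a duration; the `Transfer` record, the kind labels, and `summarize_transfers` are illustrative, and the numbers in the usage example are made up.

```python
from dataclasses import dataclass

@dataclass
class Transfer:
    kind: str       # e.g. "H2D", "D2H", "P2P" (labels are illustrative)
    nbytes: int
    seconds: float

def summarize_transfers(transfers, compute_seconds):
    """Aggregate per-kind frequency, size, volume, and achieved bandwidth."""
    by_kind = {}
    for t in transfers:
        s = by_kind.setdefault(t.kind, {"count": 0, "bytes": 0, "seconds": 0.0})
        s["count"] += 1
        s["bytes"] += t.nbytes
        s["seconds"] += t.seconds
    for s in by_kind.values():
        s["avg_bytes"] = s["bytes"] / s["count"]
        s["gb_per_s"] = s["bytes"] / s["seconds"] / 1e9 if s["seconds"] > 0 else 0.0
    comm_seconds = sum(s["seconds"] for s in by_kind.values())
    ratio = comm_seconds / compute_seconds if compute_seconds > 0 else float("inf")
    return by_kind, ratio

# Made-up sample log, only to show the shape of the output.
log = [Transfer("H2D", 1_000_000_000, 0.5),
       Transfer("H2D", 1_000_000_000, 0.5),
       Transfer("P2P", 500_000_000, 0.25)]
by_kind, comm_to_compute = summarize_transfers(log, compute_seconds=2.5)
```

Comparing `gb_per_s` against the link's nominal peak (PCIe or NVLink, per `nvidia-smi topo -m`) shows how much of the interconnect is actually being used.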
GPU compute
- SM utilization, kernel launch overhead
- Memory bandwidth utilization vs peak
Step 4: Instrumentation guidelines
When inserting code into the target:
- Read and understand the target code first
- Prefer wrapping (decorator, context manager, standalone runner) over inline edits
- If inline edits are necessary, mark them clearly (e.g., `# [PROFILE]` comments)
- Minimize observer effect: don't instrument tight inner loops; sample instead
- Collect results into a structured log, don't scatter print statements
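The guidelines above (wrap rather than edit inline, sample hot paths, log to one structure) can be combined in a decorator. This is a sketch under the assumption that timing every Nth call is an acceptable form of sampling; `profiled`, `PROFILE_LOG`, and `step` are hypothetical names.

```python
import functools
import json
import time

PROFILE_LOG = []  # structured records instead of scattered prints

def profiled(sample_every=1):
    """Time every `sample_every`-th call of the wrapped function."""
    def decorator(fn):
        calls = 0

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            nonlocal calls
            calls += 1
            if calls % sample_every != 0:
                return fn(*args, **kwargs)  # unsampled call: no timing overhead
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                PROFILE_LOG.append({
                    "fn": fn.__name__,
                    "call": calls,
                    "wall_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator

@profiled(sample_every=10)  # [PROFILE] remove when done
def step(x):
    return x * 2

outputs = [step(i) for i in range(100)]
report = json.dumps(PROFILE_LOG)  # one structured artifact for the report
```

Removing the decorator line restores the original function, which keeps the changelog and cleanup simple.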
Step 5: Run profiling
- Check available tools and hardware topology
- Run the chosen methods, capture all output
- Save artifacts (flamegraphs, traces, logs) to `./profile_output/`
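For a Python callable, running `cProfile` and saving both the raw stats and a readable summary could look like the sketch below. The helper name `profile_to_dir` and the file names are assumptions; only the `./profile_output/` directory comes from the step above.

```python
import cProfile
import io
import pstats
from pathlib import Path

def profile_to_dir(fn, out_dir="./profile_output", name="cpu"):
    """Profile fn with cProfile and write artifacts into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    profiler = cProfile.Profile()
    result = profiler.runcall(fn)

    # Raw stats, loadable later with pstats or a flamegraph viewer.
    stats_path = out / f"{name}.prof"
    profiler.dump_stats(str(stats_path))

    # Human-readable top-10 by cumulative time.
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    text_path = out / f"{name}_top10.txt"
    text_path.write_text(buf.getvalue())
    return result, stats_path, text_path
```

Calling `profile_to_dir(main)` on a hypothetical entry point `main` leaves `cpu.prof` and `cpu_top10.txt` in `./profile_output/`.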
Step 6: Produce the report
Part A — Profiling results (structured tables by dimension, as applicable):
- CPU overhead table
- Memory overhead table (with redundancy column)
- Interconnect table (transfer type / frequency / size / latency / bandwidth)
- Hotspots / bottleneck identification
- Actionable recommendations ranked by expected impact
Part B — Instrumentation changelog (MANDATORY): List every file that was modified or created for profiling purposes:
| File | Change type | What was added/modified | Line(s) |
|---|---|---|---|
| ... | modified | ... | ... |
| ... | created | ... | — |
This allows the user to review and revert all instrumentation changes. Offer to clean up (remove all instrumentation) when the user is done.
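If changes are tracked as they are made, the Part B table can be generated rather than written by hand. The sketch below assumes each change is recorded as a small dict; `changelog_table` and the keys `file`, `change`, `what`, and `lines` are hypothetical, and the entries in the usage example are placeholders.

```python
def changelog_table(entries):
    """Render instrumentation changes as the Part B markdown table."""
    lines = [
        "| File | Change type | What was added/modified | Line(s) |",
        "|---|---|---|---|",
    ]
    for e in entries:
        # Newly created files have no meaningful line range.
        line_range = e.get("lines") or "-"
        lines.append(f"| {e['file']} | {e['change']} | {e['what']} | {line_range} |")
    return "\n".join(lines)

# Placeholder entries, only to show the expected input shape.
table = changelog_table([
    {"file": "train.py", "change": "modified",
     "what": "timing decorator on step()", "lines": "12-14"},
    {"file": "profile_output/cpu.prof", "change": "created",
     "what": "cProfile stats"},
])
```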