Agent skill
mojo-simd-optimize
Apply SIMD optimizations to Mojo code for parallel computation. Use when optimizing performance-critical tensor and array operations.
Install this agent skill into your project:
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/mojo-simd-optimize
SKILL.md
SIMD Optimization Skill
Parallelize tensor and array operations using SIMD.
When to Use
- Optimizing tensor operations
- Vectorizing element-wise computations
- Performance-critical loops (>1000 elements)
- Benchmark results show optimization potential
Quick Reference
```mojo
from sys.info import simdwidthof

alias width = simdwidthof[DType.float32]()

# SIMD vector add (assumes size is a multiple of width)
for i in range(0, size, width):
    result.store(i, a.load[width](i) + b.load[width](i))
```
Workflow
- **Identify bottleneck** - Profile code to find hot loops
- **Get SIMD width** - Use `simdwidthof[dtype]()`
- **Vectorize loop** - Process `width` elements per iteration
- **Handle remainder** - Process leftover elements with a scalar loop
- **Benchmark** - Verify performance improvement (4x-8x expected)
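The steps above can be sketched end to end. This is a minimal example, assuming `Float32` data exposed through `UnsafePointer`; the exact `load`/`store` signatures vary between Mojo versions, so treat it as a pattern rather than copy-paste code:

```mojo
from sys.info import simdwidthof
from memory import UnsafePointer

fn simd_add(result: UnsafePointer[Float32],
            a: UnsafePointer[Float32],
            b: UnsafePointer[Float32],
            size: Int):
    # Step 2: compile-time SIMD width for this dtype on this CPU.
    alias width = simdwidthof[DType.float32]()

    # Step 3: vectorized main loop, `width` elements per iteration.
    var i = 0
    while i + width <= size:
        result.store(i, a.load[width=width](i) + b.load[width=width](i))
        i += width

    # Step 4: scalar remainder loop for the last size % width elements.
    while i < size:
        result[i] = a[i] + b[i]
        i += 1
```

The `i + width <= size` bound on the main loop is what keeps the vector loads in bounds; the trailing scalar loop then covers whatever is left.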
Mojo-Specific Notes
- SIMD width varies by CPU and dtype (usually 8-16 for float32)
- Always handle remainder elements with scalar loop
- Prefer `alias` for compile-time SIMD width constants
- Test on target hardware - SIMD width is platform-specific
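As an alternative to hand-written remainder handling, Mojo's standard library offers a `vectorize` helper in the `algorithm` module that calls a parametric closure with the full width for main-loop chunks and a smaller width for the tail. A hedged sketch (the closure and `vectorize` signatures may differ across Mojo versions):

```mojo
from algorithm import vectorize
from sys.info import simdwidthof
from memory import UnsafePointer

fn simd_add(result: UnsafePointer[Float32],
            a: UnsafePointer[Float32],
            b: UnsafePointer[Float32],
            size: Int):
    alias width = simdwidthof[DType.float32]()

    @parameter
    fn add[w: Int](i: Int):
        # Invoked with w == width for full chunks and a narrower w for the tail.
        result.store(i, a.load[width=w](i) + b.load[width=w](i))

    vectorize[add, width](size)
```

This removes the most common source of out-of-bounds bugs (a forgotten remainder loop) at the cost of a slightly less explicit control flow.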
Error Handling
| Error | Cause | Solution |
|---|---|---|
| Out of bounds | Remainder not handled | Add scalar remainder loop |
| No speedup | Wrong SIMD width | Use `simdwidthof[dtype]()` |
| Compilation fails | Type mismatch | Check load/store types match |
| Segfault | Misaligned access | Ensure stride is correct |
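To confirm that a speedup is real (the "No speedup" row above), measure both versions with Mojo's `benchmark` module rather than eyeballing wall-clock time. A hedged sketch, assuming a hypothetical zero-argument wrapper `run_simd_add` around the kernel; the `benchmark.run` API and report methods may differ by Mojo version:

```mojo
from benchmark import benchmark

fn run_simd_add():
    # Hypothetical wrapper: allocate buffers and call the SIMD kernel here.
    pass

fn main():
    # Repeatedly times run_simd_add and aggregates the results.
    var report = benchmark.run[run_simd_add]()
    report.print()
```

Compare the mean times of the scalar and SIMD variants on the target hardware; a 4x-8x improvement is the usual range for float32 element-wise ops.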
References
- `.claude/shared/mojo-guidelines.md` - SIMD patterns section
- Mojo manual: SIMD documentation