Agent skill
ai-multimodal
Process and generate multimedia content using Google Gemini API for better vision capabilities. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (better image analysis than Claude models, captioning, reasoning, object detection, design extraction, OCR, visual Q&A, segmentation, handle multiple images), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image with Imagen 4, editing, composition, refinement), generate videos (text-to-video with Veo 3, 8-second clips with native audio). Use when working with audio/video files, analyzing images or screenshots (instead of default vision capabilities of Claude, only fallback to Claude's vision capabilities if needed), processing PDF documents, extracting structured data from media, creating images/videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.
Install this agent skill to your Project
npx add-skill https://github.com/binjuhor/shadcn-lar/tree/main/.claude/skills/ai-multimodal
SKILL.md
AI Multimodal
Process audio, images, videos, documents, and generate images/videos using Google Gemini's multimodal API.
Setup
export GEMINI_API_KEY="your-key" # Get from https://aistudio.google.com/apikey
pip install google-genai python-dotenv pillow
Quick Start
Verify setup: python scripts/check_setup.py
Analyze media: python scripts/gemini_batch_process.py --files <file> --task <analyze|transcribe|extract>
- TIP: When you're asked to analyze an image, check if
geminicommand is available, then use"<prompt to analyze image>" | gemini -y -m gemini-2.5-flashcommand. Ifgeminicommand is not available, usepython scripts/gemini_batch_process.py --files <file> --task analyzecommand. Generate content:python scripts/gemini_batch_process.py --task <generate|generate-video> --prompt "description"
Stdin support: You can pipe files directly via stdin (auto-detects PNG/JPG/PDF/WAV/MP3).
cat image.png | python scripts/gemini_batch_process.py --task analyze --prompt "Describe this"python scripts/gemini_batch_process.py --files image.png --task analyze(traditional)
Models
- Image generation:
imagen-4.0-generate-001(standard),imagen-4.0-ultra-generate-001(quality),imagen-4.0-fast-generate-001(speed) - Video generation:
veo-3.1-generate-preview(8s clips with audio) - Analysis:
gemini-2.5-flash(recommended),gemini-2.5-pro(advanced)
Scripts
gemini_batch_process.py: CLI orchestrator fortranscribe|analyze|extract|generate|generate-videothat auto-resolves API keys, picks sensible default models per task, streams files inline vs File API, and saves structured outputs (text/JSON/CSV/markdown plus generated assets) for Imagen 4 + Veo workflows.media_optimizer.py: ffmpeg/Pillow-based preflight tool that compresses/resizes/converts audio, image, and video inputs, enforces target sizes/bitrates, splits long clips into hour chunks, and batch-processes directories so media stays within Gemini limits.document_converter.py: Gemini-powered converter that uploads PDFs/images/Office docs, applies a markdown-preserving prompt, batches multiple files, auto-names outputs underdocs/assets, and exposes CLI flags for model, prompt, auto-file naming, and verbose logging.check_setup.py: Interactive readiness checker that verifies directory layout, centralized env resolver, required Python deps, and GEMINI_API_KEY availability/format, then performs a live Gemini API call and prints remediation instructions if anything fails.
Use --help for options.
References
Load for detailed guidance:
| Topic | File | Description |
|---|---|---|
| Audio | references/audio-processing.md |
Audio formats and limits, transcription (timestamps, speakers, segments), non-speech analysis, File API vs inline input, TTS models, best practices, cost and token math, and concrete meeting/podcast/interview recipes. |
| Images | references/vision-understanding.md |
Vision capabilities overview, supported formats and models, captioning/classification/VQA, detection and segmentation, OCR and document reading, multi-image workflows, structured JSON output, token costs, best practices, and common product/screenshot/chart/scene use cases. |
| Image Gen | references/image-generation.md |
Imagen 4 and Gemini image model overview, generate_images vs generate_content APIs, aspect ratios and costs, text/image/both modalities, editing and composition, style and quality control, safety settings, best practices, troubleshooting, and common marketing/concept-art/UI scenarios. |
| Video | references/video-analysis.md |
Video analysis capabilities and supported formats, model/context choices, local/inline/YouTube inputs, clipping and FPS control, multi-video comparison, temporal Q&A and scene detection, transcription with visual context, token and cost guidance, and optimization/best-practice patterns. |
| Video Gen | references/video-generation.md |
Veo model matrix, text-to-video and image-to-video quick start, multi-reference and extension flows, camera and timing control, configuration (resolution, aspect, audio, safety), prompt design patterns, performance tips, limitations, troubleshooting, and cost estimates. |
Limits
Formats: Audio (WAV/MP3/AAC, 9.5h), Images (PNG/JPEG/WEBP, 3.6k), Video (MP4/MOV, 6h), PDF (1k pages) Size: 20MB inline, 2GB File API
Resources
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
Debugging
Systematic debugging framework ensuring root cause investigation before fixes. Includes four-phase debugging process, backward call stack tracing, multi-layer validation, and verification protocols. Use when encountering bugs, test failures, unexpected behavior, performance issues, or before claiming work complete. Prevents random fixes, masks over symptoms, and false completion claims.
frontend-design-pro
Creates jaw-dropping, production-ready frontend interfaces AND delivers perfectly matched real photos (Unsplash/Pexels direct links) OR flawless custom image-generation prompts for hero images, backgrounds, and illustrations. Zero AI slop, zero fake URLs.
ui-ux-pro-max
Frontend UI/UX design intelligence - activate FIRST when user requests beautiful, stunning, gorgeous, or aesthetic interfaces. The primary skill for design decisions before implementation. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check frontend UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient.
mcp-builder
Guide for creating high-quality MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. Use when building MCP servers to integrate external APIs or services, whether in Python (FastMCP) or Node/TypeScript (MCP SDK).
planning
Use when you need to plan technical solutions that are scalable, secure, and maintainable.
Problem-Solving Techniques
Apply systematic problem-solving techniques for complexity spirals (simplification cascades), innovation blocks (collision-zone thinking), recurring patterns (meta-pattern recognition), assumption constraints (inversion exercise), scale uncertainty (scale game), and dispatch when stuck. Techniques derived from Microsoft Amplifier project patterns adapted for immediate application.
Didn't find tool you were looking for?