Agent skill

audio-generation

Guide to audio generation and understanding in MassGen. Covers text-to-speech, music, sound effects, and audio understanding across ElevenLabs and OpenAI backends.

Install this agent skill to your Project

npx add-skill https://github.com/massgen/MassGen/tree/main/massgen/skills/audio-generation

SKILL.md

Audio Generation

Generate audio using generate_media with mode="audio". Supports speech (TTS), music, and sound effects. ElevenLabs is preferred when available, with OpenAI as fallback.

Quick Start

```python
# Text-to-speech (auto-selects ElevenLabs if key available)
generate_media(prompt="Hello, welcome to our presentation!", mode="audio")

# With specific voice
generate_media(prompt="Hello!", mode="audio", voice="Rachel")

# Music generation (ElevenLabs only)
generate_media(prompt="Upbeat jazz piano with soft drums", mode="audio",
               audio_type="music", duration=30)

# Sound effects (ElevenLabs only)
generate_media(prompt="Thunder rolling across a mountain valley", mode="audio",
               audio_type="sound_effect", duration=5)
```

Audio Types

Type	Backends	Description
`"speech"` (default)	ElevenLabs, OpenAI	Text-to-speech with voice selection
`"music"`	ElevenLabs only	Music generation from text prompt
`"sound_effect"`	ElevenLabs only	Sound effect generation
`"voice_conversion"`	ElevenLabs only	Change voice of existing audio (speech-to-speech)
`"audio_isolation"`	ElevenLabs only	Remove background noise, isolate vocals
`"voice_design"`	ElevenLabs only	Create a new synthetic voice from text description
`"voice_clone"`	ElevenLabs only	Clone a voice from audio samples
`"dubbing"`	ElevenLabs only	Translate and dub audio to another language

Backend Comparison

Backend	Default Model	Supports	API Key
ElevenLabs (priority 1)	`eleven_multilingual_v2`	Speech, music, SFX, and the other ElevenLabs-only types listed under Audio Types	`ELEVENLABS_API_KEY`
OpenAI (priority 2)	`gpt-4o-mini-tts`	Speech only	`OPENAI_API_KEY`

If ElevenLabs TTS fails, the system automatically falls back to OpenAI TTS. This fallback applies to speech only, since every other audio type is ElevenLabs-only.
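The priority and fallback rules above can be sketched as a small selection function. This is an illustrative sketch only: `pick_audio_backends`, `SUPPORT`, and `KEY_ENV` are hypothetical names that restate the tables in this guide, not part of MassGen, and the real selection logic may differ.

```python
import os

# Hypothetical restatement of the Audio Types table: backends that can handle
# each audio type, in priority order (ElevenLabs first, then OpenAI).
SUPPORT = {
    "speech": ["elevenlabs", "openai"],
    "music": ["elevenlabs"],
    "sound_effect": ["elevenlabs"],
    "voice_conversion": ["elevenlabs"],
    "audio_isolation": ["elevenlabs"],
    "voice_design": ["elevenlabs"],
    "voice_clone": ["elevenlabs"],
    "dubbing": ["elevenlabs"],
}

# Environment variable that must be set for each backend (from Backend Comparison).
KEY_ENV = {"elevenlabs": "ELEVENLABS_API_KEY", "openai": "OPENAI_API_KEY"}


def pick_audio_backends(audio_type="speech", env=None):
    """Return the usable backends for audio_type, in the order they would be tried."""
    env = os.environ if env is None else env
    if audio_type not in SUPPORT:
        raise ValueError(f"unknown audio_type: {audio_type!r}")
    # Keep only backends whose API key is present and non-empty.
    return [name for name in SUPPORT[audio_type] if env.get(KEY_ENV[name])]
```

With both keys set, `pick_audio_backends("speech")` lists ElevenLabs first and OpenAI second. With only `OPENAI_API_KEY` set, a `"music"` request returns an empty list, which reflects that music has no fallback backend.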

Key Parameters

Parameter	Description	Example
`prompt`	Text to speak (speech) or description (music/SFX)	`"Hello world!"`
`voice`	Voice name or ID	`"Rachel"`, `"nova"`, `"alloy"`
`audio_type`	Type of audio	`"speech"`, `"music"`, `"sound_effect"`
`duration`	Length in seconds (music/SFX only)	`30`
`instructions`	Speaking style (OpenAI `gpt-4o-mini-tts` only)	`"warm, reflective tone"`
`audio_format`	Output format	`"mp3"`, `"wav"`, `"opus"`
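Two rows in the table carry restrictions that are easy to miss: `duration` applies to music and sound effects only, and `instructions` is used only by OpenAI `gpt-4o-mini-tts`. A minimal pre-flight check might look like the sketch below. `check_audio_params` is a hypothetical helper, not a MassGen function, and the `"elevenlabs"` value for `backend_type` is an assumption (this guide only shows `"openai"`).

```python
def check_audio_params(audio_type="speech", duration=None, instructions=None,
                       backend_type=None):
    """Return warnings for parameter combinations the Key Parameters table rules out."""
    warnings = []
    # duration is documented for music and sound effects only.
    if duration is not None and audio_type not in ("music", "sound_effect"):
        warnings.append("duration applies to music/sound_effect only")
    # instructions is documented for OpenAI gpt-4o-mini-tts only.
    # Assumption: "elevenlabs" is the backend_type value for ElevenLabs.
    if instructions is not None and backend_type == "elevenlabs":
        warnings.append("instructions is only used by OpenAI gpt-4o-mini-tts")
    return warnings
```

An empty list means none of these two documented restrictions was violated; it does not guarantee the call will succeed.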

Voice Quick Reference

ElevenLabs (top voices):

Voice	Character
Rachel	Warm, conversational female
Sarah	Clear, professional female
Josh	Friendly male
Adam	Deep, authoritative male
Emily	Bright, energetic female

OpenAI voices: alloy, echo, fable, onyx, nova, shimmer, coral, sage
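Because voice names are backend-specific, it can help to check which backend a name belongs to before calling `generate_media`. The sketch below uses only the voices listed in this quick reference. `backend_for_voice` is a hypothetical helper, and a `None` result does not mean the voice is invalid, since `voice` also accepts IDs and the full ElevenLabs catalog is larger.

```python
# Voices from the quick reference above, lowercased for case-insensitive lookup.
ELEVENLABS_VOICES = {"rachel", "sarah", "josh", "adam", "emily"}
OPENAI_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer", "coral", "sage"}


def backend_for_voice(voice):
    """Guess the backend for a voice name; return None if it is not in either list."""
    name = voice.strip().lower()
    if name in OPENAI_VOICES:
        return "openai"
    if name in ELEVENLABS_VOICES:
        return "elevenlabs"
    return None
```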

Important: prompt vs instructions

For speech, prompt is the literal text to speak. Style guidance goes in instructions:

```python
# CORRECT: prompt = text to speak, instructions = how to speak it
generate_media(
    prompt="Welcome to the annual report presentation.",
    mode="audio",
    voice="alloy",
    instructions="warm, reflective tone with measured pacing",
    backend_type="openai"
)

# WRONG: Don't put style instructions in prompt
generate_media(prompt="Say this warmly: Welcome...", mode="audio")  # Bad!
```

instructions only works with OpenAI gpt-4o-mini-tts. ElevenLabs uses voice selection for tone.

Audio Understanding

Use read_media (not generate_media) to analyze existing audio:

```python
read_media(path="recording.mp3", prompt="Transcribe and summarize this audio")
```

Need More Control?

Full ElevenLabs voice catalog (28+ voices): See references/voices.md
Music and sound effects details: See references/music_and_sfx.md
Advanced audio capabilities (voice conversion, cloning, isolation, dubbing): See references/advanced.md

Maintainer

massgen Core maintainer

Source details

Full Name: massgen/MassGen
Branch: main
Path in repo: massgen/skills/audio-generation
License: Other
Topics: agent cli llm model-context-protocol python agentic-ai autonomous-agents multi-agent llm-orchestration genai generative-ai collaborative-ai conversational-ai terminal-ui test-time-scaling tool-calling
