Agent skill

tts

Text-to-speech and voice narration. Triggers on: "朗读这段", "配音", "TTS", "语音合成", "text to speech", "read this aloud", "convert to speech", "voice narration", "read aloud".

Install this agent skill to your Project

npx add-skill https://github.com/marswaveai/skills/tree/main/tts

Metadata

Additional technical details for this skill

openclaw
{
    "emoji": "\ud83d\udd0a",
    "requires": {
        "env": [
            "LISTENHUB_API_KEY"
        ]
    },
    "primaryEnv": "LISTENHUB_API_KEY"
}

SKILL.md

When to Use

  • User wants to convert text to spoken audio
  • User asks for "read aloud", "TTS", "text to speech", "voice narration"
  • User says "朗读", "配音", "语音合成"
  • User wants multi-speaker scripted audio or dialogue

When NOT to Use

  • User wants a podcast-style discussion with topic exploration (use /podcast)
  • User wants an explainer video with visuals (use /explainer)
  • User wants to generate an image (use /image-gen)

Purpose

Convert text into natural-sounding speech audio. Two paths:

  1. Quick mode (/v1/tts): Single voice, low-latency, sync MP3 stream. For casual chat, reading snippets, instant audio.
  2. Script mode (/v1/speech): Multi-speaker, per-segment voice assignment. For dialogue, audiobooks, scripted content.

Hard Constraints

  • No shell scripts. Construct curl commands from the API reference files listed under API Reference
  • Always read shared/authentication.md for API key and headers
  • Follow shared/common-patterns.md for errors and interaction patterns
  • Never hardcode speaker IDs in API calls — use built-in defaults from shared/speaker-selection.md as fallback only; fetch from the speakers API when the user wants to change voice
  • Always read config following shared/config-pattern.md before any interaction
  • Always follow shared/speaker-selection.md for speaker selection (text table + free-text input)
  • Never save files to ~/Downloads/ or /tmp/ as primary output — save artifacts to the current working directory with friendly topic-based names (see shared/config-pattern.md § Artifact Naming)

Mode Detection

Determine the mode from the user's input automatically before asking any questions:

| Signal | Mode |
| --- | --- |
| "多角色", "脚本", "对话", "script", "dialogue", "multi-speaker" | Script |
| Multiple characters mentioned by name or role | Script |
| Input contains structured segments (A: ..., B: ...) | Script |
| Single paragraph of text, no character markers | Quick |
| "读一下", "read this", "TTS", "朗读" with plain text | Quick |
| Ambiguous | Quick (default) |
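
The keyword and structure rows of the table can be approximated mechanically. The following is a minimal sketch, not part of the skill itself: `detect_mode` is a hypothetical helper, and the rule "two or more lines shaped like `Name: text`" is an assumption about what counts as structured segments. The row about characters mentioned by name or role needs the agent's own judgment and is not covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "script" or "quick" for a user request.
detect_mode() {
  local input="$1" segments
  # Chinese script-mode keywords from the table (plain substring match).
  if printf '%s' "$input" | grep -qE '多角色|脚本|对话'; then
    echo "script"; return
  fi
  # English keywords, matched as whole words so "transcript" does not count.
  if printf '%s' "$input" | grep -qiwE 'script|dialogue|multi-speaker'; then
    echo "script"; return
  fi
  # Assumption: two or more "Name: text" lines mean structured segments.
  segments=$(printf '%s\n' "$input" | grep -cE '^[^:]{1,30}:[[:space:]]*[^[:space:]]')
  if [ "$segments" -ge 2 ]; then
    echo "script"; return
  fi
  # Everything else, including ambiguous input, defaults to quick mode.
  echo "quick"
}
```

A single line such as `TTS this: ...` contains only one colon-delimited line, so it stays in Quick mode, which matches the default in the last row.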

Interaction Flow

Step -1: API Key Check

Follow shared/config-pattern.md § API Key Check. If the key is missing, stop immediately.

Step 0: Config Setup

Follow shared/config-pattern.md Step 0 (Zero-Question Boot).

If the config file doesn't exist, silently create it with defaults and proceed:

bash
mkdir -p ".listenhub/tts"
echo '{"outputMode":"inline","language":null,"defaultSpeakers":{}}' > ".listenhub/tts/config.json"
CONFIG_PATH=".listenhub/tts/config.json"
CONFIG=$(cat "$CONFIG_PATH")

Do NOT ask any setup questions. Proceed directly to the Interaction Flow.

If the config file exists (project-local first, then the home directory), read it silently and proceed:

bash
CONFIG_PATH=".listenhub/tts/config.json"
[ ! -f "$CONFIG_PATH" ] && CONFIG_PATH="$HOME/.listenhub/tts/config.json"
CONFIG=$(cat "$CONFIG_PATH")

Setup Flow (user-initiated reconfigure only)

Only run when the user explicitly asks to reconfigure. Display current settings:

Current configuration (tts):
  Output mode: {inline / download / both}
  Language preference: {zh / en / not set}
  Default speaker: {speakerName / built-in default}

Then ask:

  1. outputMode: Follow shared/output-mode.md § Setup Flow Question.

  2. Language (optional): "Default language?"

    • "Chinese (zh)"
    • "English (en)"
    • "Choose manually each time" → keep null

After collecting answers, save immediately:

bash
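# Merge the chosen output mode into the existing config (other keys are preserved)
NEW_CONFIG=$(printf '%s' "$CONFIG" | jq --arg m "$OUTPUT_MODE" '. + {"outputMode": $m}')
# Save language only if the user chose one; the manual-selection option leaves it null
if [ -n "${LANGUAGE:-}" ] && [ "$LANGUAGE" != "null" ]; then
  NEW_CONFIG=$(printf '%s' "$NEW_CONFIG" | jq --arg lang "$LANGUAGE" '. + {"language": $lang}')
fi
printf '%s\n' "$NEW_CONFIG" > "$CONFIG_PATH"
CONFIG=$(cat "$CONFIG_PATH")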

Quick Mode — POST /v1/tts

Step 1: Extract text

Get the text to convert. If the user hasn't provided it, ask:

"What text would you like me to read aloud?"

Step 2: Determine voice

  • If config.defaultSpeakers.{language}[0] is set → use it silently (skip to Step 4)
  • If not set → use the built-in default from shared/speaker-selection.md for the detected language (skip to Step 4)
  • Only show speaker selection if the user explicitly asks to change voice
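
The lookup order above (saved voice first, built-in default second) could be sketched as follows. This assumes jq is available, as the Setup Flow already does. `resolve_voice` is a hypothetical helper, and the built-in default is passed in as an argument because its actual value lives in shared/speaker-selection.md.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve_voice CONFIG_JSON LANGUAGE BUILTIN_DEFAULT
# Prints the saved voice for the language if present, else the built-in default.
resolve_voice() {
  local config="$1" lang="$2" builtin="$3" saved
  # "// empty" turns a missing or null entry into empty output.
  saved=$(printf '%s' "$config" | jq -r --arg lang "$lang" '.defaultSpeakers[$lang][0] // empty')
  if [ -n "$saved" ]; then
    printf '%s\n' "$saved"
  else
    printf '%s\n' "$builtin"
  fi
}
```

For example, `resolve_voice "$CONFIG" en "$BUILTIN_EN"` would return the saved English voice when one exists, where `BUILTIN_EN` is a placeholder for the default read from shared/speaker-selection.md.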

Step 3: Save preference

After the user explicitly selects a new voice (not when using defaults):

Question: "Save {voice name} as your default voice for {language}?"
Options:
  - "Yes" — update .listenhub/tts/config.json
  - "No" — use for this session only

Step 4: Confirm

Ready to generate:

  Text: "{first 80 chars}..."
  Voice: {voice name}

Proceed?
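
The `{first 80 chars}...` preview in the confirmation can be produced with bash substring expansion. A minimal sketch, assuming bash; `preview_text` is a hypothetical helper. As a design choice it appends the ellipsis only when the text was actually truncated.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print at most the first 80 characters of the text,
# followed by "..." only if something was cut off.
preview_text() {
  local text="$1" limit=80
  if [ "${#text}" -gt "$limit" ]; then
    printf '%s...\n' "${text:0:limit}"
  else
    printf '%s\n' "$text"
  fi
}
```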

Step 5: Generate

bash
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{"input": "...", "voice": "..."}' \
  --output /tmp/tts-output.mp3
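
The `"..."` placeholders in the request body stand for the real text and voice. Splicing user text into JSON by hand breaks as soon as the text contains quotes or newlines, so one option is to let jq do the escaping. This is a sketch under the assumption that jq is installed; `build_tts_body` is a hypothetical helper name, and the field names `input` and `voice` are taken from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build_tts_body TEXT VOICE
# Prints a compact JSON body with both values safely escaped.
build_tts_body() {
  jq -cn --arg input "$1" --arg voice "$2" '{input: $input, voice: $voice}'
}
```

The result can replace the literal body in the curl call, for example `-d "$(build_tts_body "$TEXT" "$VOICE")"`, where `TEXT` and `VOICE` are placeholder variables.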

Step 6: Present result

Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.

Use a timestamped jobId: $(date +%s)

inline or both (TTS quick returns a sync audio stream — no audioUrl):

bash
JOB_ID=$(date +%s)
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{"input": "...", "voice": "..."}' \
  --output /tmp/tts-${JOB_ID}.mp3

Then use the Read tool on /tmp/tts-{jobId}.mp3.

Present:

Audio generated!

download or both: Generate a topic slug from the text content following shared/config-pattern.md § Artifact Naming.

bash
SLUG="{topic-slug}"  # e.g. "server-maintenance-notice"
NAME="${SLUG}.mp3"
# Dedup: if file exists, append -2, -3, etc.
BASE="${NAME%.*}"; EXT="${NAME##*.}"; i=2
while [ -e "$NAME" ]; do NAME="${BASE}-${i}.${EXT}"; i=$((i+1)); done
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{"input": "...", "voice": "..."}' \
  --output "$NAME"

Present:

Audio generated!

Saved to the current directory:
  {NAME}
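
The actual slug rules are defined in shared/config-pattern.md § Artifact Naming, which is not reproduced here. As an illustration only, the sketch below normalizes an already chosen English title into a filename-safe slug. `slugify`, the 40-character cap, and the `audio` fallback are all assumptions, not the skill's documented behavior.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a short English title into a lowercase,
# hyphen-separated slug. Non-ASCII input collapses to the fallback "audio".
slugify() {
  local slug
  slug=$(printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//' \
    | cut -c1-40 \
    | sed -E 's/-+$//')
  printf '%s\n' "${slug:-audio}"
}
```

The output would feed the `SLUG` variable in the download snippet above, which then handles the `-2`, `-3` deduplication.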

Script Mode — POST /v1/speech

Step 1: Get scripts

Determine whether the user already has a scripts array:

  • Already provided (JSON or clear segments): parse and display for confirmation

  • Not yet provided: help the user structure segments. Ask:

    "Please provide the script with speaker assignments. Format: each line as SpeakerName: text content. I'll convert it."

    Once the user provides the script, parse it into the scripts JSON format.
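
Parsing the `SpeakerName: text content` format could look like the sketch below. It only splits lines into speaker and content; voice IDs are attached later in Step 2. `parse_script` is a hypothetical helper, and two simplifications are assumptions: only the ASCII colon is treated as a separator, and lines without a colon are skipped rather than merged into the previous segment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "Name: text" lines on stdin and print
# "Name<TAB>text" for each valid segment.
parse_script() {
  local line name content
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      *:*)
        name=${line%%:*}      # everything before the first colon
        content=${line#*:}    # everything after the first colon
        # Trim surrounding whitespace from both parts.
        name=$(printf '%s' "$name" | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//')
        content=$(printf '%s' "$content" | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//')
        if [ -n "$name" ] && [ -n "$content" ]; then
          printf '%s\t%s\n' "$name" "$content"
        fi
        ;;
    esac
  done
}
```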

Step 2: Assign voices per character

For each unique character in the script:

  • If config.defaultSpeakers.{language} has saved voices → auto-assign silently (one per character in order)
  • If not set → use built-in defaults from shared/speaker-selection.md (Primary for first character, Secondary for second)
  • Only show speaker selection if the user explicitly asks to change voices
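
The "one per character in order" rule can be sketched with awk. `assign_voices` is a hypothetical helper. The source does not say what happens when there are more characters than voices, so this sketch marks the extras as `UNASSIGNED` (an assumption) instead of guessing a voice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assign_voices "VOICE1 VOICE2 ..."
# Reads one speaker name per line on stdin and prints "Name<TAB>voice"
# for each unique name, in order of first appearance.
assign_voices() {
  awk -v voices="$1" '
    BEGIN { n = split(voices, v, " ") }
    $0 != "" && !($0 in seen) {
      seen[$0] = 1
      i++
      # Characters beyond the available voices are flagged, not guessed.
      print $0 "\t" (i <= n ? v[i] : "UNASSIGNED")
    }'
}
```

The voice list would come from `config.defaultSpeakers.{language}` or from the Primary and Secondary defaults in shared/speaker-selection.md.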

Step 3: Save preferences

After all voices are assigned (if any were new):

Question: "Save these voice assignments for future sessions?"
Options:
  - "Yes" — update defaultSpeakers in .listenhub/tts/config.json
  - "No" — use for this session only

Step 4: Confirm

Ready to generate:

  Characters:
    {name}: {voice}
    {name}: {voice}
  Segments: {count}
  Title: (auto-generated)

Proceed?

Step 5: Generate

Write the request body to a temp file, then submit:

bash
# Write request to temp file
cat > /tmp/lh-speech-request.json << 'ENDJSON'
{
  "scripts": [
    {"content": "...", "speakerId": "..."},
    {"content": "...", "speakerId": "..."}
  ]
}
ENDJSON

# Submit
curl -sS -X POST "https://api.marswave.ai/openapi/v1/speech" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d @/tmp/lh-speech-request.json

rm /tmp/lh-speech-request.json

Step 6: Present result

Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.

inline or both: Display the audioUrl and subtitlesUrl as clickable links.

Present:

Audio generated!

Listen online: {audioUrl}
Subtitles: {subtitlesUrl}
Duration: {audioDuration / 1000}s
Credits used: {credits}
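
The `{audioDuration / 1000}s` field implies that `audioDuration` is reported in milliseconds. Shell arithmetic is integer-only, so a small awk call is one way to keep a decimal place. `ms_to_seconds` is a hypothetical helper, and the one-decimal format is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert milliseconds to seconds with one decimal.
ms_to_seconds() {
  # LC_ALL=C keeps the decimal separator as "." regardless of locale.
  LC_ALL=C awk -v ms="$1" 'BEGIN { printf "%.1fs\n", ms / 1000 }'
}
```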

download or both: Also download the file. Generate a topic slug following shared/config-pattern.md § Artifact Naming.

bash
SLUG="{topic-slug}"  # e.g. "welcome-dialogue"
NAME="${SLUG}.mp3"
# Dedup: if file exists, append -2, -3, etc.
BASE="${NAME%.*}"; EXT="${NAME##*.}"; i=2
while [ -e "$NAME" ]; do NAME="${BASE}-${i}.${EXT}"; i=$((i+1)); done
curl -sS -o "$NAME" "{audioUrl}"

Present:

Saved to the current directory:
  {NAME}

Updating Config

When saving preferences, merge into .listenhub/tts/config.json — do not overwrite unchanged keys.

  • Quick voice: set defaultSpeakers.{language}[0] to the selected speakerId
  • Script voices: set defaultSpeakers.{language} to the full array assigned this session
  • Language: set language if the user explicitly specifies it
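
The merge rules above can be expressed as jq path assignments, which leave every other key untouched. This is a sketch assuming jq 1.6 or later (for `$ARGS.positional`); `set_quick_voice` and `set_script_voices` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set_quick_voice CONFIG_JSON LANGUAGE SPEAKER_ID
# Sets defaultSpeakers.{language}[0], creating the array if needed.
set_quick_voice() {
  printf '%s' "$1" | jq -c --arg lang "$2" --arg id "$3" \
    '.defaultSpeakers[$lang][0] = $id'
}

# Hypothetical helper: set_script_voices CONFIG_JSON LANGUAGE ID...
# Replaces defaultSpeakers.{language} with the full list of IDs.
set_script_voices() {
  local config="$1" lang="$2"
  shift 2
  printf '%s' "$config" | jq -c --arg lang "$lang" \
    '.defaultSpeakers[$lang] = $ARGS.positional' --args "$@"
}
```

Both helpers print the updated JSON rather than writing it, so the caller decides where it goes, for example `set_quick_voice "$CONFIG" zh "$SPEAKER_ID" > "$CONFIG_PATH"`.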

API Reference

  • TTS & Speech endpoints: shared/api-tts.md
  • Speaker list: shared/api-speakers.md
  • Speaker selection guide: shared/speaker-selection.md
  • Error handling: shared/common-patterns.md § Error Handling
  • Long text input: shared/common-patterns.md § Long Text Input

Composability

  • Invokes: speakers API (for speaker selection)
  • Invoked by: explainer (for voiceover)

Examples

Quick mode:

"TTS this: The server will be down for maintenance at midnight."

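  1. Detect: Quick mode (plain text, "TTS this")
  2. Read config: defaultSpeakers.en[0] is not set
  3. Use the built-in default voice from shared/speaker-selection.md; fetch speakers only if the user asks to change voice (e.g. user picks "Yuanye")
  4. If a new voice was picked, ask to save → yes → update config
  5. POST /v1/tts with input + voice
  6. Present per outputMode: inline reads /tmp/tts-{jobId}.mp3, download saves a topic-named file such as server-maintenance-notice.mp3 in the current directory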

Script mode:

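"帮我做一段双人对话配音,A说:欢迎大家,B说:谢谢邀请"
(English: "Make me a two-person dialogue voice-over. A says: Welcome, everyone. B says: Thanks for the invitation.")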

  1. Detect: Script mode ("双人对话")
  2. Parse segments: A → "欢迎大家", B → "谢谢邀请"
  3. Read config: defaultSpeakers.zh is empty
  4. Assign the built-in zh defaults (Primary → A, Secondary → B); fetch zh speakers only if the user asks to change voices
  5. If any voices were newly chosen, ask to save → yes → update config
  6. POST /v1/speech with scripts array
  7. Present: audioUrl, subtitlesUrl, duration
