MiniMax Audio Models
Overview
MiniMax provides audio capabilities in StoryFlow, including voice cloning and text-to-speech (TTS). Use these when you need consistent narration voices across scenes.
MiniMax Voice Clone
What it does
Create a reusable voice ID by cloning timbre and speaking style from an audio sample.
Inputs
- Audio Sample (Required): MP3/M4A/WAV, 10s–5min (max 20MB).
- Demo Text (Required): used to generate a short preview after cloning.
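
The sample limits above can be checked locally before uploading. The following is a minimal sketch, not part of StoryFlow: `check_sample` is a hypothetical helper, the caller is assumed to already know the file size and duration, and "20MB" is assumed to mean 20 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Limits taken from the Inputs list; the byte interpretation of "20MB" is an assumption.
ALLOWED_EXTENSIONS = {".mp3", ".m4a", ".wav"}
MIN_SECONDS = 10
MAX_SECONDS = 5 * 60
MAX_BYTES = 20 * 1024 * 1024


def check_sample(filename: str, size_bytes: int, duration_seconds: float) -> list[str]:
    """Return a list of problems; an empty list means the sample fits the limits."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'none'}")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 20MB")
    if not MIN_SECONDS <= duration_seconds <= MAX_SECONDS:
        problems.append("duration outside 10s-5min")
    return problems


print(check_sample("narrator.wav", 5_000_000, 60))
```
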
Parameters
| Parameter | Type | Default | Allowed | What it does |
|---|---|---|---|---|
| voice_model | string | speech-2.5-hd-preview | speech-2.5-hd-preview | Selects the cloning model. |
| accuracy | number | 0.8 | 0.0 – 1.0 | Controls how strictly the cloned voice matches the sample. Higher = closer match. |
| need_noise_reduction | boolean | true | true, false | Enables noise reduction on the sample before cloning. |
| need_volume_normalization | boolean | true | true, false | Normalizes volume for more consistent output. |
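
As an illustration of how these parameters fit together, the sketch below models them as a Python dataclass using the defaults and ranges from the table. `VoiceCloneParams` is a hypothetical name, and the dictionary it produces is only an example shape, not a documented StoryFlow or MiniMax request format.

```python
from dataclasses import asdict, dataclass

CLONE_MODELS = {"speech-2.5-hd-preview"}  # only value listed in the table


@dataclass
class VoiceCloneParams:
    """Hypothetical container for the Voice Clone parameters."""
    voice_model: str = "speech-2.5-hd-preview"
    accuracy: float = 0.8
    need_noise_reduction: bool = True
    need_volume_normalization: bool = True

    def __post_init__(self) -> None:
        # Reject values outside the "Allowed" column.
        if self.voice_model not in CLONE_MODELS:
            raise ValueError(f"unsupported voice_model: {self.voice_model}")
        if not 0.0 <= self.accuracy <= 1.0:
            raise ValueError("accuracy must be between 0.0 and 1.0")


# A stricter match for a clean studio recording.
params = VoiceCloneParams(accuracy=0.9, need_noise_reduction=False)
print(asdict(params))
```

Validating at construction time means an out-of-range `accuracy` fails before any audio is uploaded.
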
Tips
- Use clean, single-speaker audio with minimal music/noise.
- Provide 30–90 seconds of steady speech for better timbre stability.
- A cloned voice ID is temporary: it expires unless you use it in a TTS request within 7 days.
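
To keep track of the 7-day window, a project could record when each voice was cloned. The sketch below assumes the window starts at the moment of cloning, which this page does not state explicitly; `first_use_deadline` and `needs_first_use` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone

RETENTION_WINDOW = timedelta(days=7)  # from the tip above


def first_use_deadline(created_at: datetime) -> datetime:
    """Latest time a first TTS request can be made (assumes the window starts at cloning)."""
    return created_at + RETENTION_WINDOW


def needs_first_use(created_at: datetime, now: datetime, used_in_tts: bool) -> bool:
    """True while an unused voice ID is still inside the window."""
    return not used_in_tts and now <= first_use_deadline(created_at)


created = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(first_use_deadline(created).isoformat())
```
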
MiniMax TTS
What it does
Generate speech from text, using either system preset voices or your custom cloned voice ID.
Inputs
- Text Input (Required): the content to speak.
Parameters
| Parameter | Type | Default | Allowed | What it does |
|---|---|---|---|---|
| voice_model | string | speech-2.6-turbo | speech-2.6-turbo, speech-2.6-hd | Selects the TTS model. Turbo is faster; HD is higher quality. |
| voice_id | string | male-qn-qingse | Preset list | Selects a system timbre (preset voice). Disabled when use_custom_voice=true. |
| use_custom_voice | boolean | false | true, false | Enables using your cloned voice ID instead of a preset timbre. |
| custom_voice_id | string | "" | - | Your cloned voice ID (typically starts with voice_clone_). Only available when use_custom_voice=true. |
| emotion | string | neutral | neutral, happy, sad, angry, fearful, disgusted, surprised | Controls emotional tone (supported in MiniMax 2.6 models). |
| text_normalization | boolean | false | true, false | Improves reading of numbers/dates/symbols in English, with slight added latency. |
| speed | number | 1.0 | 0.5 – 2.0 | Controls speaking rate. |
| vol | number | 1.0 | 0.1 – 10.0 | Controls output volume. |
| pitch | number | 0 | -12 – 12 (integer) | Controls vocal pitch shift. |
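
The interaction between `voice_id`, `use_custom_voice`, and `custom_voice_id` is easiest to see in code. The following sketch validates the table's ranges and keeps only the voice field that is active. `build_tts_params` is a hypothetical helper, the returned dictionary is an illustrative shape rather than a documented request format, and dropping the inactive voice field is a design choice of this sketch.

```python
TTS_MODELS = {"speech-2.6-turbo", "speech-2.6-hd"}
EMOTIONS = {"neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised"}


def build_tts_params(text, *, voice_model="speech-2.6-turbo", voice_id="male-qn-qingse",
                     use_custom_voice=False, custom_voice_id="", emotion="neutral",
                     text_normalization=False, speed=1.0, vol=1.0, pitch=0):
    """Validate TTS settings against the table and return them as a dict."""
    if not text.strip():
        raise ValueError("text input is required")
    if voice_model not in TTS_MODELS:
        raise ValueError(f"unsupported voice_model: {voice_model}")
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 0.1 <= vol <= 10.0:
        raise ValueError("vol must be between 0.1 and 10.0")
    # pitch must be a whole number; bool is excluded because it subclasses int.
    if isinstance(pitch, bool) or not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer between -12 and 12")
    if use_custom_voice and not custom_voice_id:
        raise ValueError("custom_voice_id is required when use_custom_voice is true")

    params = {
        "text": text, "voice_model": voice_model, "use_custom_voice": use_custom_voice,
        "emotion": emotion, "text_normalization": text_normalization,
        "speed": speed, "vol": vol, "pitch": pitch,
    }
    # Only one voice source is active at a time.
    if use_custom_voice:
        params["custom_voice_id"] = custom_voice_id
    else:
        params["voice_id"] = voice_id
    return params


# "voice_clone_demo" is a placeholder ID, not a real voice.
print(build_tts_params("Welcome back.", use_custom_voice=True,
                       custom_voice_id="voice_clone_demo", emotion="happy"))
```
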
Tips
- Split long scripts into smaller paragraphs for more controllable pacing.
- Keep the same voice settings across all scenes to maintain consistency.
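
Both tips can be combined by splitting a script once and reusing a single settings dictionary for every chunk. This is a minimal sketch: `split_script`, `build_requests`, and the 400-character limit are illustrative assumptions, not StoryFlow features or documented limits.

```python
import re


def split_script(script: str, max_chars: int = 400) -> list[str]:
    """Split on blank lines, then pack whole sentences into chunks of up to max_chars."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", script.strip()):
        paragraph = " ".join(paragraph.split())  # collapse internal whitespace
        if not paragraph:
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


# One settings dict shared by every scene keeps the voice consistent.
SHARED_SETTINGS = {"voice_model": "speech-2.6-hd", "voice_id": "male-qn-qingse",
                   "speed": 1.0, "vol": 1.0, "pitch": 0}


def build_requests(script: str, settings: dict) -> list[dict]:
    """Pair each chunk with the same voice settings."""
    return [{"text": chunk, **settings} for chunk in split_script(script)]


script = "First scene. It is short.\n\nSecond scene."
for request in build_requests(script, SHARED_SETTINGS):
    print(request["text"])
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence, so chunk length is a soft limit in this sketch.
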