Audio
POST /v1/audio/speech and /v1/audio/transcriptions — TTS and STT
Two endpoints: text to speech, and speech to text. OpenAI-compatible shapes for both.
Speech (TTS)
Generate audio from text.
Endpoint:
POST /v1/audio/speech

Auth: Bearer token or x-api-key header. Requires scope inference.audio.
Example
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)

audio = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hamburg, Lübeck, Bremen.",
)
audio.stream_to_file("out.mp3")
```

```bash
curl https://api.melious.ai/v1/audio/speech \
  -H "Authorization: Bearer sk-mel-<YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -o out.mp3 \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "Hamburg, Lübeck, Bremen."
  }'
```

Request
| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | — | Text to speak. |
| model | string | "tts-1" | TTS model ID. |
| voice | string | — | "alloy", "echo", "fable", "onyx", "nova", or "shimmer". |
| response_format | string | "mp3" | "mp3", "opus", "aac", "flac", "pcm", or "wav". |
| speed | number | 1.0 | Playback speed, in [0.25, 4.0]. |
| user | string | none | End-user identifier. |
Response
Binary audio data with Content-Type: audio/<format> — no JSON wrapper. Save directly to disk.
Transcriptions (STT)
Turn audio into text.
Endpoint:
POST /v1/audio/transcriptions

Auth: Bearer token or x-api-key header. Requires scope inference.audio.
Content-Type: multipart/form-data.
Max file size: 25 MB.
Example
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)

with open("meeting.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=f,
        language="de",
    )
print(result.text)
```

```bash
curl https://api.melious.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-mel-<YOUR_API_KEY>" \
  -F model="whisper-large-v3-turbo" \
  -F language="de" \
  -F file=@meeting.mp3
```

Request (multipart fields)
| Field | Type | Default | Description |
|---|---|---|---|
| file | file | — | Audio file. Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm. |
| model | string | "whisper-large-v3-turbo" | STT model ID. |
| language | string | auto-detect | ISO-639-1 code ("de", "fr", "en", …). |
| response_format | string | "json" | "json", "text", "srt", "vtt", "verbose_json". |
| temperature | number | 0 | Sampling temperature, [0, 1]. |
Whisper auto-detects among 50+ supported languages. If you know the language, set it explicitly: detection adds a few hundred milliseconds of latency and occasionally guesses wrong on short clips.
Response
response_format: "json" (default):
```json
{
  "text": "Hamburg, Lübeck, Bremen.",
  "language": "de",
  "duration": 2.1
}
```

response_format: "verbose_json" adds a segments array with per-segment timestamps and confidence scores.
response_format: "text" returns a plain-text body. "srt" and "vtt" return subtitle files.
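For timing work, verbose_json is the format to parse. A sketch that flattens the segments array into (start, end, text) tuples; the per-segment field names (start, end, text) follow the usual OpenAI verbose_json shape and are an assumption here:

```python
def flatten_segments(payload: dict) -> list[tuple[float, float, str]]:
    """Collapse a verbose_json transcription into (start, end, text) tuples."""
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in payload.get("segments", [])
    ]

# Hypothetical payload in the assumed verbose_json shape:
sample = {
    "text": "Hamburg, Lübeck, Bremen.",
    "segments": [
        {"start": 0.0, "end": 1.0, "text": " Hamburg,"},
        {"start": 1.0, "end": 2.1, "text": " Lübeck, Bremen."},
    ],
}
print(flatten_segments(sample))
# → [(0.0, 1.0, 'Hamburg,'), (1.0, 2.1, 'Lübeck, Bremen.')]
```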
What about translations?
OpenAI's /v1/audio/translations endpoint (translate audio to English) isn't implemented. Workaround: transcribe with whisper-large-v3-turbo (it auto-handles language), then pipe the text through Chat completions with a translation prompt.
Audio-in chat
A different path from STT: some chat models accept audio as message content directly (e.g. Voxtral variants). That's not this endpoint — it's Chat completions with audio content blocks. Check _meta.capabilities.audio_input on GET /v1/models/{id}?include_meta=true to find them.
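That discovery step can be scripted. A sketch that filters a GET /v1/models?include_meta=true listing down to audio-capable chat models; the _meta.capabilities.audio_input flag is from this page, while the surrounding response shape (a data list of model objects) is an assumption:

```python
def audio_input_models(listing: dict) -> list[str]:
    """IDs of models whose metadata advertises audio input support."""
    return [
        m["id"]
        for m in listing.get("data", [])
        if m.get("_meta", {}).get("capabilities", {}).get("audio_input")
    ]

# Hypothetical listing in the assumed shape:
sample = {"data": [
    {"id": "voxtral-small", "_meta": {"capabilities": {"audio_input": True}}},
    {"id": "tts-1", "_meta": {"capabilities": {"audio_input": False}}},
]}
print(audio_input_models(sample))  # → ['voxtral-small']
```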
Errors
- VALIDATION_4016: file exceeds 25 MB.
- VALIDATION_4005: unsupported audio format.
- INFERENCE_3001: unknown model.
- AUTH_1015: missing inference.audio scope.
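Client code can branch on these codes rather than on message strings. A sketch, assuming the error body nests the code under error.code (that shape is our assumption, not documented here):

```python
# Remediation hints keyed by the error codes documented above.
ACTIONS = {
    "VALIDATION_4016": "split or compress the file below 25 MB",
    "VALIDATION_4005": "convert to a supported format (e.g. mp3 or wav)",
    "INFERENCE_3001": "check GET /v1/models for valid model IDs",
    "AUTH_1015": "request a key with the inference.audio scope",
}

def hint_for(error_body: dict) -> str:
    """Map a parsed error response to a remediation hint."""
    code = error_body.get("error", {}).get("code", "")
    return ACTIONS.get(code, "unexpected error: " + str(error_body))
```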
Related
Models for STT/TTS model discovery • Routing for bulk transcription cost savings.