Audio
POST /v1/audio/speech and /v1/audio/transcriptions — TTS and STT
Two endpoints: text to speech, and speech to text. OpenAI-compatible shapes for both.
Speech (TTS)
Generate audio from text.
Endpoint:
POST /v1/audio/speech

Auth: Bearer token or x-api-key header. Requires scope inference.audio.
Example
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)

audio = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hamburg, Lübeck, Bremen.",
)
audio.stream_to_file("out.mp3")
```

```bash
curl https://api.melious.ai/v1/audio/speech \
  -H "Authorization: Bearer sk-mel-<YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -o out.mp3 \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "Hamburg, Lübeck, Bremen."
  }'
```

Request
| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | — | Text to speak. |
| model | string | "tts-1" | TTS model ID. |
| voice | string | — | "alloy", "echo", "fable", "onyx", "nova", or "shimmer". |
| response_format | string | "mp3" | "mp3", "opus", "aac", "flac", "pcm", or "wav". |
| speed | number | 1.0 | Playback speed, in [0.25, 4.0]. |
| user | string | none | End-user identifier. |
Response
Binary audio data with Content-Type: audio/<format> — no JSON wrapper. Save directly to disk.
Transcriptions (STT)
Turn audio into text.
Endpoint:
POST /v1/audio/transcriptions

Auth: Bearer token or x-api-key header. Requires scope inference.audio.
Content-Type: multipart/form-data.
Max file size: 25 MB.
Example
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)

with open("meeting.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=f,
        language="de",
    )
print(result.text)
```

```bash
curl https://api.melious.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-mel-<YOUR_API_KEY>" \
  -F model="whisper-large-v3-turbo" \
  -F language="de" \
  -F file=@meeting.mp3
```

Request (multipart fields)
| Field | Type | Default | Description |
|---|---|---|---|
| file | file | — | Audio file. Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm. |
| model | string | "whisper-large-v3-turbo" | STT model ID. |
| language | string | auto-detect | ISO-639-1 code ("de", "fr", "en", …). |
| response_format | string | "json" | "json", "text", "srt", "vtt", "verbose_json". |
| temperature | number | 0 | Sampling temperature, [0, 1]. |
Whisper auto-detects among 50+ supported languages. If you know the language, set it explicitly: detection adds a few hundred milliseconds of latency and occasionally guesses wrong on short clips.
Response
response_format: "json" (default):
```json
{
  "text": "Hamburg, Lübeck, Bremen.",
  "language": "de",
  "duration": 2.1
}
```

response_format: "verbose_json" adds a segments array with per-segment timestamps and confidence scores.
response_format: "text" returns a plain-text body. "srt" and "vtt" return subtitle files.
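For timing work, verbose_json is the format to parse. A sketch that flattens the segments array into (start, end, text) tuples; the per-segment field names (start, end, text) follow the usual OpenAI verbose_json shape and are an assumption here:

```python
def flatten_segments(payload: dict) -> list[tuple[float, float, str]]:
    """Collapse a verbose_json transcription into (start, end, text) tuples."""
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in payload.get("segments", [])
    ]

# Hypothetical payload in the assumed verbose_json shape:
sample = {
    "text": "Hamburg, Lübeck, Bremen.",
    "segments": [
        {"start": 0.0, "end": 1.0, "text": " Hamburg,"},
        {"start": 1.0, "end": 2.1, "text": " Lübeck, Bremen."},
    ],
}
print(flatten_segments(sample))
# → [(0.0, 1.0, 'Hamburg,'), (1.0, 2.1, 'Lübeck, Bremen.')]
```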
What about translations?
OpenAI's /v1/audio/translations endpoint (translate audio to English) isn't implemented. Workaround: transcribe with whisper-large-v3-turbo (it auto-handles language), then pipe the text through Chat completions with a translation prompt.
Audio-in chat
A different path from STT: some chat models accept audio as message content directly (e.g. Voxtral variants). That's not this endpoint — it's Chat completions with audio content blocks. Check _meta.capabilities.audio_input on GET /v1/models/{id}?include_meta=true to find them.
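That discovery step can be scripted. A sketch that filters a GET /v1/models?include_meta=true listing down to audio-capable chat models; the _meta.capabilities.audio_input flag is from this page, while the surrounding response shape (a data list of model objects) is an assumption:

```python
def audio_input_models(listing: dict) -> list[str]:
    """IDs of models whose metadata advertises audio input support."""
    return [
        m["id"]
        for m in listing.get("data", [])
        if m.get("_meta", {}).get("capabilities", {}).get("audio_input")
    ]

# Hypothetical listing in the assumed shape:
sample = {"data": [
    {"id": "voxtral-small", "_meta": {"capabilities": {"audio_input": True}}},
    {"id": "tts-1", "_meta": {"capabilities": {"audio_input": False}}},
]}
print(audio_input_models(sample))  # → ['voxtral-small']
```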
Errors
- VALIDATION_4016: file exceeds 25 MB.
- VALIDATION_4005: unsupported audio format.
- INFERENCE_3001: unknown model.
- AUTH_1015: missing inference.audio scope.
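Client code can branch on these codes rather than on message strings. A sketch, assuming the error body nests the code under error.code (that shape is our assumption, not documented here):

```python
# Remediation hints keyed by the error codes documented above.
ACTIONS = {
    "VALIDATION_4016": "split or compress the file below 25 MB",
    "VALIDATION_4005": "convert to a supported format (e.g. mp3 or wav)",
    "INFERENCE_3001": "check GET /v1/models for valid model IDs",
    "AUTH_1015": "request a key with the inference.audio scope",
}

def hint_for(error_body: dict) -> str:
    """Map a parsed error response to a remediation hint."""
    code = error_body.get("error", {}).get("code", "")
    return ACTIONS.get(code, "unexpected error: " + str(error_body))
```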
Related
Models for STT/TTS model discovery • Routing for bulk transcription cost savings.