Transcription
POST /v1/audio/transcriptions — turn speech into text, OpenAI-compatible
Turn an audio file into text. OpenAI-compatible request and response, Whisper-class models underneath.
Endpoint:
POST /v1/audio/transcriptionsAuth: Bearer token or x-api-key. Requires scope inference.audio.
Content-Type: multipart/form-data.
Max file size: 25 MB.
Example
from openai import OpenAI
client = OpenAI(
api_key="sk-mel-<YOUR_API_KEY>",
base_url="https://api.melious.ai/v1",
)
with open("meeting.mp3", "rb") as f:
result = client.audio.transcriptions.create(
model="<STT_MODEL_ID>", # a transcription model from the hub
file=f,
language="de",
)
print(result.text)curl https://api.melious.ai/v1/audio/transcriptions \
-H "Authorization: Bearer sk-mel-<YOUR_API_KEY>" \
-F model="<STT_MODEL_ID>" \
-F language="de" \
-F file=@meeting.mp3Pick a current transcription model ID from melious.ai/hub/models (filter by audio), or call GET /v1/models?include_meta=true and look for _meta.type == "audio".
Request (multipart fields)
| Field | Type | Default | Description |
|---|---|---|---|
file | file | — | Audio file. Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm. |
model | string | — | Transcription model ID. |
language | string | auto-detect | ISO-639-1 code ("de", "fr", "en", …). |
response_format | string | "json" | "json", "text", "srt", "vtt", "verbose_json". |
temperature | number | 0 | Sampling temperature, [0, 1]. |
Whisper-class models handle 50+ languages and auto-detect by default. If you know the language, set it — detection adds a few hundred milliseconds and occasionally picks wrong for short clips.
Response
response_format: "json" (default):
{
"text": "Hamburg, Lübeck, Bremen.",
"language": "de",
"duration": 2.1
}verbose_json adds a segments array with per-segment timestamps and confidence. text returns a plain-text body; srt and vtt return subtitle files.
What about translations?
OpenAI's /v1/audio/translations endpoint (transcribe and translate to English in one step) isn't implemented. Workaround: transcribe in the source language, then pipe the text through Chat completions with a translation prompt.
Audio as chat input
A different path from transcription: some chat models accept audio directly as message content, then reason about it rather than just transcribing. That's not this endpoint — it's Chat completions with audio content blocks. Check _meta.capabilities.audio_input on GET /v1/models/{id}?include_meta=true to find those models.
Errors
VALIDATION_4016— file exceeds 25 MB.VALIDATION_4005— unsupported audio format.INFERENCE_3001— unknown model.AUTH_1015— missing theinference.audioscope.
Related
Models for transcription-model discovery • Routing for cost savings on bulk transcription.