Models

Open-weight only, how families are organized, and where to find the live catalog

Melious runs open-weight models only. We don't host Claude, GPT-4o, or Gemini proper. The API shapes of OpenAI and Anthropic work anyway, because they're the shapes everyone writes clients against — but the inference itself runs on weights that are public.

The stance

Open-weight is a constraint, not a preference. It means:

  • Every model we serve has publicly downloadable weights. Qwen, GLM, DeepSeek, Kimi, Mistral, Llama, FLUX, Whisper, Voxtral, and the rest.
  • If a model is closed-weight, we don't serve it. There's no secret back door to a proprietary API.
  • We map Anthropic/OpenAI model names so that unmodified clients keep working. Claude Code sends claude-sonnet-*; we map that to an open-weight model before inference, and the response carries the original name back so the client doesn't get confused (see the sketch below). See From Anthropic.
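A sketch of that flow from the client side, assuming the Anthropic-compatible /v1/messages route and bearer-token auth; the base URL and model string are placeholders, not real catalog entries:

    import requests

    BASE_URL = "https://api.melious.example/v1"  # placeholder, not the real host

    resp = requests.post(
        f"{BASE_URL}/messages",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "claude-sonnet-4-20250514",  # illustrative; mapped to open weights server-side
            "max_tokens": 256,
            "messages": [{"role": "user", "content": "Hello"}],
        },
    )
    print(resp.json()["model"])  # echoes the name the client sent, not the mapped model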

Why this shape? Because verifiable inference matters when your data is the input. You can inspect the weights we run. You can download them and benchmark them yourself. You can audit what your prompts go through. That's not a line we'd be able to hold if closed models were in the mix.

Families and capabilities

We group models by family rather than by vendor — the capability differences matter more than the org that released them.

General-purpose chat. GLM (Z.ai), Qwen (Alibaba), DeepSeek, Mistral Small, Llama Instruct, Hermes (NousResearch), GPT-OSS (OpenAI's open-weight release). These are the everyday workhorses — pick by context window, price, and speed.
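A minimal chat call against one of these, assuming the OpenAI-compatible /v1/chat/completions route; BASE_URL and MODEL_ID are placeholders, so pull the real ID from the live catalog:

    import requests

    BASE_URL = "https://api.melious.example/v1"  # placeholder
    HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=HEADERS,
        json={
            "model": "MODEL_ID",  # placeholder; use an ID from the live catalog
            "messages": [{"role": "user", "content": "Give me one line on context windows."}],
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])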

Reasoning / thinking. DeepSeek-R1, Kimi Thinking, Qwen Thinking variants. Longer responses, higher cost, better on multi-step problems. Slower per token because they produce explicit intermediate reasoning. Use reasoning_effort: "high" when you want them to think harder, "low" to keep it quick.
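For example, assuming the same OpenAI-shaped route, the knob rides along in the request body (model ID is a placeholder):

    import requests

    resp = requests.post(
        "https://api.melious.example/v1/chat/completions",  # placeholder base URL
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "REASONING_MODEL_ID",  # placeholder for a thinking-variant model
            "reasoning_effort": "high",     # or "low" to keep responses quick
            "messages": [{"role": "user", "content": "Plan a three-step database migration."}],
        },
    )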

Vision. Qwen3-VL, Mistral Small (vision variants), Gemma-3. They accept image_url content blocks alongside text. See Vision.
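A sketch of a mixed text-plus-image message, assuming the OpenAI-style content-block shape (placeholder model ID and image URL):

    import requests

    resp = requests.post(
        "https://api.melious.example/v1/chat/completions",  # placeholder base URL
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "VISION_MODEL_ID",  # placeholder for a vision-capable model
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image?"},
                    {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                ],
            }],
        },
    )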

Coding-specialized. Qwen3-Coder, Devstral, some Kimi variants. Tuned on code corpora; pick one when your workload is mostly software.

Audio input. Voxtral. A chat model that takes audio as input alongside text; different from the /v1/audio/transcriptions endpoint, which returns a plain transcript.
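A sketch under the assumption that Melious follows the OpenAI-style input_audio content block; the Audio docs have the exact shape:

    import base64
    import requests

    with open("clip.wav", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()

    resp = requests.post(
        "https://api.melious.example/v1/chat/completions",  # placeholder base URL
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "AUDIO_CHAT_MODEL_ID",  # placeholder for a Voxtral-class model
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is being said here?"},
                    {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
                ],
            }],
        },
    )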

Embeddings. BGE-M3, Qwen3-Embedding, Multilingual E5. Multilingual by default. See Embeddings.
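A minimal embeddings call, assuming the OpenAI-compatible /v1/embeddings route (placeholder model ID):

    import requests

    resp = requests.post(
        "https://api.melious.example/v1/embeddings",  # placeholder base URL
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "EMBEDDING_MODEL_ID",  # placeholder; e.g. a BGE-M3 catalog ID
            "input": ["first passage", "second passage"],
        },
    )
    vectors = [item["embedding"] for item in resp.json()["data"]]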

Rerank. Reranker variants of the above families. Takes a query plus a document list, returns them reordered by relevance. See Rerank.
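A sketch assuming a Cohere-style /v1/rerank request body; check the Rerank docs for the exact schema:

    import requests

    resp = requests.post(
        "https://api.melious.example/v1/rerank",  # placeholder base URL
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "RERANKER_MODEL_ID",  # placeholder
            "query": "how are context windows enforced?",
            "documents": ["doc about limits", "doc about pricing", "doc about errors"],
        },
    )
    for r in resp.json()["results"]:  # assumed response shape: index plus relevance_score
        print(r["index"], r["relevance_score"])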

Images. FLUX (schnell and dev), SDXL variants. Text-to-image, no editing. See Images.
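A text-to-image sketch, assuming the OpenAI-style /v1/images/generations route (placeholder model ID):

    import requests

    resp = requests.post(
        "https://api.melious.example/v1/images/generations",  # placeholder base URL
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "IMAGE_MODEL_ID",  # placeholder; e.g. a FLUX catalog ID
            "prompt": "a lighthouse at dusk, watercolor",
        },
    )
    print(resp.json()["data"][0])  # a URL or base64 payload, depending on configuration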

Speech-to-text. Whisper large-v3 and turbo variants. 50+ languages. See Audio.
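A transcription sketch, assuming the multipart form shape the OpenAI-style /v1/audio/transcriptions endpoint uses (placeholder model ID):

    import requests

    with open("meeting.mp3", "rb") as f:
        resp = requests.post(
            "https://api.melious.example/v1/audio/transcriptions",  # placeholder base URL
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            files={"file": f},
            data={"model": "WHISPER_MODEL_ID"},  # placeholder for a Whisper catalog ID
        )
    print(resp.json()["text"])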

Finding the right model

We deliberately don't keep a static model list in these docs. Any table we pasted here would be out of date by the next release. The hub is the source of truth.
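To pull the same catalog programmatically, the OpenAI-style listing route is the natural fit; a sketch assuming GET /v1/models and bearer auth:

    import requests

    resp = requests.get(
        "https://api.melious.example/v1/models",  # placeholder base URL
        headers={"Authorization": "Bearer YOUR_API_KEY"},
    )
    for model in resp.json()["data"]:
        print(model["id"])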

Context windows and limits

Context windows vary by model — some are 8k, some are 200k+. The _meta.context_length field on GET /v1/models/{id}?include_meta=true gives the exact number for the model you're about to call. Exceeding it yields INFERENCE_3207 — trim the prompt or pick a model with more headroom.
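A pre-flight check along those lines; BASE_URL and MODEL_ID are placeholders:

    import requests

    BASE_URL = "https://api.melious.example/v1"  # placeholder
    model_id = "MODEL_ID"                        # placeholder; an ID from the catalog

    meta = requests.get(
        f"{BASE_URL}/models/{model_id}",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params={"include_meta": "true"},
    ).json()
    print(meta["_meta"]["context_length"])  # exact window for this model
    # Going past it returns INFERENCE_3207: trim the prompt or switch models.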

Request body size is capped at roughly 10 MB. Batch file uploads go up to 105 MB; see Files.

What about caching?

Prompt caching — the Anthropic-style cache_control blocks or OpenAI's automatic prompt caching — is not a user-facing feature on Melious today. Some providers do their own cache handling transparently for repeated prefixes, which can lower latency on your side, but it's not something you can mark up explicitly in the request. If you want explicit cache control, tell us.

Related pages:

  • Which response fields report the model that actually ran (for clients sending mapped names): Messages.
  • Picking an open-weight equivalent of an OpenAI model name: From OpenAI.
  • Detecting model capabilities programmatically: Models reference.
