
Vision

Send images alongside text — URLs or base64, both shapes

Vision-capable chat models accept images as part of the message content. This works in both the OpenAI and Anthropic request shapes, provided the underlying model actually supports vision (not all do).

OpenAI shape

from openai import OpenAI

client = OpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)

response = client.chat.completions.create(
    model="qwen3-vl-235b-a22b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/hamburg-harbor.jpg"},
                },
            ],
        },
    ],
)
print(response.choices[0].message.content)

Base64 data URI

When you have the bytes in memory (upload, screenshot, generated image), pass a data URI:

import base64

with open("harbor.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-vl-235b-a22b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                },
            ],
        },
    ],
)

Multiple images in one turn

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Which of these is in Hamburg?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
        ],
    },
]
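When the set of images varies at runtime, it can be convenient to build the content list programmatically. A small sketch; `image_message` is a hypothetical helper, not part of any SDK:

```python
def image_message(prompt: str, urls: list[str]) -> dict:
    """Build a user message: one text part followed by one image part per URL."""
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in urls]
    return {"role": "user", "content": content}

messages = [image_message("Which of these is in Hamburg?", [
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
])]
```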

Anthropic shape

from anthropic import Anthropic
import base64

client = Anthropic(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai",
)

with open("harbor.jpg", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-sonnet-4",   # mapped to a vision-capable open-weight model
    max_tokens=512,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": b64,
                    },
                },
                {"type": "text", "text": "Describe what you see."},
            ],
        },
    ],
)
print(response.content[0].text)

URL-source images also work:

{"type": "image", "source": {"type": "url", "url": "https://example.com/harbor.jpg"}}

Privacy: URLs are re-encoded server-side

When you pass an external URL, Melious fetches the image and converts it to base64 before handing it to the provider. The provider never sees your URL — they only see the bytes. If the fetch fails (bad URL, auth required, 404), you get INFERENCE_3209.

This matters because some external CDNs track request origin and some clients ship user-identifying tokens in URLs. Re-encoding breaks both.
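If you want to handle fetch failures distinctly (for example, by re-sending the image inline as base64), you can branch on the error code. A minimal sketch, assuming the error body carries the code under `error.code` — the exact payload shape is an assumption, so adjust to what your client library actually surfaces:

```python
def is_url_fetch_failure(error_body: dict) -> bool:
    """Return True if the error payload carries the INFERENCE_3209 code,
    which Melious returns when a server-side image fetch fails."""
    return (error_body.get("error") or {}).get("code") == "INFERENCE_3209"

# On a fetch failure, a reasonable fallback is to download the image
# yourself and retry with a base64 data URI instead of the URL.
```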

Picking a vision model

Check _meta.capabilities.vision == true on GET /v1/models?include_meta=true, or filter for vision on melious.ai/hub. Common choices today: Qwen3-VL, Mistral Small 3.x vision variants, Gemma-3 27B.
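In code, that filter might look like the following — a sketch assuming the models endpoint returns the usual OpenAI-style list shape with the documented `_meta` block attached:

```python
def vision_models(models_payload: dict) -> list[str]:
    """Return the IDs of models whose _meta reports vision capability."""
    return [
        m["id"]
        for m in models_payload.get("data", [])
        if m.get("_meta", {}).get("capabilities", {}).get("vision") is True
    ]

# The payload would typically come from:
#   GET https://api.melious.ai/v1/models?include_meta=true
```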

Trade-offs:

  • Qwen3-VL — strong on OCR, text-in-image, chart reading.
  • Mistral Small — lighter, cheaper, fine for general "describe this photo" work.
  • Gemma-3 27B — a middle ground; Google's vision model with a permissive license.

Passing images to a non-vision model returns INFERENCE_3201.

Limits

  • Per-image tokens are counted at a "high detail" rate, which can consume a significant share of a long context window.
  • Max image size varies by provider but generally tops out around 20 MB decoded. Resize very large images before sending.
  • Animated formats (GIF, WebP animation) — we pass the first frame only. Don't rely on motion content.
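To stay under the decoded-size cap, you can check the raw byte count before encoding. A stdlib-only sketch; the 20 MB figure is the rough ceiling mentioned above, not a guaranteed per-provider limit:

```python
import base64

MAX_DECODED_BYTES = 20 * 1024 * 1024  # rough ceiling; varies by provider

def fits_size_limit(image_bytes: bytes) -> bool:
    """Check raw image bytes against the approximate decoded-size cap."""
    return len(image_bytes) <= MAX_DECODED_BYTES

def to_data_uri(image_bytes: bytes, media_type: str = "image/jpeg") -> str:
    """Encode bytes as a data URI usable in an image_url content part."""
    b64 = base64.b64encode(image_bytes).decode()
    return f"data:{media_type};base64,{b64}"
```

If `fits_size_limit` returns False, resize or recompress the image (e.g. with Pillow) before encoding rather than sending it as-is.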

If you need layout reasoning (bounding boxes, document structure, form fields), consider Document conversion first — it produces structured markdown that a text model can then reason about, which is usually cheaper than direct visual reasoning.

See Chat completions and Messages for full endpoint details • Models for capability discovery.
