Vision
Send images alongside text — URLs or base64, both shapes
Vision-capable chat models accept images as part of the message content. This works in both the OpenAI and Anthropic request shapes, provided the model you call actually supports vision (not all do).
OpenAI shape
from openai import OpenAI

client = OpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)

response = client.chat.completions.create(
    model="qwen3-vl-235b-a22b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/hamburg-harbor.jpg"},
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)

Base64 data URI
When you have the bytes in memory (upload, screenshot, generated image), pass a data URI:
import base64

with open("harbor.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-vl-235b-a22b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                },
            ],
        },
    ],
)

Multiple images in one turn
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Which of these is in Hamburg?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
        ],
    },
]

Anthropic shape
from anthropic import Anthropic
import base64

client = Anthropic(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai",
)

with open("harbor.jpg", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-sonnet-4",  # mapped to a vision-capable open-weight model
    max_tokens=512,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": b64,
                    },
                },
                {"type": "text", "text": "Describe what you see."},
            ],
        },
    ],
)

print(response.content[0].text)

URL-source images also work:
{"type": "image", "source": {"type": "url", "url": "https://example.com/harbor.jpg"}}Privacy: URLs are re-encoded server-side
When you pass an external URL, Melious fetches the image and converts it to base64 before handing it to the provider. The provider never sees your URL — they only see the bytes. If the fetch fails (bad URL, auth required, 404), you get INFERENCE_3209.
This matters because some external CDNs track request origin and some clients ship user-identifying tokens in URLs. Re-encoding closes off both leak paths.
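If you want to handle a failed fetch gracefully, catch the SDK's API error and check for that code. A minimal sketch with the OpenAI client from the examples above; the exact exception class and where the code appears in the error body are assumptions, so adjust to what your SDK actually raises:

import openai

try:
    response = client.chat.completions.create(
        model="qwen3-vl-235b-a22b-instruct",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image?"},
                    # A URL the server cannot fetch (404, auth wall, expired link).
                    {"type": "image_url", "image_url": {"url": "https://example.com/private.jpg"}},
                ],
            },
        ],
    )
except openai.APIStatusError as e:
    # Assumption: INFERENCE_3209 appears in the error body / message text.
    if "INFERENCE_3209" in str(e):
        print("Server-side image fetch failed; download and send base64 instead.")
    else:
        raise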
Picking a vision model
Check _meta.capabilities.vision == true on GET /v1/models?include_meta=true, or filter for vision on melious.ai/hub. Common choices today: Qwen3-VL, Mistral Small 3.x vision variants, Gemma-3 27B.
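If you want to do that check in code, here is a minimal sketch using requests; it assumes the endpoint returns the usual OpenAI-style list shape ({"data": [...]}) with the _meta block attached to each entry, and that the API accepts a standard Bearer token:

import requests

resp = requests.get(
    "https://api.melious.ai/v1/models",
    params={"include_meta": "true"},
    headers={"Authorization": "Bearer sk-mel-<YOUR_API_KEY>"},
    timeout=30,
)
resp.raise_for_status()

# Keep only models whose metadata advertises vision support.
vision_models = [
    m["id"]
    for m in resp.json().get("data", [])
    if m.get("_meta", {}).get("capabilities", {}).get("vision")
]
print(vision_models)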
Trade-offs:
- Qwen3-VL — strong on OCR, text-in-image, chart reading.
- Mistral Small — lighter, cheaper, fine for general "describe this photo" work.
- Gemma-3 27B — a middle ground; Google's vision model with a permissive license.
Passing images to a non-vision model returns INFERENCE_3201.
Limits
- Per-image tokens are counted at a "high detail" rate, significant for long-context windows.
- Max image size varies by provider but generally tops out around 20 MB decoded. Resize very large images before sending (see the sketch after this list).
- Animated formats (GIF, WebP animation) — we pass the first frame only. Don't rely on motion content.
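To stay under that size ceiling, one option is a quick resizing pass with Pillow before encoding; the 2048 px cap and JPEG quality below are arbitrary starting points rather than Melious requirements:

import base64
import io

from PIL import Image

def encode_resized(path: str, max_side: int = 2048) -> str:
    """Downscale so the longest side is <= max_side, then return a JPEG data URI."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # in-place, preserves aspect ratio
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=90)
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

# Drop the result anywhere an image_url is expected:
# {"type": "image_url", "image_url": {"url": encode_resized("huge-scan.jpg")}}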
If you need layout reasoning (bounding boxes, document structure, form fields), consider Document conversion first — it produces structured markdown that a text model can then reason about, usually cheaper than visual reasoning directly.
Related
Chat completions and Messages for full endpoint details • Models for capability discovery.