Inference API
OpenAI-compatible API for chat completions, embeddings, images, and audio
The Melious Inference API provides OpenAI-compatible endpoints for AI model inference. Use the same code you'd write for OpenAI, with added benefits like multi-provider routing and environmental tracking.
Quick Start
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-mel-your-api-key-here",
    base_url="https://api.melious.ai/v1",
)

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

Available Endpoints
Chat Completions
Generate text responses with vision and function calling support.
Embeddings
Create vector embeddings for semantic search and RAG.
Reranking
Rerank documents by semantic relevance to a query.
Image Generation
Generate images from text descriptions.
Audio
Speech-to-text transcription.
Models
List and explore available models.
Streaming
Real-time streaming responses.
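Streamed responses arrive as OpenAI-style server-sent events; with the Python SDK you just pass `stream=True`, but the `data:` lines can also be parsed by hand. A minimal sketch of the latter, assuming the standard OpenAI streaming chunk shape:

```python
import json

def parse_sse_line(line: str):
    """Extract the content delta from one 'data:' line of a streamed
    chat completion; returns None for keep-alives and the [DONE] marker."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")
```

With the SDK, `for chunk in client.chat.completions.create(..., stream=True)` yields the same deltas already parsed.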
Endpoint Summary
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completions with streaming, vision, and tools |
| /v1/embeddings | POST | Generate vector embeddings |
| /v1/rerank | POST | Rerank documents by semantic relevance |
| /v1/images/generations | POST | Generate images from text |
| /v1/audio/transcriptions | POST | Speech-to-text transcription |
| /v1/models | GET | List available models |
| /v1/models/{id} | GET | Get specific model details |
Authentication
All inference endpoints require an API key:
```
Authorization: Bearer sk-mel-your-api-key-here
```

Create API keys at melious.ai/account/api/keys.
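When calling the HTTP API directly, the key goes in the `Authorization` header. A stdlib-only sketch of building an authenticated request (the placeholder key is from the docs; nothing is sent until the request is passed to `urllib.request.urlopen`):

```python
import json
import urllib.request

API_KEY = "sk-mel-your-api-key-here"  # replace with a real key

def chat_request(model: str, messages: list) -> urllib.request.Request:
    """Build an authenticated POST to /v1/chat/completions."""
    return urllib.request.Request(
        "https://api.melious.ai/v1/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```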
OpenAI Compatibility
Melious is a drop-in replacement for the OpenAI API. All standard OpenAI parameters are supported:
```json
{
  "model": "gpt-oss-120b",
  "messages": [{"role": "user", "content": "Hello!"}],
  "temperature": 0.7,
  "max_tokens": 1000,
  "top_p": 1,
  "frequency_penalty": 0,
  "presence_penalty": 0,
  "stream": false
}
```

Melious Extensions
In addition to standard OpenAI parameters, Melious adds:
Routing Presets
Optimize for different priorities:
```json
{
  "model": "gpt-oss-120b",
  "messages": [...],
  "preset": "environment"
}
```

| Preset | Description |
|---|---|
| balanced | Default - balance all metrics |
| speed | Lowest latency |
| price | Lowest cost |
| quality | Highest quality |
| environment | Lowest carbon footprint |
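With the OpenAI Python SDK, a non-standard field like `preset` can be sent through `extra_body`, which the SDK merges into the request JSON. A small sketch (the validation helper is illustrative, not part of the API):

```python
VALID_PRESETS = {"balanced", "speed", "price", "quality", "environment"}

def with_preset(payload: dict, preset: str) -> dict:
    """Return a copy of a chat payload with a routing preset attached."""
    if preset not in VALID_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return {**payload, "preset": preset}

# With the SDK, the same field goes through extra_body:
#   client.chat.completions.create(
#       model="gpt-oss-120b",
#       messages=[{"role": "user", "content": "Hello!"}],
#       extra_body={"preset": "environment"},
#   )
```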
Provider Filters
Restrict to specific regions or constraints:
```json
{
  "model": "gpt-oss-120b",
  "messages": [...],
  "filters": {
    "countries": ["NL", "FR", "DE"],
    "max_carbon_intensity": 200,
    "min_speed_tps": 500
  }
}
```

Environmental Impact
Every response includes environmental metrics:
```json
{
  "id": "chatcmpl-abc123",
  "choices": [...],
  "usage": {...},
  "environment_impact": {
    "carbon_g_co2": 0.06,
    "water_liters": 0.0002,
    "energy_kwh": 0.00015,
    "renewable_percent": 85,
    "pue": 1.18,
    "provider_id": "nebius",
    "location": "NL"
  }
}
```

Available Models
Chat Models (27)
| Model | Brand | Context | Vision | Tools | Reasoning |
|---|---|---|---|---|---|
| gpt-oss-120b | Meta | 131K | No | Yes | No |
| llama-3.1-8b-instruct | Meta | 131K | No | No | No |
| qwen3-235b-a22b-instruct | Qwen | 268K | No | Yes | No |
| qwen3-235b-a22b-thinking | Qwen | 262K | No | Yes | Yes |
| qwen3-coder-480b-a35b-instruct | Qwen | 262K | No | Yes | No |
| qwen3-coder-30b-a3b-instruct | Qwen | 262K | No | Yes | No |
| qwen3-30b-a3b-instruct | Qwen | 262K | No | Yes | No |
| qwen3-30b-a3b-thinking | Qwen | 262K | No | Yes | Yes |
| qwen3-32b | Qwen | 40K | No | Yes | Hybrid |
| qwen3-4b-instruct | Qwen | 262K | No | Yes | No |
| qwen3-4b-thinking | Qwen | 262K | No | Yes | Yes |
| deepseek-r1-0528 | DeepSeek | 131K | No | Yes | Yes |
| kimi-k2-instruct | Moonshot | 134K | No | Yes | No |
| kimi-k2-thinking | Moonshot | 134K | No | Yes | Yes |
| hermes-4-405b | NousResearch | 131K | No | Yes | Hybrid |
| hermes-4-70b | NousResearch | 131K | No | Yes | Hybrid |
| gpt-oss-120b | OpenAI (OSS) | 128K | No | Yes | Yes |
| gpt-oss-20b | OpenAI (OSS) | 128K | No | Yes | Yes |
| gemma-3-27b | Google | 131K | Yes | Yes | No |
| mistral-small-3.2-24b-instruct | Mistral | 131K | Yes | Yes | No |
| devstral-small-2505 | Mistral | 131K | No | Yes | No |
| glm-4.5 | ZAI | 131K | No | Yes | Hybrid |
| glm-4.5-air | ZAI | 131K | No | Yes | Hybrid |
| glm-4.6 | ZAI | 131K | No | Yes | Hybrid |
| deepseek-ocr | DeepSeek | 32K | Yes | Yes | No |
| nemotron-nano-v2-12b | NVIDIA | 131K | No | Yes | No |
| intellect-3 | PrimeIntellect | 131K | No | Yes | No |
Audio-Input Model (1)
| Model | Brand | Context | Description |
|---|---|---|---|
| voxtral-small-24b-2507 | Mistral | 131K | Accepts audio input for chat |
Embedding Models (8)
| Model | Brand | Context |
|---|---|---|
| bge-m3 | BAAI | 8K |
| bge-multilingual-gemma2 | BAAI | 8K |
| bge-en-icl | BAAI | 32K |
| bge-large-en-v1.5 | BAAI | 512 |
| bge-base-en-v1.5 | BAAI | 512 |
| qwen3-embedding-8b | Qwen | 32K |
| e5-mistral-7b-instruct | intfloat | 32K |
| paraphrase-multilingual-mpnet | Sentence Transformers | 512 |
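Embeddings come back as plain float vectors, so downstream similarity is just arithmetic. A sketch comparing two embedded texts, assuming the standard OpenAI embeddings response shape (the commented SDK call is illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# resp = client.embeddings.create(model="bge-m3", input=["query", "doc"])
# vectors = [item.embedding for item in resp.data]
# score = cosine_similarity(vectors[0], vectors[1])
```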
Reranking Models (1)
| Model | Brand | Context |
|---|---|---|
| bge-reranker-v2-m3 | BAAI | 8K |
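Reranking has no OpenAI equivalent, so it is called over plain HTTP. The sketch below assumes the request shape most rerank APIs share (`query`, `documents`, `top_n`); check the Reranking reference for the exact schema:

```python
import json

def rerank_body(query: str, documents: list, top_n: int = 3) -> str:
    """JSON body for POST /v1/rerank (field names assumed, see above)."""
    return json.dumps({
        "model": "bge-reranker-v2-m3",
        "query": query,
        "documents": documents,
        "top_n": min(top_n, len(documents)),
    })
```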
Image Models (6)
| Model | Brand | Description |
|---|---|---|
| flux-dev | Black Forest Labs | High-quality generation |
| flux-schnell | Black Forest Labs | Fast generation |
| sdxl-base-1.0 | Stability AI | SDXL base model |
| sdxl-base-v10 | Stability AI | SDXL base v10 |
| sdxl-lightning-4step | ByteDance | Fast 4-step generation |
| sdxl-lightning-8step | ByteDance | Quality 8-step generation |
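Image generation follows the OpenAI images API shape, so responses carry a `data` array of generated images. A small helper for pulling out URLs (a sketch; responses returned as `b64_json` would need decoding instead):

```python
def image_urls(response: dict) -> list:
    """Collect image URLs from an OpenAI-style images response."""
    return [item["url"] for item in response.get("data", []) if "url" in item]

# resp = client.images.generate(model="flux-schnell", prompt="a red fox")
# urls = image_urls(resp.model_dump())
```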
Audio Models (2)
| Model | Brand | Type |
|---|---|---|
| whisper-large-v3 | OpenAI | Speech-to-Text |
| whisper-large-v3-turbo | OpenAI | Speech-to-Text (fast) |
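A transcription sketch using the OpenAI SDK's audio endpoint; the key and file path are placeholders, and the import is deferred so the snippet loads without the SDK installed:

```python
def pick_whisper(fast: bool) -> str:
    """Choose between the accurate and the fast Whisper variant."""
    return "whisper-large-v3-turbo" if fast else "whisper-large-v3"

def transcribe(path: str, fast: bool = False) -> str:
    from openai import OpenAI  # deferred: only needed when actually calling
    client = OpenAI(
        api_key="sk-mel-your-api-key-here",
        base_url="https://api.melious.ai/v1",
    )
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(
            model=pick_whisper(fast), file=f
        ).text
```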
All of the models above are open-source and hosted on European infrastructure. See Models for detailed capabilities.
Error Handling
All inference errors follow OpenAI's error format:
```json
{
  "error": {
    "message": "Insufficient energy balance",
    "type": "insufficient_quota",
    "code": "BILLING_INSUFFICIENT_ENERGY"
  }
}
```

Common Error Codes
| Code | HTTP Status | Description |
|---|---|---|
| AUTH_INVALID_API_KEY | 401 | Invalid or missing API key |
| AUTH_INSUFFICIENT_PERMISSIONS | 403 | API key lacks required scope |
| BILLING_INSUFFICIENT_ENERGY | 403 | Not enough energy balance |
| VALIDATION_INVALID_VALUE | 400 | Invalid parameter value |
| VALIDATION_REQUIRED_FIELD | 400 | Missing required field |
| INFERENCE_PROVIDER_ERROR | 502 | Provider request failed |
| INFERENCE_NO_PROVIDERS_AVAILABLE | 503 | No providers match filters |
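The stable `code` field is the one worth branching on: the 502/503 inference errors are transient, while auth, billing, and validation errors will not succeed on retry. A minimal sketch:

```python
RETRYABLE = {"INFERENCE_PROVIDER_ERROR", "INFERENCE_NO_PROVIDERS_AVAILABLE"}

def is_retryable(error_body: dict) -> bool:
    """True for transient upstream failures worth retrying."""
    return error_body.get("error", {}).get("code") in RETRYABLE
```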
Rate Limits
| Tier | Requests/Minute | Requests/Day |
|---|---|---|
| Free | 60 | 5,000 |
| Pro | 300 | 50,000 |
| Enterprise | Custom | Custom |
Rate limit headers are included in all responses:
```
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1699999999
```

Best Practices
- Use streaming for long responses to improve perceived latency
- Set `max_tokens` to avoid unexpectedly long responses
- Handle rate limits with exponential backoff
- Cache embeddings for frequently used texts
- Use routing presets to optimize for your priorities
- Monitor environmental impact to track sustainability
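The backoff advice above can be sketched as a small retry wrapper (full-jitter exponential backoff; the parameters are illustrative):

```python
import random
import time

def with_backoff(call, max_retries: int = 5, base: float = 1.0, cap: float = 30.0):
    """Retry `call` on any exception, sleeping base * 2^attempt seconds
    (capped, with full jitter) between attempts; re-raises after the last try."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(min(cap, base * 2 ** attempt) * random.random())
```

Usage: `with_backoff(lambda: client.chat.completions.create(...))`.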
See Also
- Getting Started - Quick start guide
- Tools - AI tools and utilities