# Rate limits

What throttles a Melious account, and how to read the headers.
Rate limits are set per account, not per key. A test key and a production key share the same bucket; creating more keys doesn't multiply your quota.
## Tiers
Exact numbers live on melious.ai/pricing because they change when we add plans. The shape is:
- Free — small per-minute and per-day caps, enough to try things out and run a hobby project.
- Pro — meaningfully higher caps for most production workloads.
- Enterprise — negotiated per account; typically removes the per-minute ceiling and applies a per-hour budget instead.
If you're hitting the free tier's ceiling and not ready to upgrade, the usual move is to add `:batch` to non-critical model IDs; those go through a separate path that's cheaper and less rate-sensitive. See Routing.
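If you route by model ID in your own code, the suffixing can be a one-line helper. This is a sketch, not part of the Melious SDK; `to_batch_model` is a hypothetical name, and the only fact it relies on is the `:batch` suffix convention described above.

```python
def to_batch_model(model_id: str) -> str:
    """Append ':batch' to a model ID so the call takes the batch path.

    Idempotent: IDs that already carry the suffix are returned unchanged.
    """
    return model_id if model_id.endswith(":batch") else f"{model_id}:batch"
```

For example, `to_batch_model("glm-4.7")` returns `"glm-4.7:batch"`, and calling it again on the result is a no-op.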
## Reading the headers
Every response carries the current limit state:
| Header | Meaning |
|---|---|
| `X-RateLimit-Limit-Requests` | Your per-minute request ceiling. |
| `X-RateLimit-Remaining-Requests` | How many requests you have left in the current window. |
| `X-RateLimit-Reset-Requests` | Seconds until the counter resets. |
| `Retry-After` | Set only on a `429`: wait this many seconds before retrying. Otherwise absent. |
If you're building agents or pipelines, read these on every response. They're cheap to parse and they save you from rediscovering the limit in production.
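Parsing them can be a small, pure function you call on every response's headers. The header names come from the table above; the helper itself, its names, and the 10% pacing threshold are illustrative assumptions, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RateLimitState:
    limit: int            # per-minute request ceiling
    remaining: int        # requests left in the current window
    reset_seconds: float  # seconds until the counter resets


def read_rate_limit(headers) -> Optional[RateLimitState]:
    """Parse the X-RateLimit-* headers off a response; None if absent."""
    try:
        return RateLimitState(
            limit=int(headers["X-RateLimit-Limit-Requests"]),
            remaining=int(headers["X-RateLimit-Remaining-Requests"]),
            reset_seconds=float(headers["X-RateLimit-Reset-Requests"]),
        )
    except (KeyError, ValueError):
        return None


def should_pause(state: RateLimitState) -> bool:
    """Example pacing rule: back off when under 10% of the window remains."""
    return state.remaining <= state.limit * 0.1
```

With the `openai` client, `headers` would come from a raw-response call; any mapping of header name to string value works.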
## When you're throttled
A request over the limit returns `429 Too Many Requests` with `AUTH_1028` in the error body. The `Retry-After` header tells you when it's safe to try again.
Recommended retry shape:

```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)


def call_with_retry(messages, attempts=5):
    delay = 1.0
    for _ in range(attempts):
        try:
            return client.chat.completions.create(
                model="glm-4.7",
                messages=messages,
            )
        except RateLimitError as e:
            # Prefer the server's Retry-After hint; fall back to backoff.
            retry_after = e.response.headers.get("Retry-After")
            sleep_for = float(retry_after) if retry_after else delay
            time.sleep(sleep_for)
            delay = min(delay * 2, 30)
    raise RuntimeError("rate-limit retries exhausted")
```

Exponential backoff, cap the delay, and stop after a handful of attempts rather than looping forever. This pattern also handles transient upstream errors; see Errors.
## Patterns that work
A few things we've seen from production users:
- Run a small concurrency cap on your end, not the maximum the plan allows. If your plan's limit is 300 requests/minute, running at 250 gives the system room to absorb bursts without 429ing you.
- Split critical and bulk traffic onto separate keys. Same bucket, but separate keys let you revoke the bulk key independently if it runs away.
- For batch-style work, use `/v1/batches`. It bypasses the per-minute counter and costs less per token. See Batch workflow.
## If you need more
We don't have an "unlimited" tier, and we're honest about why: if every request mattered equally, rate limits would be optional. They aren't.
Enterprise plans are negotiated. Email sales if you need a higher ceiling, a per-hour budget shape, or dedicated capacity on a specific provider.
## Related
Errors beyond 429 (billing, auth, model capability): Errors. Asynchronous bulk work with a separate quota path: Batches.