
Rate limits

What throttles a Melious account, and how to read the headers

Rate limits are set per account, not per key. A test key and a production key share the same bucket — creating more keys doesn't multiply your quota.

Tiers

Exact numbers live on melious.ai/pricing because they change when we add plans. The shape is:

  • Free — small per-minute and per-day caps, enough to try things out and run a hobby project.
  • Pro — meaningfully higher caps for most production workloads.
  • Enterprise — negotiated per account; typically removes the per-minute ceiling and applies a per-hour budget instead.

If you're hitting the free tier's ceiling and not ready to upgrade, the usual move is to add :batch to non-critical model IDs — those go through a separate path that's cheaper and less rate-sensitive. See Routing.
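
For example, using the OpenAI-compatible client shown later on this page (a minimal sketch; the prompt is illustrative, and the :batch suffix follows the convention described above):

from openai import OpenAI

client = OpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)

# The :batch suffix routes this call through the cheaper, less rate-sensitive path.
response = client.chat.completions.create(
    model="glm-4.7:batch",
    messages=[{"role": "user", "content": "Summarize yesterday's error logs."}],
)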

Reading the headers

Every response carries the current limit state:

Header                            Meaning
X-RateLimit-Limit-Requests        Your per-minute request ceiling.
X-RateLimit-Remaining-Requests    How many requests you have left in this window.
X-RateLimit-Reset-Requests        Seconds until the counter resets.
Retry-After                       On a 429, wait this many seconds before retrying. Absent otherwise.

If you're building agents or pipelines, read these on every response. They're cheap to parse and they save you from rediscovering the limit in production.
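
One way to do that with the OpenAI Python SDK is the with_raw_response accessor, which exposes the HTTP headers alongside the parsed completion (a sketch; the client setup matches the retry example below):

from openai import OpenAI

client = OpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)

raw = client.chat.completions.with_raw_response.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "ping"}],
)

# Check the limit state before doing anything else with the response.
remaining = raw.headers.get("X-RateLimit-Remaining-Requests")
reset_in = raw.headers.get("X-RateLimit-Reset-Requests")
print(f"{remaining} requests left; window resets in {reset_in}s")

completion = raw.parse()  # the usual ChatCompletion object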

When you're throttled

A request over the limit returns 429 Too Many Requests with AUTH_1028 in the error body. The Retry-After header tells you when it's safe to try again.

Recommended retry shape:

import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)

def call_with_retry(messages, attempts=5):
    delay = 1.0
    for _ in range(attempts):
        try:
            return client.chat.completions.create(
                model="glm-4.7",
                messages=messages,
            )
        except RateLimitError as e:
            # Prefer the server's Retry-After header; fall back to our own backoff.
            retry_after = e.response.headers.get("Retry-After")
            time.sleep(float(retry_after) if retry_after else delay)
            delay = min(delay * 2, 30)  # exponential backoff, capped at 30s
    raise RuntimeError("rate-limit retries exhausted")

Use exponential backoff, cap the delay, and stop after a handful of attempts rather than looping forever. This pattern also handles transient upstream errors — see Errors.
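
Usage is the same as a direct call (the prompt here is illustrative):

result = call_with_retry([{"role": "user", "content": "Classify this ticket."}])
print(result.choices[0].message.content)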

Patterns that work

A few things we've seen from production users:

  • Cap concurrency on your end below what the plan allows. If your plan's limit is 300 requests/minute, running at 250 gives the system room to absorb bursts without 429ing you; see the sketch after this list.
  • Split critical and bulk traffic onto separate keys. Same bucket, but scoping lets you cancel the bulk key independently if it runs away.
  • For batch-style work, use /v1/batches. It bypasses the per-minute counter and pays less per token. See Batch workflow.
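
A sketch of the client-side concurrency cap from the first bullet, assuming the async OpenAI client against the same base URL (MAX_IN_FLIGHT is illustrative; tune it against your plan's ceiling):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="sk-mel-<YOUR_API_KEY>",
    base_url="https://api.melious.ai/v1",
)

# Keep in-flight requests well below the plan limit so bursts have headroom.
MAX_IN_FLIGHT = 16
semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

async def bounded_call(messages):
    async with semaphore:
        return await client.chat.completions.create(
            model="glm-4.7",
            messages=messages,
        )

async def main(jobs):
    # gather() fires everything at once; the semaphore bounds actual concurrency.
    return await asyncio.gather(*(bounded_call(m) for m in jobs))

# asyncio.run(main(list_of_message_lists))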

If you need more

We don't have an "unlimited" tier, and we're honest about why — if every request mattered equally, rate limits would be optional. They aren't.

Enterprise plans are negotiated. Email sales if you need a higher ceiling, a per-hour budget shape, or dedicated capacity on a specific provider.

Errors beyond 429 (billing, auth, model capability): Errors. Asynchronous bulk work with a separate quota path: Batches.
