Errors

An error response always includes an HTTP status, a structured body, and enough information to decide whether to retry or fix your request.

Shape

{
  "error": {
    "code": "INFERENCE_3207",
    "message": "Input exceeds the model's context window",
    "details": { "context_length": 131072, "input_tokens": 164228 }
  }
}

code is the stable identifier — match on it, not on the message. Messages are for humans and may be edited. details is optional and varies per error.
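A minimal sketch of matching on code rather than message, using only the standard library. The helper name `parse_error` is made up for this example, and the sample body is the one shown above.

```python
import json
from typing import Optional, Tuple


def parse_error(raw_body: str) -> Tuple[Optional[str], str, dict]:
    """Return (code, message, details) from an error response body.

    Falls back to (None, raw_body, {}) if the body is not the documented shape.
    """
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return None, raw_body, {}
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return None, raw_body, {}
    # details is optional, so normalise a missing value to an empty dict.
    return error.get("code"), error.get("message", ""), error.get("details") or {}


sample = json.dumps({
    "error": {
        "code": "INFERENCE_3207",
        "message": "Input exceeds the model's context window",
        "details": {"context_length": 131072, "input_tokens": 164228},
    }
})

code, message, details = parse_error(sample)
if code == "INFERENCE_3207":  # branch on the code, never on the message text
    overflow = details["input_tokens"] - details["context_length"]
    print(f"Trim at least {overflow} tokens")  # prints: Trim at least 33156 tokens
```

Because details varies per error, read it with `.get()` unless you have already matched a code whose details you know.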

Code ranges

Codes are prefixed by service with reserved number ranges:

Prefix	Range	What it covers
`AUTH_`	1xxx	Authentication, API keys, sessions, scopes, vault
`BILLING_`	2xxx	Quota, subscriptions, credits, energy, withdrawal
`INFERENCE_`	3xxx	Model capability, provider failures, routing, context
`VALIDATION_`	4xxx	Request shape, formats, ranges, missing fields
`SYSTEM_`	9xxx	Internal — usually temporary

A few others exist (resource, kit, engine, hub) but most API callers only see the five above.
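If you want to route errors by service, the prefix and range can be split out of the code string. This is a sketch that assumes every code is an uppercase prefix, an underscore, and four digits, which holds for the examples on this page but isn't a documented guarantee. `split_code` and `in_documented_range` are illustrative names.

```python
import re
from typing import Optional, Tuple

# Prefix -> leading digit of its reserved range, copied from the table above.
RANGES = {"AUTH": 1, "BILLING": 2, "INFERENCE": 3, "VALIDATION": 4, "SYSTEM": 9}

# Assumed code shape: PREFIX_NNNN.
_CODE_RE = re.compile(r"^([A-Z]+)_(\d{4})$")


def split_code(code: str) -> Optional[Tuple[str, int]]:
    """Split 'INFERENCE_3207' into ('INFERENCE', 3207); None if the shape is unexpected."""
    match = _CODE_RE.match(code)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def in_documented_range(code: str) -> bool:
    """True if the code uses one of the five prefixes above and sits in its range."""
    parsed = split_code(code)
    if parsed is None:
        return False
    prefix, number = parsed
    lead = RANGES.get(prefix)
    return lead is not None and number // 1000 == lead
```

A False result only means the code isn't one of the five services listed here. It may still be a valid code from one of the other prefixes.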

The common ones

Code	Status	Meaning	Action
`AUTH_1007`	401	No API key	Add `Authorization: Bearer <key>` or `x-api-key`.
`AUTH_1008`	401	Invalid API key	Key is malformed, expired, or revoked. Create a new one.
`AUTH_1015`	403	Insufficient scope	The key exists but can't call this endpoint. `details.required_scope` tells you which.
`AUTH_1028`	429	Rate-limited	Back off per `Retry-After`. See Rate limits.
`BILLING_2001`	402	Out of energy	Top up, or switch to a plan with `api_usage`.
`BILLING_2003`	402	Out of credits	Same as above, in credits.
`INFERENCE_3001`	404	Unknown model ID	Check `GET /v1/models`. Suffixes like `:speed` must match VALID_FLAVORS.
`INFERENCE_3103`	502	All providers failed	Transient. Retry.
`INFERENCE_3104`	503	No providers available	Your filters excluded every provider. Loosen them.
`INFERENCE_3107`	504	Upstream timeout	Transient. Retry.
`INFERENCE_3108`	429	Provider rate-limited	Transient. Retry with backoff.
`INFERENCE_3201`	400	Model has no vision	Pick a vision-capable model — check `_meta.capabilities.vision` on `GET /v1/models`.
`INFERENCE_3202`	400	Model has no tools	Same — check `_meta.capabilities.function_calling`.
`INFERENCE_3207`	400	Context window exceeded	Trim the prompt or pick a model with more headroom.
`INFERENCE_3208`	400	Content rejected by safety filter	Cannot be retried verbatim.
`VALIDATION_4002`	400	Missing required field	`details.field` names it.
`VALIDATION_4005`	400	Invalid format	`details.field` names it.
`VALIDATION_4008`	400	Quota exceeded	Plan limit hit — see `details`.

This isn't the whole enum. The full list is in services/shared/exceptions.py — we don't paste it here because it grows faster than we can update docs.
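Several rows in the table point at a details key. A small sketch of turning those into log-friendly hints follows. The function name `describe` and the example scope value `chat:write` are hypothetical; the details keys (`required_scope`, `field`) are the ones named in the table.

```python
def describe(code: str, details: dict) -> str:
    """Build a short, actionable hint from an error code and its details."""
    if code == "AUTH_1015":
        # details.required_scope names the scope the key is missing.
        scope = details.get("required_scope", "<unknown>")
        return f"API key lacks scope: {scope}"
    if code in ("VALIDATION_4002", "VALIDATION_4005"):
        # details.field names the offending field for both codes.
        field = details.get("field", "<unknown>")
        problem = "is missing" if code == "VALIDATION_4002" else "has an invalid format"
        return f"Field {field!r} {problem}"
    # Anything else: fall back to the code itself, which is stable.
    return f"Unhandled error code {code}"


print(describe("AUTH_1015", {"required_scope": "chat:write"}))  # hypothetical scope name
print(describe("VALIDATION_4002", {"field": "model"}))
```

The `.get()` defaults matter because details is optional, even for codes that usually carry it.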

Transient vs terminal

A rough rule that holds in practice:

Retry INFERENCE_3103, INFERENCE_3105 (provider error), INFERENCE_3107 (timeout), INFERENCE_3108 (provider rate limit), AUTH_1028 (our rate limit), and any 9xxx system error. Use exponential backoff with a cap.
Don't retry billing errors, validation errors, capability errors (INFERENCE_3201–3207), content-policy rejections (INFERENCE_3208), or auth errors other than AUTH_1028. Retrying won't change the outcome; fix the request.
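The rule above fits in a small classifier plus a capped backoff schedule. This is a sketch, not a contract. The names `is_retryable` and `backoff_delay` are illustrative, and the HTTP-status fallback for responses with no code is our assumption rather than documented behavior. So are the base, factor, and cap values.

```python
from typing import Optional

RETRYABLE = frozenset({
    "INFERENCE_3103",  # all providers failed
    "INFERENCE_3105",  # provider error
    "INFERENCE_3107",  # upstream timeout
    "INFERENCE_3108",  # provider rate limit
    "AUTH_1028",       # our rate limit
})


def is_retryable(code: Optional[str], status: Optional[int] = None) -> bool:
    """Apply the transient/terminal rule; any SYSTEM_ (9xxx) code counts as transient."""
    if code:
        return code in RETRYABLE or code.startswith("SYSTEM_")
    # Assumption: with no structured code, fall back to the HTTP status.
    return status is not None and (status == 429 or status >= 500)


def backoff_delay(attempt: int, base: float = 0.5, factor: float = 2.0, cap: float = 30.0) -> float:
    """Exponential backoff in seconds, capped so the wait never exceeds `cap`."""
    return min(cap, base * factor ** attempt)
```

With these defaults the waits run 0.5, 1, 2, 4 seconds and so on, levelling off at 30. Add jitter on top if many clients may retry in lockstep.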

We're still working on surfacing some upstream provider errors more cleanly — today a provider-side timeout can come through as INFERENCE_3105 with the provider's raw message in details. If that's rough, tell us.

A retry pattern

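import random
import time

from openai import APIStatusError, OpenAI

TRANSIENT = {"INFERENCE_3103", "INFERENCE_3105", "INFERENCE_3107", "INFERENCE_3108", "AUTH_1028"}
MAX_ATTEMPTS = 6
MAX_BACKOFF = 30.0  # seconds; applies to computed backoff, not to Retry-After


def error_code(e: APIStatusError):
    # The SDK may expose either the full payload or the inner "error" object
    # as e.body, so accept both shapes.
    body = e.body if isinstance(e.body, dict) else {}
    error = body.get("error", body)
    return error.get("code") if isinstance(error, dict) else None


def is_transient(e: APIStatusError) -> bool:
    code = error_code(e)
    if code:
        return code in TRANSIENT or code.startswith("SYSTEM_")  # any 9xxx system error
    # No structured code (e.g. a bare gateway error): fall back to the HTTP status.
    return e.status_code == 429 or e.status_code >= 500


def call_with_retry(client: OpenAI, **kwargs):
    # Build the client with max_retries=0 so the SDK's own retries don't stack on these.
    for attempt in range(MAX_ATTEMPTS):
        try:
            return client.chat.completions.create(**kwargs)
        except APIStatusError as e:  # RateLimitError is a subclass
            if not is_transient(e):
                raise  # terminal: fix the request instead of retrying
            if attempt == MAX_ATTEMPTS - 1:
                raise RuntimeError("transient retries exhausted") from e

            try:
                # Retry-After in its seconds form; honored as sent.
                sleep_for = float(e.response.headers.get("Retry-After"))
            except (TypeError, ValueError):
                # Header absent or an HTTP date: exponential backoff with jitter, capped.
                sleep_for = min(0.5 * (1.5 ** attempt) + random.uniform(0, 0.25), MAX_BACKOFF)

            time.sleep(sleep_for)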

Three things this gets right: it honors Retry-After, it splits terminal from transient by code, and it caps the backoff delay so a long outage doesn't eat your process.

Response-level backoff headers: Rate limits. Authentication-specific errors and scope setup: Authentication. Picking a model with the capability your request needs: Models.