Errors
Error-code structure, the common ones, and retry patterns that work
An error response always includes an HTTP status, a structured body, and enough information to decide whether to retry or fix your request.
Shape
{
  "error": {
    "code": "INFERENCE_3207",
    "message": "Input exceeds the model's context window",
    "details": { "context_length": 131072, "input_tokens": 164228 }
  }
}

code is the stable identifier: match on it, not on the message. Messages are for humans and may be edited. details is optional and varies per error.
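As a rough illustration of matching on code rather than message, the sketch below parses the example body shown above. The ApiError class and parse_error helper are hypothetical names introduced here, not part of any SDK.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ApiError:
    """Hypothetical container for the documented error shape."""
    code: str
    message: str
    details: dict[str, Any] = field(default_factory=dict)


def parse_error(raw: str) -> ApiError:
    """Parse a raw JSON error body into an ApiError."""
    err = json.loads(raw)["error"]
    return ApiError(
        code=err["code"],
        message=err.get("message", ""),
        details=err.get("details") or {},  # details is optional
    )


raw_body = """
{
  "error": {
    "code": "INFERENCE_3207",
    "message": "Input exceeds the model's context window",
    "details": { "context_length": 131072, "input_tokens": 164228 }
  }
}
"""

err = parse_error(raw_body)
# Branch on the stable code, never on the human-readable message.
if err.code == "INFERENCE_3207":
    overflow = err.details["input_tokens"] - err.details["context_length"]
    print(f"trim at least {overflow} tokens")  # prints: trim at least 33156 tokens
```

Because details varies per error, the lookup of input_tokens and context_length is only safe inside the branch for the code that documents them.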
Code ranges
Codes are prefixed by service with reserved number ranges:
| Prefix | Range | What it covers |
|---|---|---|
| AUTH_ | 1xxx | Authentication, API keys, sessions, scopes, vault |
| BILLING_ | 2xxx | Quota, subscriptions, credits, energy, withdrawal |
| INFERENCE_ | 3xxx | Model capability, provider failures, routing, context |
| VALIDATION_ | 4xxx | Request shape, formats, ranges, missing fields |
| SYSTEM_ | 9xxx | Internal, usually temporary |
A few others exist (resource, kit, engine, hub) but most API callers only see the five above.
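If you log or route errors by service, the prefix and number range can be split apart mechanically. The sketch below is one way to do that; it assumes every code is a prefix, an underscore, and a four-digit number, as in the examples on this page. LEADING_DIGIT, split_code, and in_documented_range are hypothetical names.

```python
# Leading digit of the reserved range for each of the five documented prefixes.
LEADING_DIGIT = {
    "AUTH": 1,
    "BILLING": 2,
    "INFERENCE": 3,
    "VALIDATION": 4,
    "SYSTEM": 9,
}


def split_code(code: str) -> tuple[str, int]:
    """Split 'INFERENCE_3207' into ('INFERENCE', 3207)."""
    prefix, _, number = code.rpartition("_")
    return prefix, int(number)


def in_documented_range(code: str) -> bool:
    """True if the code uses one of the five prefixes and its matching range."""
    prefix, _, number = code.rpartition("_")
    if not number.isdigit():
        return False
    return LEADING_DIGIT.get(prefix) == int(number) // 1000


print(split_code("INFERENCE_3207"))
print(in_documented_range("INFERENCE_3207"))
```

Codes from the less common services (resource, kit, engine, hub) are not covered by this table, so in_documented_range returns False for them; treat that as "unknown here", not "invalid".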
The common ones
| Code | Status | Meaning | Action |
|---|---|---|---|
| AUTH_1007 | 401 | No API key | Add Authorization: Bearer <key> or x-api-key. |
| AUTH_1008 | 401 | Invalid API key | Key is malformed, expired, or revoked. Create a new one. |
| AUTH_1015 | 403 | Insufficient scope | The key exists but can't call this endpoint. details.required_scope tells you which. |
| AUTH_1028 | 429 | Rate-limited | Back off per Retry-After. See Rate limits. |
| BILLING_2001 | 402 | Out of energy | Top up, or switch to a plan with api_usage. |
| BILLING_2003 | 402 | Out of credits | Same as above, in credits. |
| INFERENCE_3001 | 404 | Unknown model ID | Check GET /v1/models. Suffixes like :speed must match VALID_FLAVORS. |
| INFERENCE_3103 | 502 | All providers failed | Transient. Retry. |
| INFERENCE_3104 | 503 | No providers available | Your filters excluded every provider. Loosen them. |
| INFERENCE_3107 | 504 | Upstream timeout | Transient. Retry. |
| INFERENCE_3108 | 429 | Provider rate-limited | Transient. Retry with backoff. |
| INFERENCE_3201 | 400 | Model has no vision | Pick a vision-capable model; check _meta.capabilities.vision on GET /v1/models. |
| INFERENCE_3202 | 400 | Model has no tools | Same; check _meta.capabilities.function_calling. |
| INFERENCE_3207 | 400 | Context window exceeded | Trim the prompt or pick a model with more headroom. |
| INFERENCE_3208 | 400 | Content rejected by safety filter | Cannot be retried verbatim. |
| VALIDATION_4002 | 400 | Missing required field | details.field names it. |
| VALIDATION_4005 | 400 | Invalid format | details.field names it. |
| VALIDATION_4008 | 400 | Quota exceeded | Plan limit hit; see details. |
This isn't the whole enum. The full list is in services/shared/exceptions.py — we don't paste it here because it grows faster than we can update docs.
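For the two capability errors in the table (INFERENCE_3201 and INFERENCE_3202), the fix is to filter models by _meta.capabilities before sending the request. The sketch below shows that filter on an already-fetched payload. It assumes the GET /v1/models response lists models under a top-level "data" key, which this page does not confirm; models_with and the model IDs are hypothetical.

```python
from typing import Any


def models_with(models_payload: dict[str, Any], capability: str) -> list[str]:
    """Return IDs of models whose _meta.capabilities[capability] is truthy."""
    ids = []
    # Assumption: the model list lives under "data".
    for model in models_payload.get("data", []):
        caps = model.get("_meta", {}).get("capabilities", {})
        if caps.get(capability):
            ids.append(model["id"])
    return ids


# Made-up payload for illustration; real model IDs will differ.
sample = {
    "data": [
        {
            "id": "example-text-model",
            "_meta": {"capabilities": {"vision": False, "function_calling": True}},
        },
        {
            "id": "example-vision-model",
            "_meta": {"capabilities": {"vision": True, "function_calling": True}},
        },
    ]
}

print(models_with(sample, "vision"))
print(models_with(sample, "function_calling"))
```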
Transient vs terminal
A rough rule that holds in practice:
- Retry INFERENCE_3103, INFERENCE_3105 (provider error), INFERENCE_3107 (timeout), INFERENCE_3108 (provider rate limit), AUTH_1028 (our rate limit), and any 9xxx system error. Use exponential backoff with a cap.
- Don't retry auth errors other than 1028, billing errors, validation errors, capability errors (INFERENCE_3201–3207), and content-policy rejections. Retrying won't change the outcome; fix the request.
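The rule above fits in a small predicate. This is a minimal sketch, assuming all 9xxx codes carry the SYSTEM_ prefix as the code-range table indicates; is_retryable and RETRYABLE_CODES are hypothetical names, and SYSTEM_9001 is a made-up code used only to exercise the prefix check.

```python
# The individually named transient codes from the rule above.
RETRYABLE_CODES = frozenset({
    "INFERENCE_3103",  # all providers failed
    "INFERENCE_3105",  # provider error
    "INFERENCE_3107",  # upstream timeout
    "INFERENCE_3108",  # provider rate limit
    "AUTH_1028",       # platform rate limit
})


def is_retryable(code: str) -> bool:
    """True for the named transient codes and for any SYSTEM_ (9xxx) code."""
    return code in RETRYABLE_CODES or code.startswith("SYSTEM_")


for code in ("INFERENCE_3107", "SYSTEM_9001", "AUTH_1008", "INFERENCE_3207"):
    print(code, "retry" if is_retryable(code) else "fix the request")
```

Keeping the decision in one function means the retry loop, logging, and alerting all agree on what counts as transient.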
We're still working on surfacing some upstream provider errors more cleanly — today a provider-side timeout can come through as INFERENCE_3105 with the provider's raw message in details. If that's rough, tell us.
A retry pattern
import random
import time

from openai import APIStatusError, OpenAI

# Codes worth retrying; every other known code is terminal.
TRANSIENT = {"INFERENCE_3103", "INFERENCE_3105", "INFERENCE_3107", "INFERENCE_3108", "AUTH_1028"}


def error_code(e: APIStatusError):
    """Return the platform error code, or None if the body carries none."""
    body = e.body if isinstance(e.body, dict) else {}
    # The SDK may hand over the whole body or only the inner "error" object.
    err = body.get("error", body)
    return err.get("code") if isinstance(err, dict) else None


def call_with_retry(client: OpenAI, **kwargs):
    delay = 0.5
    for attempt in range(6):
        try:
            return client.chat.completions.create(**kwargs)
        except APIStatusError as e:  # RateLimitError is a subclass
            code = error_code(e)
            # Any 9xxx system error is transient too.
            if code and code not in TRANSIENT and not code.startswith("SYSTEM_"):
                raise  # terminal: don't retry
            retry_after = e.response.headers.get("Retry-After")
            try:
                sleep_for = float(retry_after)  # handles the seconds form only
            except (TypeError, ValueError):
                # Header missing or not a number: fall back to backoff with jitter.
                sleep_for = delay * (1.5 ** attempt) + random.uniform(0, 0.25)
            time.sleep(min(sleep_for, 30))
    raise RuntimeError("transient retries exhausted")

Three things this gets right: it honors Retry-After, respects the terminal/transient split by code, and caps the delay so a long outage doesn't eat your process.
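To see what the loop above costs in wall-clock time, it helps to work out the backoff schedule without jitter. The constants below mirror the numbers in the retry pattern (base delay 0.5 s, factor 1.5, six attempts, 30 s cap); the constant names are introduced here for illustration.

```python
BASE_DELAY = 0.5   # seconds, first backoff step
FACTOR = 1.5       # growth per attempt
CAP = 30.0         # upper bound on any single sleep
ATTEMPTS = 6

# Sleep before each retry when no Retry-After header is present, jitter excluded.
schedule = [min(BASE_DELAY * FACTOR ** n, CAP) for n in range(ATTEMPTS)]
total = sum(schedule)

print(schedule)  # [0.5, 0.75, 1.125, 1.6875, 2.53125, 3.796875]
print(total)     # 10.390625
```

With these numbers the computed backoff never reaches the 30 s cap, so the cap only comes into play for large Retry-After values. Jitter adds at most 0.25 s per attempt, so a run with no Retry-After header sleeps for under 12 s in total, while a run where every response asks for 30 s or more sleeps for at most 180 s.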
Related
Response-level backoff headers: Rate limits. Authentication-specific errors and scope setup: Authentication. Picking a model with the capability your request needs: Models.