Batches
/v1/batches — asynchronous batch jobs for non-critical work
Run many inference requests asynchronously in exchange for more flexible capacity and routing that favors cheaper providers. The Batches API is OpenAI-compatible.
Batches don't run in real time — the /v1/batches endpoints create and monitor jobs, but the actual work happens in the background and completes within the completion_window (typically 24 hours). For the end-to-end flow, see Batch workflow.
Base path: /v1/batches
Auth: Bearer token or x-api-key. Create requires the scope of the target endpoint (inference.chat, inference.embeddings, …). List/get/cancel require any inference scope.
Create a batch
POST /v1/batches

Request
| Parameter | Type | Required | Description |
|---|---|---|---|
| input_file_id | string | yes | ID of a file uploaded via POST /v1/files with purpose: "batch". |
| endpoint | string | yes | Target endpoint: /v1/chat/completions, /v1/embeddings, /v1/images/generations, or /v1/audio/speech. |
| completion_window | string | no (default "24h") | How long the system has to complete the job. |
| metadata | object | no | Up to 16 custom key–value pairs for your own tracking. |
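The input file has to exist before the batch is created. A minimal upload sketch, assuming the Files endpoint accepts the usual multipart form with file and purpose fields (see Files for the exact parameters):

```bash
# Upload the JSONL input with purpose "batch"; the returned file id
# becomes input_file_id in the create call below.
curl https://api.melious.ai/v1/files \
  -H "Authorization: Bearer sk-mel-<YOUR_API_KEY>" \
  -F purpose="batch" \
  -F file="@requests.jsonl"
```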
Example:
```bash
curl https://api.melious.ai/v1/batches \
  -H "Authorization: Bearer sk-mel-<YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file_abc123",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h",
    "metadata": {"run": "nightly-classifier-2026-04-22"}
  }'
```

Response
```json
{
  "id": "batch_abc123",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "errors": null,
  "input_file_id": "file_abc123",
  "completion_window": "24h",
  "status": "queued",
  "output_file_id": null,
  "error_file_id": null,
  "created_at": 1699999999,
  "in_progress_at": null,
  "expires_at": 1700086399,
  "finalizing_at": null,
  "completed_at": null,
  "failed_at": null,
  "expired_at": null,
  "request_counts": { "processing": 0, "succeeded": 0, "errored": 0 },
  "metadata": {"run": "nightly-classifier-2026-04-22"}
}
```

Status values
- queued — accepted, not yet started.
- in_progress — currently processing.
- finalizing — all requests done, writing output file.
- succeeded — complete, results in output_file_id.
- failed — terminal failure at batch level.
- expired — window elapsed before completion.
- cancelling / cancelled — you called cancel.
List batches
GET /v1/batches?limit=20&after=<cursor>

Paginated list of your batches.
| Query param | Type | Default | Description |
|---|---|---|---|
| limit | integer | 20 | Max 100. |
| after | string | none | Cursor returned by a previous call. |
Response:
```json
{
  "object": "list",
  "data": [ /* batch objects */ ],
  "first_id": "batch_...",
  "last_id": "batch_...",
  "has_more": true
}
```
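To walk the full list, feed last_id back in as after while has_more is true. A sketch, assuming jq is installed:

```bash
# Page through every batch, printing ids; the cursor comes from last_id.
after=""
while :; do
  page=$(curl -s "https://api.melious.ai/v1/batches?limit=100${after:+&after=$after}" \
    -H "Authorization: Bearer sk-mel-<YOUR_API_KEY>")
  echo "$page" | jq -r '.data[].id'
  [ "$(echo "$page" | jq -r '.has_more')" = "true" ] || break
  after=$(echo "$page" | jq -r '.last_id')
done
```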
Retrieve a batch

GET /v1/batches/{batch_id}

Returns the batch object. Poll this for status; request_counts updates as work progresses.
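A minimal polling loop, assuming jq; pick an interval that suits a job that can run for hours:

```bash
# Poll until the batch reaches a terminal status.
batch_id="batch_abc123"
while :; do
  status=$(curl -s "https://api.melious.ai/v1/batches/$batch_id" \
    -H "Authorization: Bearer sk-mel-<YOUR_API_KEY>" | jq -r '.status')
  case "$status" in
    succeeded|failed|expired|cancelled) echo "terminal status: $status"; break ;;
    *) sleep 60 ;;
  esac
done
```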
Cancel a batch
POST /v1/batches/{batch_id}/cancel

Cancels a queued or in-progress batch. Returns the batch object with status cancelling (or cancelled if the cancellation was synchronous). Partially completed work is kept — you'll still get an output_file_id with whatever finished.
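For example:

```bash
curl -X POST https://api.melious.ai/v1/batches/batch_abc123/cancel \
  -H "Authorization: Bearer sk-mel-<YOUR_API_KEY>"
```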
Input file format (JSONL)
One request per line. Each line needs custom_id, method, url, and body:
{"custom_id": "q1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "glm-4.7", "messages": [{"role": "user", "content": "..."}]}}
{"custom_id": "q2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "glm-4.7", "messages": [{"role": "user", "content": "..."}]}}custom_id must be unique within the file and is echoed into the output so you can match responses back.
Model IDs in a batch file can include the :batch flavor suffix to explicitly pick the cheapest routing tier.
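For example, the first row above with the suffix added:

```jsonl
{"custom_id": "q1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "glm-4.7:batch", "messages": [{"role": "user", "content": "..."}]}}
```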
Output file format
One line per request in the input file, each in the same shape as the API's normal response, wrapped with your custom_id:
{"custom_id": "q1", "response": {"status_code": 200, "request_id": "...", "body": {"choices": [...], ...}}, "error": null}
{"custom_id": "q2", "response": null, "error": {"code": "INFERENCE_3207", "message": "Context window exceeded"}}Errors are per-row, not per-batch — a single bad request doesn't fail the whole job.
Pricing and rate limits
Batches run against the same per-token rates — there is no discount for running asynchronously. The savings come from the router favoring cheaper providers under the :batch flavor, and from batch traffic sitting outside the per-minute rate limits that apply to the real-time endpoints.
See Pricing and Rate limits.
Errors
- INFERENCE_3302 — input_file_id doesn't exist or belongs to a different user.
- INFERENCE_3303 — input file isn't purpose: "batch" or fails JSONL validation.
- INFERENCE_3304 — file exceeds the 105 MB cap.
- INFERENCE_3305 — too many requests in one file.
- INFERENCE_3306 — a JSONL line doesn't parse.
- VALIDATION_4007 — unsupported endpoint value.
- AUTH_1015 — key lacks the scope for the target endpoint.
Related
Files for uploading input and downloading output • Batch workflow for the full lifecycle • Routing for the :batch flavor.