
Batch workflow

Upload a JSONL file, kick off a batch, download the results

For non-critical work, batching is usually the right move — cheaper routing, outside the per-minute rate limit, and no need to hold open thousands of concurrent requests.

The pattern is: write a JSONL file of requests, upload it, create a batch, poll for completion, download the output.

Full example

This Python script takes a dict of prompts keyed by your custom IDs, submits them as one batch, and returns the results indexed by those same IDs. One file, end to end.

import json
import time
import httpx

API = "https://api.melious.ai/v1"
KEY = "sk-mel-<YOUR_API_KEY>"
HEAD = {"Authorization": f"Bearer {KEY}"}


def run_batch(prompts: dict[str, str], model: str = "glm-4.7:batch") -> dict[str, str]:
    # 1. Build JSONL — one request per line, with a custom_id we'll use to match results
    lines = [
        json.dumps({
            "custom_id": cid,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model, "messages": [{"role": "user", "content": prompt}]},
        })
        for cid, prompt in prompts.items()
    ]
    jsonl = "\n".join(lines).encode()

    # 2. Upload as a file with purpose=batch
    upload = httpx.post(
        f"{API}/files",
        headers=HEAD,
        files={"file": ("requests.jsonl", jsonl, "application/jsonl")},
        data={"purpose": "batch"},
        timeout=120,  # large uploads can blow past httpx's 5-second default timeout
    ).json()
    input_file_id = upload["id"]

    # 3. Create the batch
    batch = httpx.post(
        f"{API}/batches",
        headers={**HEAD, "Content-Type": "application/json"},
        json={
            "input_file_id": input_file_id,
            "endpoint": "/v1/chat/completions",
            "completion_window": "24h",
        },
    ).json()
    batch_id = batch["id"]
    print(f"batch {batch_id} queued")

    # 4. Poll until done
    while True:
        time.sleep(30)
        status = httpx.get(f"{API}/batches/{batch_id}", headers=HEAD).json()
        print(f"  status: {status['status']} "
              f"({status['request_counts']['succeeded']}/{sum(status['request_counts'].values())})")
        if status["status"] in ("succeeded", "failed", "expired", "cancelled"):
            break

    if status["status"] != "succeeded":
        raise RuntimeError(f"batch ended in status {status['status']}")

    # 5. Download the output JSONL and parse
    output_file_id = status["output_file_id"]
    # Output files can be large too; give the download a generous timeout
    body = httpx.get(f"{API}/files/{output_file_id}/content", headers=HEAD, timeout=120).text

    results = {}
    for line in body.strip().splitlines():
        entry = json.loads(line)
        if entry["error"]:
            results[entry["custom_id"]] = f"ERROR: {entry['error']['message']}"
        else:
            choice = entry["response"]["body"]["choices"][0]
            results[entry["custom_id"]] = choice["message"]["content"]
    return results


if __name__ == "__main__":
    prompts = {
        f"q{i}": f"In one sentence, why did Hanseatic city #{i} matter?"
        for i in range(10)
    }
    answers = run_batch(prompts)
    for cid, text in answers.items():
        print(f"{cid}: {text[:80]}")

Five steps, each mapping to one endpoint:

  1. JSONL build — each line is a full request shaped like the real endpoint's body, wrapped with a custom_id you choose.
  2. POST /v1/files — upload the JSONL with purpose=batch.
  3. POST /v1/batches — create the job pointing at the file and the target endpoint.
  4. GET /v1/batches/{id} — poll until status == "succeeded".
  5. GET /v1/files/{id}/content — download the output, match by custom_id.
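
For orientation, here's roughly what one input line and its matching output line look like, written as Python dicts with placeholder values; the field names come from the script above:

# One input line (what step 1 serializes with json.dumps)
{
    "custom_id": "q0",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {"model": "glm-4.7:batch", "messages": [{"role": "user", "content": "..."}]},
}

# The matching output line (what step 5 parses)
{
    "custom_id": "q0",
    "response": {"body": {"choices": [{"message": {"content": "..."}}]}},
    "error": None,  # populated with "message" and "code" when the row failed
}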

When to pick batch

Good fits:

  • Nightly classification, summarization, or extraction over a large set.
  • Backfills and reprocessing of historical data.
  • Evaluation runs.
  • Embedding a whole corpus (embeddings work over batch too; a sketch follows after these lists).

Bad fits:

  • Anything user-facing in real time.
  • Workloads with hard latency SLAs shorter than the completion_window.
  • Small runs (say, under 50 requests) — the per-minute rate limit is fine for those and you skip the upload/download dance.
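
Embedding batches only change the per-line url and body. A minimal sketch, assuming the endpoint is /v1/embeddings with a model + input body (the model name is a placeholder; check the references below for the exact schema):

import json

def embedding_lines(texts: dict[str, str], model: str = "<embedding-model>:batch") -> str:
    # Same wrapper as the chat example; only url and body differ per line
    return "\n".join(
        json.dumps({
            "custom_id": cid,
            "method": "POST",
            "url": "/v1/embeddings",  # assumed path
            "body": {"model": model, "input": text},
        })
        for cid, text in texts.items()
    )

The batch creation call then points endpoint at the same path, since a batch can only target one endpoint.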

Handling partial failures

Batches don't fail atomically — a single bad request returns an error for that row, and the rest still run. The output JSONL has error set for failed rows and response set for successes, both keyed by your custom_id.

Our example above coerces errors to a string prefix; in production you'd want something structured for retry logic:

if entry["error"]:
    retriable = entry["error"]["code"] in {"INFERENCE_3103", "INFERENCE_3107", "INFERENCE_3108"}
    results[entry["custom_id"]] = {"ok": False, "error": entry["error"], "retry": retriable}
else:
    results[entry["custom_id"]] = {"ok": True, "text": entry["response"]["body"]["choices"][0]["message"]["content"]}

For retry, build a new JSONL of just the retriable rows and run another batch.
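
A minimal sketch of that retry pass, assuming run_batch has been switched over to the structured results above and the original prompts dict is still in scope:

# Keep only the rows flagged retriable, keyed by the same custom_id
retry_prompts = {
    cid: prompts[cid]
    for cid, r in results.items()
    if not r["ok"] and r["retry"]
}

if retry_prompts:
    # Second pass: a fresh JSONL, upload, and batch containing only the failed rows
    results.update(run_batch(retry_prompts))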

Cost

Batch isn't cheaper per token today — there's no discount applied to batched requests. The savings come from the :batch flavor suffix routing to the cheapest providers, and from being outside the per-minute realtime rate limit. See Pricing.

Gotchas

  • 105 MB input cap. If you're running more than that per file, split into multiple batches (a splitting sketch follows after this list).
  • custom_id must be unique within a file and is echoed into the output. Use something you can match back reliably — UUIDs are fine, sequential IDs are fine.
  • The endpoint you pick constrains the file. A /v1/chat/completions batch can't mix in embedding requests. One endpoint per batch.
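
For the size cap, here's a sketch of splitting one large prompt set into several batches by encoded size and running them back to back with run_batch from the example (the 100 MB figure is just headroom under the cap, not a documented limit):

import json

MAX_BYTES = 100 * 1024 * 1024  # stay under the 105 MB cap with some headroom

def split_prompts(prompts: dict[str, str], model: str = "glm-4.7:batch") -> list[dict[str, str]]:
    # Greedily pack prompts into chunks whose encoded JSONL stays under MAX_BYTES
    chunks, current, size = [], {}, 0
    for cid, prompt in prompts.items():
        line = json.dumps({
            "custom_id": cid,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model, "messages": [{"role": "user", "content": prompt}]},
        })
        line_bytes = len(line.encode()) + 1  # +1 for the newline separator
        if current and size + line_bytes > MAX_BYTES:
            chunks.append(current)
            current, size = {}, 0
        current[cid] = prompt
        size += line_bytes
    if current:
        chunks.append(current)
    return chunks

results = {}
for chunk in split_prompts(prompts):
    results.update(run_batch(chunk))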

See also:

  • Batches reference
  • Files reference
  • Routing for the :batch flavor
