LlamaIndex

RAG pipelines and agent loops on Melious: OpenAILike clients for chat models and embeddings, plus reranking options

LlamaIndex (originally GPT Index) is a Python and TypeScript framework focused on retrieval and structured data for LLM apps. The core building blocks are indexes, query engines, and agents, connected by a global Settings object that sets the default LLM, embedding model, and tokenizer. For OpenAI-compatible endpoints like Melious, the OpenAILike client and its OpenAILikeEmbedding companion are the first-party way to point the whole pipeline at a custom base URL. We recommend them over community shims: same auth, same config, fewer surprises. Every index, query engine, and agent built on top works unchanged.

Setup

Install

pip install llama-index llama-index-llms-openai-like llama-index-embeddings-openai-like

export MELIOUS_API_KEY=sk-mel-<YOUR_API_KEY>

Configure the default LLM and embedding model

LlamaIndex has a global Settings object that propagates to every index:

import os
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding

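Settings.llm = OpenAILike(
    model="glm-5.1",
    api_base="https://api.melious.ai/v1",  # Melious OpenAI-compatible base URL
    api_key=os.environ["MELIOUS_API_KEY"],
    is_chat_model=True,  # use the chat endpoint instead of text completions
    is_function_calling_model=True,  # enable tool calling for agents
)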

Settings.embed_model = OpenAILikeEmbedding(
    model_name="bge-m3",
    api_base="https://api.melious.ai/v1",
    api_key=os.environ["MELIOUS_API_KEY"],
)

is_chat_model=True and is_function_calling_model=True are both required for agent workflows. OpenAILike defaults to False on both because it targets models that may or may not support them — our chat models do.

First RAG over a directory

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("What does the onboarding flow do?")
print(response)

Both the completion call and the embedding call route through Melious.

Agent with tools

import asyncio
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.tools import FunctionTool

def multiply(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

agent = FunctionAgent(
    tools=[FunctionTool.from_defaults(fn=multiply)],
    llm=Settings.llm,
    system_prompt="You are a careful arithmetician.",
)

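async def main() -> None:
    # agent.run(...) returns an awaitable handler and needs a running event loop,
    # so await it inside a coroutine instead of passing it to asyncio.run directly.
    response = await agent.run("What's 17 times 23?")
    print(response)

asyncio.run(main())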

FunctionAgent speaks the OpenAI tool-calling shape natively, which is what Melious serves. The workflow versions of FunctionAgent and ReActAgent (both at llama_index.core.agent.workflow) are the supported path; the older ReActAgent.from_tools(...) + agent.chat(...) shape that still appears in stale tutorials is no longer the recommended API. For tool-heavy agents, glm-5.1 is a safe default; browse melious.ai/hub/models for alternatives.

Reranking

OpenAILikeRerank doesn't exist yet. For reranker models, either wire a SentenceTransformerRerank against a local model, or call our POST /v1/rerank endpoint directly inside a custom node postprocessor (a BaseNodePostprocessor subclass). See Models for available rerankers.
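A minimal sketch of calling POST /v1/rerank directly, kept free of LlamaIndex imports so the HTTP part can be exercised on its own. The request fields (`model`, `query`, `documents`, `top_n`), the response fields (`results`, `index`, `relevance_score`), and bearer-token auth are assumptions modeled on common rerank APIs, not confirmed against the Melious reference. The helper names `post_json` and `rerank` are hypothetical, and the model id is left for you to fill in from Models.

```python
import json
import os
import urllib.request
from typing import Callable, List, Optional, Tuple

RERANK_URL = "https://api.melious.ai/v1/rerank"  # base URL from this page + /rerank


def post_json(url: str, payload: dict, api_key: str) -> dict:
    """POST a JSON body with bearer auth and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def rerank(
    query: str,
    texts: List[str],
    model: str,
    top_n: int = 3,
    transport: Optional[Callable[[dict], dict]] = None,
) -> List[Tuple[int, float]]:
    """Return (original_index, score) pairs, best first, truncated to top_n."""
    # ASSUMED request shape; confirm field names against the API reference.
    payload = {"model": model, "query": query, "documents": texts, "top_n": top_n}
    if transport is None:
        api_key = os.environ["MELIOUS_API_KEY"]
        body = post_json(RERANK_URL, payload, api_key)
    else:
        body = transport(payload)  # injectable for tests, no network needed
    # ASSUMED response shape: {"results": [{"index": int, "relevance_score": float}]}
    ranked = sorted(body["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [(r["index"], r["relevance_score"]) for r in ranked[:top_n]]
```

Inside a custom postprocessor, one plausible wiring is to pass each node's text as `texts`, then rebuild the node list from the returned indices and overwrite each node's score with the rerank score.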

Embedding batch size

The per-input token limit and maximum batch size depend on the upstream provider. LlamaIndex batches embedding inputs automatically, and we enforce the real limits server-side. If you index very large corpora and hit rate limits, lower embed_batch_size:

Settings.embed_model = OpenAILikeEmbedding(
    model_name="bge-m3",
    api_base="https://api.melious.ai/v1",
    api_key=os.environ["MELIOUS_API_KEY"],
    embed_batch_size=64,  # number of inputs sent per embeddings request
)

Smaller batches mean more roundtrips but fewer rate-limit retries on large builds.
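To make the tradeoff concrete, here is a small back-of-the-envelope sketch. It assumes one embeddings request per batch; the helper name `embedding_requests` and the 10,000-chunk corpus are illustrative, not taken from any Melious or LlamaIndex API.

```python
import math


def embedding_requests(num_chunks: int, embed_batch_size: int) -> int:
    """Number of embedding requests needed if each request carries one batch."""
    if embed_batch_size <= 0:
        raise ValueError("embed_batch_size must be positive")
    return math.ceil(num_chunks / embed_batch_size)


# Compare a few batch sizes for a hypothetical corpus of 10,000 chunks.
for size in (100, 64, 16):
    print(f"batch size {size}: {embedding_requests(10_000, size)} requests")
```

Each request gets smaller as the batch size drops, at the cost of more round trips for the same corpus.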

What's different

Use OpenAILike, not OpenAI — the standard OpenAI class accepts api_base, but it infers is_chat_model, is_function_calling_model, and context_window from a hard-coded OpenAI model catalog. Unknown ids (anything that isn't an OpenAI model) get wrong defaults or fail validation. OpenAILike lets you set those explicitly.
Token accounting — LlamaIndex tracks tokens via TokenCountingHandler. Pass a tokenizer= callable when constructing it (or set Settings.tokenizer); the default is OpenAI's cl100k_base-shape tokenizer, so counts against our models are approximate but close enough for budgeting.
Async support — OpenAILike supports acomplete and astream_complete. Use them inside workflows for concurrent retrieval.
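As a sketch of the async point above, the helper below fans out several prompts concurrently while capping how many are in flight, which helps stay under rate limits. The name `gather_bounded` and the default cap of 4 are hypothetical; `call` can be any async callable that takes a prompt, such as `Settings.llm.acomplete`.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def gather_bounded(
    call: Callable[[str], Awaitable[T]],
    prompts: Sequence[str],
    max_concurrency: int = 4,
) -> List[T]:
    """Run call(prompt) for every prompt with at most max_concurrency in flight.

    Results come back in the same order as the prompts.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> T:
        async with semaphore:  # wait for a free slot before calling the model
            return await call(prompt)

    return await asyncio.gather(*(run_one(p) for p in prompts))
```

Inside an async function, usage might look like `results = await gather_bounded(Settings.llm.acomplete, prompts)`.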

When it breaks

is_chat_model=False behavior — LlamaIndex treats the model as completion-only and your prompts get rendered as single-turn strings. Set both is_chat_model=True and is_function_calling_model=True.
Rerankers silently skipped — if your pipeline includes a reranker you haven't wired up, LlamaIndex warns once and continues. Check logs.
Context-window overflow — chat models have large but not infinite contexts. Settings.context_window only drives client-side chunking and prompt-helper math; the server enforces the real limit.
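Because the server enforces the real limit, a cheap client-side pre-check can catch oversized prompts before a request is sent. This is a rough sketch: `fits_context`, `whitespace_tokenizer`, and the 512-token output reserve are all hypothetical, and the whitespace tokenizer is only a stand-in that will undercount compared with a real model tokenizer.

```python
from typing import Callable, List


def fits_context(
    prompt: str,
    tokenizer: Callable[[str], List],
    context_window: int,
    reserved_output: int = 512,
) -> bool:
    """True if the prompt plus the reserved output budget fits the window."""
    return len(tokenizer(prompt)) + reserved_output <= context_window


# Crude stand-in tokenizer for illustration only; real token counts will differ.
def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()
```

Swapping in the same tokenizer callable you pass to TokenCountingHandler keeps the estimate consistent with your token accounting.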

For errors and retry patterns, see Errors.
