LlamaIndex
RAG pipelines and agent loops on Melious — OpenAILike client for models, embeddings, and reranking
LlamaIndex (originally GPT Index) is a Python and TypeScript framework focused on retrieval and structured data for LLM apps. The core building blocks are indexes, query engines, and agents — connected by a global Settings object that sets the default LLM, embedding model, and token counter. For OpenAI-shape endpoints like Melious, the OpenAILike client and OpenAILikeEmbedding companion are the first-party way to point the whole pipeline at a custom base URL. We recommend them over community shims: same auth, same config, fewer surprises. Every index, query engine, and agent built on top works unchanged.
Setup
Install
pip install llama-index llama-index-llms-openai-like llama-index-embeddings-openai-like
export MELIOUS_API_KEY=sk-mel-<YOUR_API_KEY>
Configure the default LLM and embedding model
LlamaIndex has a global Settings object that propagates to every index:
import os
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding
Settings.llm = OpenAILike(
    model="glm-5.1",
    api_base="https://api.melious.ai/v1",
    api_key=os.environ["MELIOUS_API_KEY"],
    # Both flags default to False on OpenAILike, so set them explicitly.
    is_chat_model=True,
    is_function_calling_model=True,
)
Settings.embed_model = OpenAILikeEmbedding(
    model_name="bge-m3",
    api_base="https://api.melious.ai/v1",
    api_key=os.environ["MELIOUS_API_KEY"],
)
Both is_chat_model=True and is_function_calling_model=True are required for agent workflows. OpenAILike defaults to False on both because it targets models that may or may not support them; our chat models do.
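Because the LLM and the embedding model share the same base URL and key, the connection settings can be built once and reused. The following is a minimal sketch, not part of LlamaIndex or Melious: the helper name melious_kwargs and the constant MELIOUS_API_BASE are hypothetical, and only the URL and environment variable come from the setup above.

```python
import os

# Base URL taken from the configuration above.
MELIOUS_API_BASE = "https://api.melious.ai/v1"


def melious_kwargs(env_var: str = "MELIOUS_API_KEY") -> dict:
    """Return the connection settings shared by the LLM and embedding clients.

    Hypothetical helper: fails early with a clear message when the key is
    missing, instead of surfacing a KeyError deep inside pipeline setup.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"{env_var} is not set; export it before building the pipeline")
    return {"api_base": MELIOUS_API_BASE, "api_key": api_key}
```

With this in place, the two constructors can take the shared settings by unpacking, for example OpenAILike(model="glm-5.1", is_chat_model=True, is_function_calling_model=True, **melious_kwargs()).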
First RAG over a directory
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What does the onboarding flow do?")
print(response)
Both the completion call and the embedding call route through Melious.
Agent with tools
import asyncio
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.tools import FunctionTool
def multiply(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

agent = FunctionAgent(
    tools=[FunctionTool.from_defaults(fn=multiply)],
    llm=Settings.llm,
    system_prompt="You are a careful arithmetician.",
)
async def main() -> None:
    # agent.run(...) starts a workflow and must be awaited inside a running event loop.
    response = await agent.run("What's 17 times 23?")
    print(response)

asyncio.run(main())
FunctionAgent speaks the OpenAI tool-calling shape natively, which is what Melious serves. The workflow versions of FunctionAgent and ReActAgent (both at llama_index.core.agent.workflow) are the supported path; the older ReActAgent.from_tools(...) + agent.chat(...) shape that still appears in stale tutorials is no longer the recommended API. For tool-heavy agents, glm-5.1 is a safe default; browse melious.ai/hub/models for alternatives.
Reranking
OpenAILikeRerank doesn't exist yet — for reranker models, wire a SentenceTransformerRerank against a local model, or call our POST /v1/rerank endpoint directly inside a custom NodePostprocessor. See Models for available rerankers.
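As a rough illustration of the direct-call route, the sketch below posts to /v1/rerank using only the standard library. The request and response shapes are assumptions: it guesses a Cohere/Jina-style body (model, query, documents, top_n), a results list with index and relevance_score fields, and Bearer authentication. Check the endpoint reference before relying on any of them. The function names are hypothetical.

```python
import json
import os
import urllib.request

RERANK_URL = "https://api.melious.ai/v1/rerank"


def build_rerank_payload(model, query, texts, top_n):
    # ASSUMPTION: Cohere/Jina-style request body; not confirmed by this page.
    return {"model": model, "query": query, "documents": list(texts), "top_n": top_n}


def order_by_rerank(texts, results):
    # ASSUMPTION: each result has "index" (position in texts) and "relevance_score".
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(texts[r["index"]], r["relevance_score"]) for r in ranked]


def rerank(model, query, texts, top_n=3):
    """POST the candidates to the rerank endpoint and return (text, score) pairs."""
    body = json.dumps(build_rerank_payload(model, query, texts, top_n)).encode("utf-8")
    request = urllib.request.Request(
        RERANK_URL,
        data=body,
        headers={
            # ASSUMPTION: Bearer auth, as on OpenAI-shape endpoints.
            "Authorization": f"Bearer {os.environ['MELIOUS_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        results = json.load(resp)["results"]
    return order_by_rerank(texts, results)
```

Inside LlamaIndex, a function like this would typically be called from the _postprocess_nodes method of a BaseNodePostprocessor subclass, passing each node's text and mapping the returned scores back onto the nodes.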
Embedding batch size
Per-input token limit and batch size depend on the upstream provider; LlamaIndex batches automatically, and we enforce real limits server-side. If you index very large corpora and hit rate limits, tune embed_batch_size down:
Settings.embed_model = OpenAILikeEmbedding(
    model_name="bge-m3",
    api_base="https://api.melious.ai/v1",
    api_key=os.environ["MELIOUS_API_KEY"],
    embed_batch_size=64,
)
Smaller batches mean more roundtrips but fewer rate-limit retries on large builds.
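To get a feel for the tradeoff, the number of embedding requests for an index build can be estimated as the chunk count divided by the batch size, rounded up. This is a back-of-the-envelope sketch that assumes one request per batch; the function name and the chunk counts are hypothetical.

```python
import math


def embedding_request_count(num_chunks: int, embed_batch_size: int) -> int:
    """Estimate how many embedding requests a build sends (one per batch)."""
    if embed_batch_size <= 0:
        raise ValueError("embed_batch_size must be positive")
    return math.ceil(num_chunks / embed_batch_size)


# Hypothetical corpus of 10,000 chunks:
print(embedding_request_count(10_000, 64))  # 157 requests
print(embedding_request_count(10_000, 16))  # 625 requests
```

Dropping the batch size from 64 to 16 roughly quadruples the request count here, so lower it only as far as the rate limits require.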
What's different
- Use OpenAILike, not OpenAI: the standard OpenAI class accepts api_base, but it infers is_chat_model, is_function_calling_model, and context_window from a hard-coded OpenAI model catalog. Unknown ids (anything that isn't an OpenAI model) get wrong defaults or fail validation. OpenAILike lets you set those explicitly.
- Token accounting: LlamaIndex tracks tokens via TokenCountingHandler. Pass a tokenizer= callable when constructing it (or set Settings.tokenizer); the default is OpenAI's cl100k_base-shape tokenizer, so counts against our models are approximate but close enough for budgeting.
- Async support: OpenAILike supports acomplete and astream_complete. Use them inside workflows for concurrent retrieval.
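The async methods make it straightforward to fan out several completions at once. The sketch below is one way to do that with asyncio.gather; the helper name complete_many is hypothetical, and it only assumes that the llm object exposes acomplete and that each response has a text attribute, as OpenAILike responses do.

```python
import asyncio


async def complete_many(llm, prompts):
    """Send all prompts concurrently and return the response texts in input order."""
    # asyncio.gather returns results in the same order as its arguments,
    # regardless of which request finishes first.
    responses = await asyncio.gather(*(llm.acomplete(p) for p in prompts))
    return [r.text for r in responses]
```

Called as await complete_many(Settings.llm, ["Summarize A", "Summarize B"]) inside a workflow step or any other coroutine, this issues both requests before waiting on either.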
When it breaks
- is_chat_model=False behavior: LlamaIndex treats the model as completion-only and your prompts get rendered as single-turn strings. Set both is_chat_model=True and is_function_calling_model=True.
- Rerankers silently skipped: if your pipeline includes a reranker you haven't wired up, LlamaIndex warns once and continues. Check logs.
- Context-window overflow: chat models have large but not infinite contexts. Settings.context_window only drives client-side chunking and prompt-helper math; the server enforces the real limit.
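The client-side math for context windows is roughly a subtraction: whatever remains after reserving room for the reply and the prompt scaffolding is available for retrieved chunks. The sketch below only illustrates that arithmetic and is not LlamaIndex's actual implementation; the function name and the token counts are hypothetical.

```python
def context_budget(context_window: int, num_output: int, prompt_tokens: int) -> int:
    """Tokens left for retrieved chunks after reserving the reply and the prompt."""
    remaining = context_window - num_output - prompt_tokens
    # A negative remainder means the prompt alone already overflows the window.
    return max(remaining, 0)


# Hypothetical numbers: 32,768-token window, 1,024 reserved for the reply,
# 600 tokens of system prompt and question.
print(context_budget(32_768, 1_024, 600))  # 31144
```

If the configured window is larger than what the server accepts, this budget looks healthy on the client while the request still fails, so keep Settings.context_window at or below the model's real limit.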
Errors and retry patterns: Errors.