Embeddings
Generate vector embeddings with an OpenAI-compatible API
Overview
Generate high-dimensional vector embeddings from text input using state-of-the-art embedding models. Perfect for semantic search, RAG (Retrieval-Augmented Generation), clustering, and similarity analysis.
Key Features:
- OpenAI-compatible API for easy migration
- Multi-provider routing for best price/performance
- Environment impact tracking (CO2, energy, water)
- Batch embedding support (up to 2048 inputs per request)
- Automatic failover and retry logic
Embeddings are vector representations of text that capture semantic meaning. Similar texts have similar embeddings (measured by cosine similarity).
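For example, with toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions; the values below are made up for illustration):

import numpy as np

# Toy vectors standing in for real embeddings (illustrative only)
cat = np.array([0.8, 0.1, 0.1])
kitten = np.array([0.75, 0.2, 0.05])
car = np.array([0.1, 0.9, 0.0])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(cat, kitten))  # ~0.99, semantically close
print(cosine_similarity(cat, car))     # ~0.23, semantically distant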
Authentication
Required: API Key
All requests must include your Melious API key in the Authorization header:
Authorization: Bearer {your_api_key}

Permissions: embeddings.create scope required.
Endpoints
Create Embeddings
POST /v1/embeddings

Generate vector embeddings from text input (single string or array of strings).
Request Body:
{
  "model": "qwen3-embedding-8b",
  "input": "The quick brown fox jumps over the lazy dog",
  "encoding_format": "float"
}

Request Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID (e.g., bge-m3, qwen3-embedding-8b) |
| input | string or string[] | Yes | Text to embed (single string or array, max 2048 items) |
| encoding_format | string | No | Format for embeddings: "float" (default) or "base64" |
| dimensions | integer | No | Number of dimensions (model-specific) |
| user | string | No | End-user identifier for abuse monitoring |
| Melious Extensions | | | |
| mode | string | No | Routing mode: "balanced", "speed", "price", "quality", "environment" |
| custom_weights | object | No | Custom routing weights (mutually exclusive with mode) |
| filters | object | No | Hard constraints for provider selection |
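When none of the preset modes fits, custom_weights lets you blend routing criteria yourself. The exact weight schema is not documented in this section; the keys below mirror the routing dimensions and are illustrative only:

{
  "model": "qwen3-embedding-8b",
  "input": "Example text",
  "custom_weights": {
    "price": 0.5,
    "speed": 0.3,
    "environment": 0.2
  }
}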
Routing Filters (optional):
{
  "filters": {
    "countries": ["NL", "FR", "DE"],
    "max_input_cost": 1.0,
    "max_carbon_intensity": 300,
    "min_speed_tps": 500
  }
}

Response (200 OK):
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.023, -0.015, 0.042, ...],
      "index": 0
    }
  ],
  "model": "qwen3-embedding-8b",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  },
  "environment_impact": {
    "energy_kwh": 0.0001,
    "carbon_g_co2": 0.04,
    "water_liters": 0.0001,
    "renewable_percent": 95,
    "pue": 1.15,
    "provider_id": "nebius",
    "location": "NL"
  }
}

Response Fields:
| Field | Type | Description |
|---|---|---|
| object | string | Always "list" |
| data | array | Array of embedding objects |
| data[].object | string | Always "embedding" |
| data[].embedding | float[] | Vector representation (dimensions vary by model) |
| data[].index | integer | Index in the input array |
| model | string | Model used for generation |
| usage | object | Token usage statistics |
| usage.prompt_tokens | integer | Input tokens processed |
| usage.total_tokens | integer | Total tokens (same as prompt_tokens for embeddings) |
| environment_impact | object | Environmental metrics (Melious extension) |
| environment_impact.energy_kwh | float | Energy consumed in kilowatt-hours |
| environment_impact.carbon_g_co2 | float | CO2 emissions in grams |
| environment_impact.water_liters | float | Water consumption in liters |
| environment_impact.renewable_percent | integer | Renewable electricity percentage (0-100) |
| environment_impact.pue | float | Power Usage Effectiveness |
| environment_impact.provider_id | string | Provider ID used |
| environment_impact.location | string | Country code (ISO 3166-1 alpha-2) |
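When encoding_format is "base64", each data[].embedding arrives as a base64 string rather than a float array. A minimal decoding sketch, assuming little-endian float32 packing (the convention used by the OpenAI API; verify against your own responses):

import base64
import numpy as np

def decode_embedding(b64: str) -> np.ndarray:
    # Assumption: vectors are packed as little-endian float32 (OpenAI-style)
    return np.frombuffer(base64.b64decode(b64), dtype="<f4")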
Status Codes:
| Code | Description |
|---|---|
| 200 | Success |
| 400 | Bad request - invalid parameters |
| 401 | Unauthorized - missing/invalid API key |
| 403 | Forbidden - insufficient permissions or energy |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
Code Examples
import httpx
import asyncio

async def create_embeddings():
    """Generate embeddings for text input."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.melious.ai/v1/embeddings",
            headers={"Authorization": "Bearer your_api_key"},
            json={
                "model": "qwen3-embedding-8b",
                "input": "The quick brown fox jumps over the lazy dog"
            }
        )
        data = response.json()
        embedding = data["data"][0]["embedding"]
        print(f"Generated {len(embedding)}-dimensional embedding")
        print(f"CO2 emissions: {data['environment_impact']['carbon_g_co2']:.2f}g")
        return embedding
# Batch embeddings example
async def create_batch_embeddings():
    """Generate embeddings for multiple texts."""
    texts = [
        "The quick brown fox",
        "jumps over the lazy dog",
        "Hello, world!"
    ]
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.melious.ai/v1/embeddings",
            headers={"Authorization": "Bearer your_api_key"},
            json={
                "model": "qwen3-embedding-8b",
                "input": texts,
                "mode": "price"  # Optimize for lowest cost
            }
        )
        data = response.json()
        embeddings = [item["embedding"] for item in data["data"]]
        print(f"Generated {len(embeddings)} embeddings")
        return embeddings

# Example usage
asyncio.run(create_embeddings())
asyncio.run(create_batch_embeddings())

// Generate embeddings for text input
const createEmbeddings = async () => {
  const response = await fetch(
    'https://api.melious.ai/v1/embeddings',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer your_api_key',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'qwen3-embedding-8b',
        input: 'The quick brown fox jumps over the lazy dog'
      })
    }
  );
  const data = await response.json();
  const embedding = data.data[0].embedding;
  console.log(`Generated ${embedding.length}-dimensional embedding`);
  console.log(`CO2 emissions: ${data.environment_impact.carbon_g_co2.toFixed(2)}g`);
  return embedding;
};
// Batch embeddings example
const createBatchEmbeddings = async () => {
  const texts = [
    'The quick brown fox',
    'jumps over the lazy dog',
    'Hello, world!'
  ];
  const response = await fetch(
    'https://api.melious.ai/v1/embeddings',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer your_api_key',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'qwen3-embedding-8b',
        input: texts,
        mode: 'price' // Optimize for lowest cost
      })
    }
  );
  const data = await response.json();
  const embeddings = data.data.map(item => item.embedding);
  console.log(`Generated ${embeddings.length} embeddings`);
  return embeddings;
};

// Example usage
createEmbeddings();
createBatchEmbeddings();

# Single text embedding
curl -X POST "https://api.melious.ai/v1/embeddings" \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
# Batch embeddings with routing
curl -X POST "https://api.melious.ai/v1/embeddings" \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": ["Text 1", "Text 2", "Text 3"],
    "mode": "price",
    "filters": {
      "countries": ["NL", "FR", "DE"],
      "max_input_cost": 1.0
    }
  }'
# Long context embedding
curl -X POST "https://api.melious.ai/v1/embeddings" \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": "Your long document text here..."
  }'

Error Handling
Handle errors gracefully by checking status codes and error messages. Implement exponential backoff for transient errors (5xx, 429).
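A minimal sketch of that pattern, assuming httpx; the attempt count and delays are illustrative, not prescribed by the API:

import asyncio
import httpx

async def embed_with_retry(payload: dict, api_key: str, max_attempts: int = 4):
    """POST to /v1/embeddings, retrying transient failures (429, 5xx)."""
    async with httpx.AsyncClient() as client:
        for attempt in range(max_attempts):
            response = await client.post(
                "https://api.melious.ai/v1/embeddings",
                headers={"Authorization": f"Bearer {api_key}"},
                json=payload,
            )
            if response.status_code == 200:
                return response.json()
            if response.status_code not in (429, 500, 502, 503, 504):
                response.raise_for_status()  # non-retryable client error
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ... exponential backoff
        response.raise_for_status()  # give up after max_attempts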
Common Errors:
| Error Code | Description | Solution |
|---|---|---|
| AUTH_INVALID_API_KEY | Invalid API key | Verify API key is correct and active |
| VALIDATION_INVALID_VALUE | Invalid parameter | Check request body matches documentation |
| INFERENCE_PROVIDER_ERROR | Provider request failed | Retry with exponential backoff or change routing mode |
| BILLING_INSUFFICIENT_ENERGY | Not enough energy | Top up balance or upgrade plan |
| INFERENCE_NO_PROVIDERS_AVAILABLE | No providers match filters | Relax filters or use different routing mode |
Error Response Format:
{
  "status": "error",
  "code": "INFERENCE_PROVIDER_ERROR",
  "message": "All providers failed after 3 attempts",
  "details": {
    "providers_tried": ["nebius", "scaleway"],
    "last_error": "Connection timeout"
  }
}

Best Practices
Batch Your Requests
Process multiple texts in a single request (up to 2048 items) to reduce latency and cost; a chunking helper for larger corpora follows the examples below:
# ✅ Efficient: Single batch request
# Assumes `client` is an httpx.AsyncClient with base_url and auth header configured
response = await client.post("/v1/embeddings", json={
    "model": "qwen3-embedding-8b",
    "input": ["text1", "text2", "text3", ...]  # Up to 2048 items
})
# ❌ Inefficient: Multiple individual requests
for text in texts:
    response = await client.post("/v1/embeddings", json={
        "model": "qwen3-embedding-8b",
        "input": text
    })
Choose the Right Model
Balance cost, performance, and quality based on your use case:
| Model | Brand | Context | Best For |
|---|---|---|---|
| bge-m3 | BAAI | 8K | Multilingual, high performance |
| bge-multilingual-gemma2 | BAAI | 8K | Gemma-based multilingual |
| bge-en-icl | BAAI | 32K | Long-context English |
| bge-large-en-v1.5 | BAAI | 512 | English, high quality |
| bge-base-en-v1.5 | BAAI | 512 | English, balanced |
| qwen3-embedding-8b | Qwen | 32K | Long context |
| e5-mistral-7b-instruct | intfloat | 32K | Instruction-tuned |
| paraphrase-multilingual-mpnet | Sentence Transformers | 512 | Multilingual paraphrasing |
Optimize with Routing
Use Melious routing modes to optimize for your priorities:
# Optimize for lowest cost
response = await client.post("/v1/embeddings", json={
    "model": "qwen3-embedding-8b",
    "input": texts,
    "mode": "price"
})

# Optimize for environmental impact
response = await client.post("/v1/embeddings", json={
    "model": "qwen3-embedding-8b",
    "input": texts,
    "mode": "environment",
    "filters": {
        "max_carbon_intensity": 200,  # g CO2/kWh
        "countries": ["NL", "FR", "DE"]  # European data residency
    }
})

Normalize Embeddings
Normalize embeddings to unit length for cosine similarity calculations:
import numpy as np

def normalize_embeddings(embedding):
    """Normalize embedding to unit length."""
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm > 0 else embedding

# Usage
embedding = data["data"][0]["embedding"]
normalized = normalize_embeddings(np.array(embedding))
Use Cases
Semantic Search
async def semantic_search(query: str, documents: list[str]):
    """Find the most relevant documents using embeddings."""
    # Assumes `client` is an httpx.AsyncClient with base_url and auth configured
    # Generate embeddings for the query and all documents in one request
    all_texts = [query] + documents
    response = await client.post("/v1/embeddings", json={
        "model": "qwen3-embedding-8b",
        "input": all_texts
    })
    embeddings = [item["embedding"] for item in response.json()["data"]]
    query_emb = np.array(embeddings[0])
    doc_embs = np.array(embeddings[1:])
    # Calculate cosine similarity
    similarities = np.dot(doc_embs, query_emb) / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
    )
    # Return the top 3 most similar documents
    top_indices = np.argsort(similarities)[-3:][::-1]
    return [(documents[i], similarities[i]) for i in top_indices]
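A hypothetical call, where documents is any list of strings:

results = await semantic_search("How do embeddings work?", documents)
for doc, score in results:
    print(f"{score:.3f}  {doc[:60]}")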
Clustering
from sklearn.cluster import KMeans

async def cluster_documents(documents: list[str], n_clusters: int = 5):
    """Cluster documents by semantic similarity."""
    # Generate embeddings
    response = await client.post("/v1/embeddings", json={
        "model": "qwen3-embedding-8b",
        "input": documents,
        "mode": "price"
    })
    embeddings = [item["embedding"] for item in response.json()["data"]]
    # Perform k-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)
    return clusters
Retrieval-Augmented Generation (RAG)
async def rag_search(query: str, knowledge_base: list[str], top_k: int = 3):
    """Retrieve the most relevant context for RAG."""
    # Find most relevant documents
    relevant_docs = await semantic_search(query, knowledge_base)
    # Combine top documents as context
    context = "\n\n".join([doc for doc, _ in relevant_docs[:top_k]])
    # Use context with chat completion
    chat_response = await client.post("/v1/chat/completions", json={
        "model": "gpt-oss-120b",
        "messages": [
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query}
        ]
    })
    return chat_response.json()["choices"][0]["message"]["content"]
Performance
Typical Latencies:
| Request Type | Latency (p50) | Latency (p95) |
|---|---|---|
| Single input | 50-150ms | 200-400ms |
| Batch (10 items) | 100-300ms | 400-800ms |
| Batch (100 items) | 500-1500ms | 2-4s |
Optimization Tips:
- Batch requests - Process multiple texts in one request
- Choose efficient models - qwen3-embedding-8b for general use, bge-m3 for multilingual
- Match context to model - Use short-context models (512) for short texts
- Enable caching - Cache embeddings for frequently used texts (see the sketch below)
- Use routing - mode: "speed" for lowest latency
See Also
- Chat Completions - Generate text with LLMs
- Models - List available models
- Streaming - Real-time responses