Embeddings

Generate vector embeddings with an OpenAI-compatible API

Overview

Generate high-dimensional vector embeddings from text input using state-of-the-art embedding models. Perfect for semantic search, RAG (Retrieval-Augmented Generation), clustering, and similarity analysis.

Key Features:

  • OpenAI-compatible API for easy migration
  • Multi-provider routing for best price/performance
  • Environmental impact tracking (CO2, energy, water)
  • Batch embedding support (up to 2048 inputs per request)
  • Automatic failover and retry logic

Embeddings are vector representations of text that capture semantic meaning. Similar texts have similar embeddings (measured by cosine similarity).
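
For a quick illustration, cosine similarity between two vectors can be computed with numpy (a minimal sketch; the vectors below are tiny placeholders, not real model output):

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny placeholder vectors; real embeddings have hundreds or thousands of dimensions
print(cosine_similarity([0.1, 0.9, 0.2], [0.12, 0.85, 0.25]))  # near 1.0 (similar)
print(cosine_similarity([0.1, 0.9, 0.2], [0.9, -0.1, 0.4]))    # much lower (dissimilar)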


Authentication

Required: API Key

All requests must include your Melious API key in the Authorization header:

Authorization: Bearer {your_api_key}

Permissions: embeddings.create scope required.
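
The shorter Python snippets later on this page call client.post(...) with a relative path; they assume a pre-configured async client along these lines (a minimal sketch; the base URL and header come from this page, the setup itself is illustrative):

import httpx

# Shared client for the abbreviated snippets below: base_url lets them use
# relative paths like "/v1/embeddings", and the Authorization header is
# attached to every request automatically.
client = httpx.AsyncClient(
    base_url="https://api.melious.ai",
    headers={"Authorization": "Bearer your_api_key"},
    timeout=30.0,
)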


Endpoints

Create Embeddings

POST /v1/embeddings

Generate vector embeddings from text input (single string or array of strings).

Request Body:

{
  "model": "qwen3-embedding-8b",
  "input": "The quick brown fox jumps over the lazy dog",
  "encoding_format": "float"
}

Request Fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| model | string | Yes | Model ID (e.g., bge-m3, qwen3-embedding-8b) |
| input | string or string[] | Yes | Text to embed (single string or array, max 2048 items) |
| encoding_format | string | No | Format for embeddings: "float" (default) or "base64" |
| dimensions | integer | No | Number of dimensions (model-specific) |
| user | string | No | End-user identifier for abuse monitoring |

Melious Extensions:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| mode | string | No | Routing mode: "balanced", "speed", "price", "quality", or "environment" |
| custom_weights | object | No | Custom routing weights (mutually exclusive with mode) |
| filters | object | No | Hard constraints for provider selection |
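
If you request encoding_format: "base64", each embedding is returned as a base64 string rather than a JSON float array, which is more compact over the wire. A decoding sketch, assuming the little-endian float32 convention used by the OpenAI-style APIs this endpoint mirrors:

import base64
import numpy as np

def decode_base64_embedding(b64: str) -> np.ndarray:
    """Decode a base64-encoded embedding into a float32 vector."""
    raw = base64.b64decode(b64)
    return np.frombuffer(raw, dtype="<f4")  # little-endian 32-bit floats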

Routing Filters (optional):

{
  "filters": {
    "countries": ["NL", "FR", "DE"],
    "max_input_cost": 1.0,
    "max_carbon_intensity": 300,
    "min_speed_tps": 500
  }
}

Response (200 OK):

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.023, -0.015, 0.042, ...],
      "index": 0
    }
  ],
  "model": "qwen3-embedding-8b",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  },
  "environment_impact": {
    "energy_kwh": 0.0001,
    "carbon_g_co2": 0.04,
    "water_liters": 0.0001,
    "renewable_percent": 95,
    "pue": 1.15,
    "provider_id": "nebius",
    "location": "NL"
  }
}

Response Fields:

| Field | Type | Description |
|-------|------|-------------|
| object | string | Always "list" |
| data | array | Array of embedding objects |
| data[].object | string | Always "embedding" |
| data[].embedding | float[] | Vector representation (dimensions vary by model) |
| data[].index | integer | Index in the input array |
| model | string | Model used for generation |
| usage | object | Token usage statistics |
| usage.prompt_tokens | integer | Input tokens processed |
| usage.total_tokens | integer | Total tokens (same as prompt_tokens for embeddings) |
| environment_impact | object | Environmental metrics (Melious extension) |
| environment_impact.energy_kwh | float | Energy consumed in kilowatt-hours |
| environment_impact.carbon_g_co2 | float | CO2 emissions in grams |
| environment_impact.water_liters | float | Water consumption in liters |
| environment_impact.renewable_percent | integer | Renewable electricity percentage (0-100) |
| environment_impact.pue | float | Power Usage Effectiveness of the data center |
| environment_impact.provider_id | string | Provider ID used |
| environment_impact.location | string | Country code (ISO 3166-1 alpha-2) |
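
Because every response includes environment_impact, you can accumulate the metrics across requests to report a job-level footprint (a minimal sketch; the field names are from the response above):

def accumulate_impact(totals: dict, response_json: dict) -> dict:
    """Add one response's environment_impact metrics to running totals."""
    impact = response_json.get("environment_impact", {})
    for key in ("energy_kwh", "carbon_g_co2", "water_liters"):
        totals[key] = totals.get(key, 0.0) + impact.get(key, 0.0)
    return totals

# Usage: keep a totals dict and call accumulate_impact(totals, data) after each request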

Status Codes:

| Code | Description |
|------|-------------|
| 200 | Success |
| 400 | Bad request - invalid parameters |
| 401 | Unauthorized - missing or invalid API key |
| 403 | Forbidden - insufficient permissions or energy balance |
| 429 | Rate limit exceeded |
| 500 | Internal server error |

Code Examples

Python:

import httpx
import asyncio

async def create_embeddings():
    """Generate embeddings for text input."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.melious.ai/v1/embeddings",
            headers={"Authorization": "Bearer your_api_key"},
            json={
                "model": "qwen3-embedding-8b",
                "input": "The quick brown fox jumps over the lazy dog"
            }
        )
        data = response.json()
        embedding = data["data"][0]["embedding"]
        print(f"Generated {len(embedding)}-dimensional embedding")
        print(f"CO2 emissions: {data['environment_impact']['carbon_g_co2']:.2f}g")
        return embedding

# Batch embeddings example
async def create_batch_embeddings():
    """Generate embeddings for multiple texts."""
    texts = [
        "The quick brown fox",
        "jumps over the lazy dog",
        "Hello, world!"
    ]

    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.melious.ai/v1/embeddings",
            headers={"Authorization": "Bearer your_api_key"},
            json={
                "model": "qwen3-embedding-8b",
                "input": texts,
                "mode": "price"  # Optimize for lowest cost
            }
        )
        data = response.json()
        embeddings = [item["embedding"] for item in data["data"]]
        print(f"Generated {len(embeddings)} embeddings")
        return embeddings

# Example usage
asyncio.run(create_embeddings())
asyncio.run(create_batch_embeddings())

JavaScript:

// Generate embeddings for text input
const createEmbeddings = async () => {
  const response = await fetch(
    'https://api.melious.ai/v1/embeddings',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer your_api_key',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'qwen3-embedding-8b',
        input: 'The quick brown fox jumps over the lazy dog'
      })
    }
  );

  const data = await response.json();
  const embedding = data.data[0].embedding;
  console.log(`Generated ${embedding.length}-dimensional embedding`);
  console.log(`CO2 emissions: ${data.environment_impact.carbon_g_co2.toFixed(2)}g`);
  return embedding;
};

// Batch embeddings example
const createBatchEmbeddings = async () => {
  const texts = [
    'The quick brown fox',
    'jumps over the lazy dog',
    'Hello, world!'
  ];

  const response = await fetch(
    'https://api.melious.ai/v1/embeddings',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer your_api_key',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'qwen3-embedding-8b',
        input: texts,
        mode: 'price'  // Optimize for lowest cost
      })
    }
  );

  const data = await response.json();
  const embeddings = data.data.map(item => item.embedding);
  console.log(`Generated ${embeddings.length} embeddings`);
  return embeddings;
};

// Example usage
createEmbeddings();
createBatchEmbeddings();

cURL:

# Single text embedding
curl -X POST "https://api.melious.ai/v1/embeddings" \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": "The quick brown fox jumps over the lazy dog"
  }'

# Batch embeddings with routing
curl -X POST "https://api.melious.ai/v1/embeddings" \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": ["Text 1", "Text 2", "Text 3"],
    "mode": "price",
    "filters": {
      "countries": ["NL", "FR", "DE"],
      "max_input_cost": 1.0
    }
  }'

# Long context embedding
curl -X POST "https://api.melious.ai/v1/embeddings" \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": "Your long document text here..."
  }'

Error Handling

Handle errors gracefully by checking status codes and error messages. Implement exponential backoff for transient errors (5xx, 429).
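
A minimal retry sketch along those lines, using httpx (the status codes and retry guidance are from this page; the backoff timing is illustrative):

import asyncio
import httpx

async def embed_with_retry(payload: dict, max_attempts: int = 3) -> dict:
    """POST to /v1/embeddings, retrying 429 and 5xx responses with exponential backoff."""
    async with httpx.AsyncClient() as client:
        for attempt in range(max_attempts):
            response = await client.post(
                "https://api.melious.ai/v1/embeddings",
                headers={"Authorization": "Bearer your_api_key"},
                json=payload,
            )
            if response.status_code == 200:
                return response.json()
            if response.status_code == 429 or response.status_code >= 500:
                await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
                continue
            response.raise_for_status()  # other 4xx errors are not retryable
    raise RuntimeError(f"Request failed after {max_attempts} attempts")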

Common Errors:

| Error Code | Description | Solution |
|------------|-------------|----------|
| AUTH_INVALID_API_KEY | Invalid API key | Verify the API key is correct and active |
| VALIDATION_INVALID_VALUE | Invalid parameter | Check that the request body matches the documentation |
| INFERENCE_PROVIDER_ERROR | Provider request failed | Retry with exponential backoff or change routing mode |
| BILLING_INSUFFICIENT_ENERGY | Not enough energy | Top up your balance or upgrade your plan |
| INFERENCE_NO_PROVIDERS_AVAILABLE | No providers match filters | Relax the filters or use a different routing mode |

Error Response Format:

{
  "status": "error",
  "code": "INFERENCE_PROVIDER_ERROR",
  "message": "All providers failed after 3 attempts",
  "details": {
    "providers_tried": ["nebius", "scaleway"],
    "last_error": "Connection timeout"
  }
}

Best Practices

Batch Your Requests

Process multiple texts in a single request (up to 2048 items) to reduce latency and cost:

# ✅ Efficient: Single batch request
# (`client` here is a pre-configured httpx.AsyncClient; see the Authentication section)
response = await client.post("/v1/embeddings", json={
    "model": "qwen3-embedding-8b",
    "input": ["text1", "text2", "text3", ...]  # Up to 2048 items
})

# ❌ Inefficient: Multiple individual requests
for text in texts:
    response = await client.post("/v1/embeddings", json={
        "model": "qwen3-embedding-8b",
        "input": text
    })
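
For corpora larger than the 2048-item limit, split the input into batches first (a sketch using the shared client from the Authentication section):

MAX_BATCH = 2048  # per-request input limit documented above

async def embed_corpus(texts: list[str]) -> list[list[float]]:
    """Embed an arbitrarily large list of texts in max-size batches."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), MAX_BATCH):
        batch = texts[start:start + MAX_BATCH]
        response = await client.post("/v1/embeddings", json={
            "model": "qwen3-embedding-8b",
            "input": batch,
        })
        embeddings.extend(item["embedding"] for item in response.json()["data"])
    return embeddings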

Choose the Right Model

Balance cost, performance, and quality based on your use case:

| Model | Brand | Context | Best For |
|-------|-------|---------|----------|
| bge-m3 | BAAI | 8K | Multilingual, high performance |
| bge-multilingual-gemma2 | BAAI | 8K | Gemma-based multilingual |
| bge-en-icl | BAAI | 32K | Long-context English |
| bge-large-en-v1.5 | BAAI | 512 | English, high quality |
| bge-base-en-v1.5 | BAAI | 512 | English, balanced |
| qwen3-embedding-8b | Qwen | 32K | Long context |
| e5-mistral-7b-instruct | intfloat | 32K | Instruction-tuned |
| paraphrase-multilingual-mpnet | Sentence Transformers | 512 | Multilingual paraphrasing |

Optimize with Routing

Use Melious routing modes to optimize for your priorities:

# Optimize for lowest cost
response = await client.post("/v1/embeddings", json={
    "model": "qwen3-embedding-8b",
    "input": texts,
    "mode": "price"
})

# Optimize for environmental impact
response = await client.post("/v1/embeddings", json={
    "model": "qwen3-embedding-8b",
    "input": texts,
    "mode": "environment",
    "filters": {
        "max_carbon_intensity": 200,  # g CO2/kWh
        "countries": ["NL", "FR", "DE"]  # European data residency
    }
})

Normalize Embeddings

Normalize embeddings to unit length so that cosine similarity reduces to a simple dot product:

import numpy as np

def normalize_embedding(embedding: np.ndarray) -> np.ndarray:
    """Normalize an embedding vector to unit length."""
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm > 0 else embedding

# Usage
embedding = data["data"][0]["embedding"]
normalized = normalize_embedding(np.array(embedding))

Use Cases

Semantic Search

async def semantic_search(query: str, documents: list[str], top_k: int = 3):
    """Find the most relevant documents using embeddings."""
    # Embed the query and all documents in a single batch request
    # (`client` is the pre-configured httpx.AsyncClient from the Authentication section)
    all_texts = [query] + documents
    response = await client.post("/v1/embeddings", json={
        "model": "qwen3-embedding-8b",
        "input": all_texts
    })

    embeddings = [item["embedding"] for item in response.json()["data"]]
    query_emb = np.array(embeddings[0])
    doc_embs = np.array(embeddings[1:])

    # Calculate cosine similarity
    similarities = np.dot(doc_embs, query_emb) / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
    )

    # Return the top_k most similar documents, best first
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [(documents[i], similarities[i]) for i in top_indices]

Clustering

from sklearn.cluster import KMeans

async def cluster_documents(documents: list[str], n_clusters: int = 5):
    """Cluster documents by semantic similarity."""
    # Generate embeddings
    response = await client.post("/v1/embeddings", json={
        "model": "qwen3-embedding-8b",
        "input": documents,
        "mode": "price"
    })

    embeddings = [item["embedding"] for item in response.json()["data"]]

    # Perform k-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)

    return clusters

Retrieval-Augmented Generation (RAG)

async def rag_search(query: str, knowledge_base: list[str], top_k: int = 3):
    """Retrieve most relevant context for RAG."""
    # Find the most relevant documents
    relevant_docs = await semantic_search(query, knowledge_base, top_k=top_k)

    # Combine the retrieved documents into a single context block
    context = "\n\n".join(doc for doc, _ in relevant_docs)

    # Use context with chat completion
    chat_response = await client.post("/v1/chat/completions", json={
        "model": "gpt-oss-120b",
        "messages": [
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query}
        ]
    })

    return chat_response.json()["choices"][0]["message"]["content"]

Performance

Typical Latencies:

| Request Type | Latency (p50) | Latency (p95) |
|--------------|---------------|---------------|
| Single input | 50-150 ms | 200-400 ms |
| Batch (10 items) | 100-300 ms | 400-800 ms |
| Batch (100 items) | 500-1500 ms | 2-4 s |

Optimization Tips:

  1. Batch requests - Process multiple texts in one request
  2. Choose efficient models - qwen3-embedding-8b for general use, bge-m3 for multilingual
  3. Match context to model - Use short-context models (512 tokens) for short texts
  4. Enable caching - Cache embeddings for frequently used texts (see the sketch after this list)
  5. Use routing - mode: "speed" for lowest latency
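
A minimal in-memory cache keyed by a hash of the text (illustrative; any persistent key-value store follows the same pattern, and `client` is the pre-configured httpx.AsyncClient from the Authentication section):

import hashlib

_embedding_cache: dict[str, list[float]] = {}

async def embed_cached(text: str) -> list[float]:
    """Return a cached embedding when available; otherwise fetch and cache it."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        response = await client.post("/v1/embeddings", json={
            "model": "qwen3-embedding-8b",
            "input": text,
        })
        _embedding_cache[key] = response.json()["data"][0]["embedding"]
    return _embedding_cache[key]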
