Embeddings

Generate vector embeddings with an OpenAI-compatible API

Overview

Generate high-dimensional vector embeddings from text input using state-of-the-art embedding models. Perfect for semantic search, RAG (Retrieval-Augmented Generation), clustering, and similarity analysis.

Key Features:

  • OpenAI-compatible API for easy migration
  • Multi-provider routing for best price/performance
  • Environmental impact tracking (CO2, energy, water)
  • Batch embedding support (up to 2048 inputs per request)
  • Automatic failover and retry logic

Embeddings are vector representations of text that capture semantic meaning. Similar texts have similar embeddings (measured by cosine similarity).
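
For a quick illustration, cosine similarity between two vectors can be computed with numpy (a minimal sketch; the vectors below are tiny placeholders, not real model output):

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny placeholder vectors; real embeddings have hundreds or thousands of dimensions
print(cosine_similarity([0.1, 0.9, 0.2], [0.12, 0.85, 0.25]))  # near 1.0 (similar)
print(cosine_similarity([0.1, 0.9, 0.2], [0.9, -0.1, 0.4]))    # much lower (dissimilar)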


Authentication

Required: API Key

All requests must include your Melious API key in the Authorization header:

Authorization: Bearer {your_api_key}

Permissions: embeddings.create scope required.
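
The shorter Python snippets later on this page call client.post(...) with a relative path; they assume a pre-configured async client along these lines (a minimal sketch; the base URL and header come from this page, the setup itself is illustrative):

import httpx

# Shared client for the abbreviated snippets below: base_url lets them use
# relative paths like "/v1/embeddings", and the Authorization header is
# attached to every request automatically.
client = httpx.AsyncClient(
    base_url="https://api.melious.ai",
    headers={"Authorization": "Bearer your_api_key"},
    timeout=30.0,
)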


Endpoints

Create Embeddings

POST /v1/embeddings

Generate vector embeddings from text input (single string or array of strings).

Request Body:

{
  "model": "qwen3-embedding-8b",
  "input": "The quick brown fox jumps over the lazy dog",
  "encoding_format": "float"
}

Request Fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| model | string | Yes | Model ID (e.g., bge-m3, qwen3-embedding-8b) |
| input | string or string[] | Yes | Text to embed (single string or array, max 2048 items) |
| encoding_format | string | No | Format for embeddings: "float" (default) or "base64" |
| dimensions | integer | No | Number of dimensions (model-specific) |
| user | string | No | End-user identifier for abuse monitoring |

Melious Extensions:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| mode | string | No | Routing mode: "balanced", "speed", "price", "quality", or "environment" |
| custom_weights | object | No | Custom routing weights (mutually exclusive with mode) |
| filters | object | No | Hard constraints for provider selection |
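
If you request encoding_format: "base64", each embedding is returned as a base64 string rather than a JSON float array, which is more compact over the wire. A decoding sketch, assuming the little-endian float32 convention used by the OpenAI-style APIs this endpoint mirrors:

import base64
import numpy as np

def decode_base64_embedding(b64: str) -> np.ndarray:
    """Decode a base64-encoded embedding into a float32 vector."""
    raw = base64.b64decode(b64)
    return np.frombuffer(raw, dtype="<f4")  # little-endian 32-bit floats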

Routing Filters (optional):

{
  "filters": {
    "countries": ["NL", "FR", "DE"],
    "max_input_cost": 1.0,
    "max_carbon_intensity": 300,
    "min_speed_tps": 500
  }
}

Response (200 OK):

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.023, -0.015, 0.042, ...],
      "index": 0
    }
  ],
  "model": "qwen3-embedding-8b",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  },
  "environment_impact": {
    "energy_kwh": 0.0001,
    "carbon_g_co2": 0.04,
    "water_liters": 0.0001,
    "renewable_percent": 95,
    "pue": 1.15,
    "provider_id": "nebius",
    "location": "NL"
  }
}

Response Fields:

| Field | Type | Description |
|-------|------|-------------|
| object | string | Always "list" |
| data | array | Array of embedding objects |
| data[].object | string | Always "embedding" |
| data[].embedding | float[] | Vector representation (dimensions vary by model) |
| data[].index | integer | Index in the input array |
| model | string | Model used for generation |
| usage | object | Token usage statistics |
| usage.prompt_tokens | integer | Input tokens processed |
| usage.total_tokens | integer | Total tokens (same as prompt_tokens for embeddings) |
| environment_impact | object | Environmental metrics (Melious extension) |
| environment_impact.energy_kwh | float | Energy consumed in kilowatt-hours |
| environment_impact.carbon_g_co2 | float | CO2 emissions in grams |
| environment_impact.water_liters | float | Water consumption in liters |
| environment_impact.renewable_percent | integer | Renewable electricity percentage (0-100) |
| environment_impact.pue | float | Power Usage Effectiveness of the data center |
| environment_impact.provider_id | string | Provider ID used |
| environment_impact.location | string | Country code (ISO 3166-1 alpha-2) |
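
Because every response includes environment_impact, you can accumulate the metrics across requests to report a job-level footprint (a minimal sketch; the field names are from the response above):

def accumulate_impact(totals: dict, response_json: dict) -> dict:
    """Add one response's environment_impact metrics to running totals."""
    impact = response_json.get("environment_impact", {})
    for key in ("energy_kwh", "carbon_g_co2", "water_liters"):
        totals[key] = totals.get(key, 0.0) + impact.get(key, 0.0)
    return totals

# Usage: keep a totals dict and call accumulate_impact(totals, data) after each request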

Status Codes:

| Code | Description |
|------|-------------|
| 200 | Success |
| 400 | Bad request - invalid parameters |
| 401 | Unauthorized - missing or invalid API key |
| 403 | Forbidden - insufficient permissions or energy balance |
| 429 | Rate limit exceeded |
| 500 | Internal server error |

Code Examples

Python:

import httpx
import asyncio

async def create_embeddings():
    """Generate embeddings for text input."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.melious.ai/v1/embeddings",
            headers={"Authorization": "Bearer your_api_key"},
            json={
                "model": "qwen3-embedding-8b",
                "input": "The quick brown fox jumps over the lazy dog"
            }
        )
        data = response.json()
        embedding = data["data"][0]["embedding"]
        print(f"Generated {len(embedding)}-dimensional embedding")
        print(f"CO2 emissions: {data['environment_impact']['carbon_g_co2']:.2f}g")
        return embedding

# Batch embeddings example
async def create_batch_embeddings():
    """Generate embeddings for multiple texts."""
    texts = [
        "The quick brown fox",
        "jumps over the lazy dog",
        "Hello, world!"
    ]

    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.melious.ai/v1/embeddings",
            headers={"Authorization": "Bearer your_api_key"},
            json={
                "model": "qwen3-embedding-8b",
                "input": texts,
                "mode": "price"  # Optimize for lowest cost
            }
        )
        data = response.json()
        embeddings = [item["embedding"] for item in data["data"]]
        print(f"Generated {len(embeddings)} embeddings")
        return embeddings

# Example usage
asyncio.run(create_embeddings())
asyncio.run(create_batch_embeddings())

JavaScript:

// Generate embeddings for text input
const createEmbeddings = async () => {
  const response = await fetch(
    'https://api.melious.ai/v1/embeddings',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer your_api_key',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'qwen3-embedding-8b',
        input: 'The quick brown fox jumps over the lazy dog'
      })
    }
  );

  const data = await response.json();
  const embedding = data.data[0].embedding;
  console.log(`Generated ${embedding.length}-dimensional embedding`);
  console.log(`CO2 emissions: ${data.environment_impact.carbon_g_co2.toFixed(2)}g`);
  return embedding;
};

// Batch embeddings example
const createBatchEmbeddings = async () => {
  const texts = [
    'The quick brown fox',
    'jumps over the lazy dog',
    'Hello, world!'
  ];

  const response = await fetch(
    'https://api.melious.ai/v1/embeddings',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer your_api_key',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'qwen3-embedding-8b',
        input: texts,
        mode: 'price'  // Optimize for lowest cost
      })
    }
  );

  const data = await response.json();
  const embeddings = data.data.map(item => item.embedding);
  console.log(`Generated ${embeddings.length} embeddings`);
  return embeddings;
};

// Example usage
createEmbeddings();
createBatchEmbeddings();

cURL:

# Single text embedding
curl -X POST "https://api.melious.ai/v1/embeddings" \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": "The quick brown fox jumps over the lazy dog"
  }'

# Batch embeddings with routing
curl -X POST "https://api.melious.ai/v1/embeddings" \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": ["Text 1", "Text 2", "Text 3"],
    "mode": "price",
    "filters": {
      "countries": ["NL", "FR", "DE"],
      "max_input_cost": 1.0
    }
  }'

# Long context embedding
curl -X POST "https://api.melious.ai/v1/embeddings" \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": "Your long document text here..."
  }'

Error Handling

Handle errors gracefully by checking status codes and error messages. Implement exponential backoff for transient errors (5xx, 429).
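
A minimal retry sketch along those lines, using httpx (the status codes and retry guidance are from this page; the backoff timing is illustrative):

import asyncio
import httpx

async def embed_with_retry(payload: dict, max_attempts: int = 3) -> dict:
    """POST to /v1/embeddings, retrying 429 and 5xx responses with exponential backoff."""
    async with httpx.AsyncClient() as client:
        for attempt in range(max_attempts):
            response = await client.post(
                "https://api.melious.ai/v1/embeddings",
                headers={"Authorization": "Bearer your_api_key"},
                json=payload,
            )
            if response.status_code == 200:
                return response.json()
            if response.status_code == 429 or response.status_code >= 500:
                await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
                continue
            response.raise_for_status()  # other 4xx errors are not retryable
    raise RuntimeError(f"Request failed after {max_attempts} attempts")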

Common Errors:

| Error Code | Description | Solution |
|------------|-------------|----------|
| AUTH_INVALID_API_KEY | Invalid API key | Verify the API key is correct and active |
| VALIDATION_INVALID_VALUE | Invalid parameter | Check that the request body matches the documentation |
| INFERENCE_PROVIDER_ERROR | Provider request failed | Retry with exponential backoff or change routing mode |
| BILLING_INSUFFICIENT_ENERGY | Not enough energy | Top up your balance or upgrade your plan |
| INFERENCE_NO_PROVIDERS_AVAILABLE | No providers match filters | Relax the filters or use a different routing mode |

Error Response Format:

{
  "status": "error",
  "code": "INFERENCE_PROVIDER_ERROR",
  "message": "All providers failed after 3 attempts",
  "details": {
    "providers_tried": ["nebius", "scaleway"],
    "last_error": "Connection timeout"
  }
}

Best Practices

Batch Your Requests

Process multiple texts in a single request (up to 2048 items) to reduce latency and cost:

# ✅ Efficient: Single batch request
# (`client` here is a pre-configured httpx.AsyncClient; see the Authentication section)
response = await client.post("/v1/embeddings", json={
    "model": "qwen3-embedding-8b",
    "input": ["text1", "text2", "text3", ...]  # Up to 2048 items
})

# ❌ Inefficient: Multiple individual requests
for text in texts:
    response = await client.post("/v1/embeddings", json={
        "model": "qwen3-embedding-8b",
        "input": text
    })
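
For corpora larger than the 2048-item limit, split the input into batches first (a sketch using the shared client from the Authentication section):

MAX_BATCH = 2048  # per-request input limit documented above

async def embed_corpus(texts: list[str]) -> list[list[float]]:
    """Embed an arbitrarily large list of texts in max-size batches."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), MAX_BATCH):
        batch = texts[start:start + MAX_BATCH]
        response = await client.post("/v1/embeddings", json={
            "model": "qwen3-embedding-8b",
            "input": batch,
        })
        embeddings.extend(item["embedding"] for item in response.json()["data"])
    return embeddings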

Choose the Right Model

Balance cost, performance, and quality based on your use case:

| Model | Brand | Context | Best For |
|-------|-------|---------|----------|
| bge-m3 | BAAI | 8K | Multilingual, high performance |
| bge-multilingual-gemma2 | BAAI | 8K | Gemma-based multilingual |
| bge-en-icl | BAAI | 32K | Long-context English |
| bge-large-en-v1.5 | BAAI | 512 | English, high quality |
| bge-base-en-v1.5 | BAAI | 512 | English, balanced |
| qwen3-embedding-8b | Qwen | 32K | Long context |
| e5-mistral-7b-instruct | intfloat | 32K | Instruction-tuned |
| paraphrase-multilingual-mpnet | Sentence Transformers | 512 | Multilingual paraphrasing |

Optimize with Routing

Use Melious routing modes to optimize for your priorities:

# Optimize for lowest cost
response = await client.post("/v1/embeddings", json={
    "model": "qwen3-embedding-8b",
    "input": texts,
    "mode": "price"
})

# Optimize for environmental impact
response = await client.post("/v1/embeddings", json={
    "model": "qwen3-embedding-8b",
    "input": texts,
    "mode": "environment",
    "filters": {
        "max_carbon_intensity": 200,  # g CO2/kWh
        "countries": ["NL", "FR", "DE"]  # European data residency
    }
})

Normalize Embeddings

Normalize embeddings to unit length so that cosine similarity reduces to a simple dot product:

import numpy as np

def normalize_embedding(embedding: np.ndarray) -> np.ndarray:
    """Normalize an embedding vector to unit length."""
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm > 0 else embedding

# Usage
embedding = data["data"][0]["embedding"]
normalized = normalize_embedding(np.array(embedding))

Use Cases

Semantic Search

async def semantic_search(query: str, documents: list[str], top_k: int = 3):
    """Find the most relevant documents using embeddings."""
    # Embed the query and all documents in a single batch request
    # (`client` is the pre-configured httpx.AsyncClient from the Authentication section)
    all_texts = [query] + documents
    response = await client.post("/v1/embeddings", json={
        "model": "qwen3-embedding-8b",
        "input": all_texts
    })

    embeddings = [item["embedding"] for item in response.json()["data"]]
    query_emb = np.array(embeddings[0])
    doc_embs = np.array(embeddings[1:])

    # Calculate cosine similarity
    similarities = np.dot(doc_embs, query_emb) / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
    )

    # Return the top_k most similar documents, best first
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [(documents[i], similarities[i]) for i in top_indices]

Clustering

from sklearn.cluster import KMeans

async def cluster_documents(documents: list[str], n_clusters: int = 5):
    """Cluster documents by semantic similarity."""
    # Generate embeddings
    response = await client.post("/v1/embeddings", json={
        "model": "qwen3-embedding-8b",
        "input": documents,
        "mode": "price"
    })

    embeddings = [item["embedding"] for item in response.json()["data"]]

    # Perform k-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)

    return clusters

Retrieval-Augmented Generation (RAG)

async def rag_search(query: str, knowledge_base: list[str], top_k: int = 3):
    """Retrieve most relevant context for RAG."""
    # Find the most relevant documents
    relevant_docs = await semantic_search(query, knowledge_base, top_k=top_k)

    # Combine the retrieved documents into a single context block
    context = "\n\n".join(doc for doc, _ in relevant_docs)

    # Use context with chat completion
    chat_response = await client.post("/v1/chat/completions", json={
        "model": "gpt-oss-120b",
        "messages": [
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query}
        ]
    })

    return chat_response.json()["choices"][0]["message"]["content"]

Performance

Typical Latencies:

| Request Type | Latency (p50) | Latency (p95) |
|--------------|---------------|---------------|
| Single input | 50-150 ms | 200-400 ms |
| Batch (10 items) | 100-300 ms | 400-800 ms |
| Batch (100 items) | 500-1500 ms | 2-4 s |

Optimization Tips:

  1. Batch requests - Process multiple texts in one request
  2. Choose efficient models - qwen3-embedding-8b for general use, bge-m3 for multilingual
  3. Match context to model - Use short-context models (512 tokens) for short texts
  4. Enable caching - Cache embeddings for frequently used texts (see the sketch after this list)
  5. Use routing - mode: "speed" for lowest latency
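
A minimal in-memory cache keyed by a hash of the text (illustrative; any persistent key-value store follows the same pattern, and `client` is the pre-configured httpx.AsyncClient from the Authentication section):

import hashlib

_embedding_cache: dict[str, list[float]] = {}

async def embed_cached(text: str) -> list[float]:
    """Return a cached embedding when available; otherwise fetch and cache it."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        response = await client.post("/v1/embeddings", json={
            "model": "qwen3-embedding-8b",
            "input": text,
        })
        _embedding_cache[key] = response.json()["data"][0]["embedding"]
    return _embedding_cache[key]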
