OpenAI Embeddings

POST /v1/embeddings returns vector embeddings for one or more input strings. The wire format matches OpenAI's, so any client that already calls client.embeddings.create(...) works against Bodhi by changing base_url and the API key.

Use this for RAG, semantic search, retrieval, classification — anything that needs to reduce text to a fixed-length numeric vector.

Auth

Authorization: Bearer <bodhi-api-token>

Same Bodhi-issued token that works against /v1/chat/completions. See API Tokens.

Model resolution

The model field resolves against Bodhi's combined catalog, exactly like Chat Completions:

  • Local embedding aliases — a GGUF embedding model (e.g. nomic-embed, bge, e5) loaded via llama.cpp, configured as a model alias.
  • Remote embedding API models — provider-hosted embedding endpoints (OpenAI's text-embedding-3-*, Gemini's embedding models, etc.), configured as API models.

Hit GET /v1/models to see what's available. The combined catalog includes embedding models alongside chat models — pick the one you need by name.
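
A minimal sketch of reading the catalog from Python with only the standard library. It assumes the response follows OpenAI's list shape ({"object": "list", "data": [{"id": ...}]}), which this page does not spell out; check Swagger UI for the actual schema. The helper names build_models_request, model_ids, and list_models are hypothetical.

```python
import json
import urllib.request


def build_models_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request for /v1/models."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/models",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def model_ids(payload: dict) -> list[str]:
    """Extract model names, ASSUMING an OpenAI-style list response."""
    return sorted(entry["id"] for entry in payload.get("data", []))


def list_models(base_url: str, token: str) -> list[str]:
    """Send the request and return the sorted model names."""
    with urllib.request.urlopen(build_models_request(base_url, token)) as resp:
        return model_ids(json.load(resp))
```

Against a default local install this would be called as list_models("http://localhost:1135", token), where token is the same Bodhi-issued token used for the other endpoints.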

For setting up remote providers (and choosing one that supports embeddings), see API Models.

Local vs remote — when to pick which

  • Local is a fixed-cost workhorse. Once the model is loaded, embedding 10,000 chunks incurs no per-token charge, only your CPU/GPU time. Good for offline processing, batch jobs, on-prem RAG.
  • Remote is pay-per-token, and top-tier hosted models typically offer higher quality. Good when you need a provider's strong recall, or when you don't want to manage a local embedding model.

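You can mix and match across workloads: for example, a local embedding model for one bulk corpus (cheap, high volume) and a remote one for a smaller collection where recall matters most. Within a single vector index, though, embed documents and queries with the same model, because vectors produced by different models are not comparable. Both options go through the same wire format.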

Examples

curl

curl -X POST http://localhost:1135/v1/embeddings \
  -H "Authorization: Bearer $BODHI_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-embedding-alias",
    "input": "The quick brown fox"
  }'

You can pass a single string or an array of strings as input. Batch sizes are bounded by the underlying model — see Swagger UI for limits.

Python (OpenAI SDK)

import os

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1135/v1",
    api_key=os.environ["BODHI_TOKEN"],
)

resp = client.embeddings.create(
    model="your-embedding-alias",
    input=["First chunk of text", "Second chunk of text"],
)
vectors = [item.embedding for item in resp.data]  # list[list[float]]

The response follows the standard OpenAI embeddings response shape (data[].embedding, usage, etc.). Vector dimensionality is whatever the underlying model produces. Swagger UI documents the response schema; the model card on the provider's side tells you the dimension count.
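
As an illustration of what the returned vectors are typically used for, here is a small sketch that ranks stored vectors against a query vector by cosine similarity. It is plain Python, not part of Bodhi's API; cosine_similarity and rank_by_similarity are hypothetical helper names, and a real deployment would more likely delegate this to a vector store.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0


def rank_by_similarity(query: list[float], vectors: list[list[float]]) -> list[int]:
    """Indices of `vectors`, most similar to `query` first."""
    scores = [cosine_similarity(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
```

The comparison is only meaningful when the query vector and the stored vectors come from the same embedding model.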

Common gotchas

  • Chat model used as embedding model. If you pass a chat model name, the request fails — embedding endpoints reject non-embedding aliases.
  • Mismatched dimensions. If you swap from a 768-dim local model to a 1536-dim remote model in production, your existing vector store will reject the new vectors. Plan migrations carefully.
  • Batch limits. Local llama.cpp has its own batch ceiling; remote providers have theirs. The error message tells you which side rejected.
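
The last two gotchas can be guarded against on the client side. The sketch below splits inputs into fixed-size batches and checks every returned vector against the dimension your vector store expects. The names EmbedFn, batches, and embed_in_batches are hypothetical, and the right batch size is an assumption you need to confirm for your model.

```python
from typing import Callable, Iterator, Sequence

# Hypothetical type: any function that embeds one batch of strings, e.g. a
# thin wrapper around client.embeddings.create(...).
EmbedFn = Callable[[Sequence[str]], list[list[float]]]


def batches(texts: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def embed_in_batches(
    embed: EmbedFn, texts: Sequence[str], batch_size: int, expected_dim: int
) -> list[list[float]]:
    """Embed `texts` batch by batch, failing fast on unexpected dimensions."""
    vectors: list[list[float]] = []
    for batch in batches(texts, batch_size):
        for vec in embed(batch):
            if len(vec) != expected_dim:
                raise ValueError(
                    f"expected {expected_dim}-dim vectors, got {len(vec)}"
                )
            vectors.append(vec)
    return vectors
```

With the client from the Python example, embed could be a lambda that calls client.embeddings.create(model="your-embedding-alias", input=list(batch)) and returns [item.embedding for item in resp.data].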

Full schema

See Swagger UI at http://<your-bodhi-instance>/swagger-ui for the request body, response shape, and provider-specific options. Default local URL: http://localhost:1135/swagger-ui.