Optimize tool discovery

When Virtual MCP Server (vMCP) aggregates many backend MCP servers, the total number of tools exposed to clients can grow quickly. The optimizer addresses this by filtering tools per request, reducing token usage and improving tool selection accuracy.

This guide covers configuration for Kubernetes deployments and local CLI use. For a step-by-step Kubernetes tutorial, see the MCP Optimizer tutorial.

Quick start (Kubernetes)

Step 1: Create an EmbeddingServer

Create an EmbeddingServer with default settings. This deploys a text embeddings inference (TEI) server using the BAAI/bge-small-en-v1.5 model:

embedding-server.yaml
apiVersion: toolhive.stacklok.dev/v1beta1
kind: EmbeddingServer
metadata:
  name: my-embedding
  namespace: toolhive-system
spec: {}

Tip: Wait for the EmbeddingServer to reach the Ready phase before proceeding. The first startup may take a few minutes while the model downloads.

kubectl get embeddingserver my-embedding -n toolhive-system -w

Step 2: Add the embedding reference to VirtualMCPServer

Update your existing VirtualMCPServer to include embeddingServerRef. This is the only change needed: when embeddingServerRef is set, the operator enables the optimizer automatically with default settings. Add an explicit optimizer block only if you want to tune the parameters.

VirtualMCPServer resource
apiVersion: toolhive.stacklok.dev/v1beta1
kind: VirtualMCPServer
metadata:
  name: my-vmcp
  namespace: toolhive-system
spec:
  embeddingServerRef:
    name: my-embedding
  groupRef:
    name: my-group
  incomingAuth:
    type: anonymous

Step 3: Verify

Check that the VirtualMCPServer is ready:

kubectl get virtualmcpserver my-vmcp -n toolhive-system

Look for READY: True in the output. Once ready, clients connecting to the vMCP endpoint see only find_tool and call_tool instead of the full backend toolset.

EmbeddingServer resource

The EmbeddingServer CRD manages the lifecycle of a TEI server, which is the default embedding backend. To point the optimizer at an external OpenAI-compatible embedding service instead, see Use an OpenAI-compatible embedding service below.

An empty spec: {} uses all defaults. The two most important fields you can customize are:

model: The Hugging Face embedding model to use. The default (BAAI/bge-small-en-v1.5) is the tested and recommended model. You can substitute any embedding model available on Hugging Face. See the MTEB leaderboard to compare options.
image: The container image for text-embeddings-inference (TEI). The default is the CPU-only image (ghcr.io/huggingface/text-embeddings-inference:cpu-latest). Swap this for a CUDA-enabled image if you have GPU nodes available.

For the complete field reference, see the EmbeddingServer CRD specification.

ARM64 support

The default TEI image (cpu-latest) is x86_64-only. If you are running on ARM64 nodes (for example, Apple Silicon), override the image in your EmbeddingServer:

embedding-server.yaml
apiVersion: toolhive.stacklok.dev/v1beta1
kind: EmbeddingServer
metadata:
  name: my-embedding
  namespace: toolhive-system
spec:
  image: ghcr.io/huggingface/text-embeddings-inference:cpu-arm64-latest

Use an OpenAI-compatible embedding service

Instead of running a managed TEI EmbeddingServer, you can point the optimizer at an external service that speaks the OpenAI /embeddings API, such as OpenAI itself, Azure OpenAI, or another OpenAI-compatible gateway. Use this when you already operate a centralized embedding service and don't want a second copy running per vMCP, or when you need a hosted model.

Set embeddingProvider: openai under spec.config.optimizer and configure embeddingService and embeddingModel directly. Do not set embeddingServerRef; the operator rejects combining the two at admission.

VirtualMCPServer resource
apiVersion: toolhive.stacklok.dev/v1beta1
kind: VirtualMCPServer
metadata:
  name: optimizer-vmcp
  namespace: toolhive-system
spec:
  groupRef:
    name: my-group
  config:
    optimizer:
      embeddingProvider: openai
      embeddingService: http://llm-gateway.default.svc.cluster.local:8080/v1
      embeddingModel: text-embedding-3-small
      embeddingServiceTimeout: 15s
  incomingAuth:
    type: anonymous

embeddingService is the base URL of the OpenAI-compatible endpoint; /embeddings is appended automatically. embeddingModel is the model name passed in each request and is required for the openai provider (the tei provider ignores it, because the model is fixed by the TEI container).
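
To make the URL handling concrete, here is a minimal Python sketch of how a client could derive the request URL and body from the two fields above. It is an illustration, not the vMCP implementation: the function names are hypothetical, stripping a trailing slash is an assumption, and the body follows the usual OpenAI /embeddings shape (a model name plus a list of inputs).

```python
import json


def embeddings_url(embedding_service: str) -> str:
    """Append /embeddings to the configured base URL.

    Stripping a trailing slash first is an assumption of this sketch.
    """
    return embedding_service.rstrip("/") + "/embeddings"


def embeddings_body(embedding_model: str, texts: list[str]) -> str:
    """Build a JSON body in the OpenAI /embeddings shape."""
    return json.dumps({"model": embedding_model, "input": texts})


url = embeddings_url("http://llm-gateway.default.svc.cluster.local:8080/v1")
body = embeddings_body("text-embedding-3-small", ["Create a GitHub issue"])
print(url)
print(body)
```

The sketch shows why embeddingService should stop at the API root (for example, /v1): the path segment is added for you.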

The API key for the embedding service is read from the OPENAI_API_KEY environment variable on the vmcp container, never from the CRD spec or ConfigMap. Inject it from a Secret via podTemplateSpec:

VirtualMCPServer resource (excerpt)
spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: embedding-api-key
                  key: apiKey

Omit the env var entirely if your gateway is keyless (for example, an in-cluster LLM gateway that authenticates by network position). An empty key omits the Authorization header.
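
The keyless behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not vMCP code: the function name is hypothetical, and the Bearer scheme is what the OpenAI API expects rather than something this guide states about vMCP.

```python
import os
from typing import Mapping


def auth_headers(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the Authorization header, or nothing when no key is set."""
    key = environ.get("OPENAI_API_KEY", "")
    if not key:
        # Unset or empty key: send no Authorization header at all.
        return {}
    # Assumption: the key is sent as a Bearer token.
    return {"Authorization": f"Bearer {key}"}


print(auth_headers({}))
print(auth_headers({"OPENAI_API_KEY": "sk-example"}))
```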

If your embedding gateway needs additional HTTP headers on every request (for routing, tenant scoping, or caching), add them under optimizer.embeddingHeaders. Header values are stored in plain text on the resource and in the generated ConfigMap, so use this only for non-secret values:

VirtualMCPServer resource
spec:
  config:
    optimizer:
      embeddingProvider: openai
      embeddingService: http://llm-gateway.default.svc.cluster.local:8080/v1
      embeddingModel: text-embedding-3-small
      embeddingHeaders:
        X-Tenant-Id: acme
        X-Cache-Scope: embeddings-prod

embeddingHeaders is only accepted when embeddingProvider is openai (the managed TEI path ignores it). Header names must be valid RFC 7230 tokens. Authorization and Content-Type are reserved: the client sets Authorization from OPENAI_API_KEY and sends Content-Type itself, and neither can be overridden through embeddingHeaders.
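
If you generate embeddingHeaders from a script, a pre-flight check along these lines can catch problems before you apply the resource. This is a hedged sketch, not the operator's validation logic: the function name is hypothetical, the regular expression encodes the RFC 7230 token characters, and treating reserved names case-insensitively is an assumption based on HTTP header names being case-insensitive.

```python
import re

# RFC 7230 "token": one or more tchar characters.
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")
RESERVED = {"authorization", "content-type"}


def validate_embedding_headers(headers: dict[str, str]) -> list[str]:
    """Return a list of problems found in the header names."""
    problems = []
    for name in headers:
        if not TOKEN.fullmatch(name):
            problems.append(f"invalid header name: {name!r}")
        elif name.lower() in RESERVED:
            problems.append(f"reserved header: {name}")
    return problems


print(validate_embedding_headers({"X-Tenant-Id": "acme"}))
print(validate_embedding_headers({"Authorization": "token", "Bad Header": "x"}))
```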

Inputs are not truncated

Unlike the TEI backend, the OpenAI API does not silently truncate over-long inputs. A tool description that exceeds the model's context window causes the request to fail with an error rather than being truncated.

When embeddingProvider is omitted, the optimizer defaults to tei and your existing TEI-based configuration continues to work unchanged.

Local mode (CLI)

You can enable the optimizer directly from the thv vmcp CLI without a Kubernetes cluster.

Tier 1 — keyword search

Tier 1 uses FTS5 full-text search running in-process. No external service or container is required:

thv vmcp serve --group my-group --optimizer

Or add it to an existing config file:

vmcp.yaml
optimizer: {}

Then start the server with:

thv vmcp serve --config vmcp.yaml

Tier 2 — managed TEI container

Tier 2 adds vector similarity search on top of keyword search. ToolHive automatically starts and stops a Hugging Face Text Embeddings Inference (TEI) container. A container runtime (Docker, Podman, or OrbStack) must be available:

thv vmcp serve --group my-group --optimizer-embedding

To customize the model or image used for the auto-managed container:

thv vmcp serve --group my-group --optimizer-embedding \
  --embedding-model BAAI/bge-small-en-v1.5 \
  --embedding-image ghcr.io/huggingface/text-embeddings-inference:cpu-latest

Tier 3 — external embedding service

Tier 3 uses an embedding server you already manage. No container runtime is required. Set embeddingService in your existing config file to point at the server:

vmcp.yaml
optimizer:
  embeddingService: http://127.0.0.1:8090

Then start the server with:

thv vmcp serve --config vmcp.yaml

For the full optimizer tier comparison, see the local CLI guide.

Benefits

Reduced token usage: Only relevant tools are included in context, not the entire toolset
Improved tool selection: The right tools surface for each query. With fewer tools to reason over, agents are more likely to choose correctly

How it works

1. You send a prompt that requires tool assistance
2. The AI calls find_tool with keywords extracted from the prompt
3. vMCP performs hybrid semantic and keyword search across all backend tools
4. Only the most relevant tools (up to 8 by default) are returned
5. The AI calls call_tool to execute the selected tool, and vMCP routes the request to the appropriate backend
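
The flow above can be mimicked with a toy in-memory model. Everything in this Python sketch is hypothetical: the tool catalog, the function signatures, and the word-overlap scoring are stand-ins chosen for illustration, and the real find_tool and call_tool use the hybrid search described in this guide.

```python
# Hypothetical catalog standing in for the aggregated backend tools.
TOOLS = {
    "github_create_issue": {
        "description": "Create an issue in a GitHub repository",
        "backend": "github",
    },
    "github_list_pull_requests": {
        "description": "List pull requests in a GitHub repository",
        "backend": "github",
    },
    "slack_post_message": {
        "description": "Post a message to a Slack channel",
        "backend": "slack",
    },
}


def find_tool(query: str, max_tools: int = 8) -> list[str]:
    """Rank tools by how many query words appear in the name or description."""
    words = set(query.lower().split())
    scored = []
    for name, tool in TOOLS.items():
        text = (name.replace("_", " ") + " " + tool["description"]).lower()
        score = len(words & set(text.split()))
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:max_tools]]


def call_tool(name: str, arguments: dict) -> str:
    """Route the call to the backend that owns the tool."""
    backend = TOOLS[name]["backend"]
    return f"routed {name} to {backend} with {arguments}"


matches = find_tool("create issue")
print(matches)
print(call_tool(matches[0], {"title": "Bug report"}))
```

The point of the sketch is the shape of the interaction: the client only ever sees two tools, and the full catalog stays on the server side.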

How search works internally

The optimizer uses an internal SQLite database for both keyword search (using full-text search) and storing semantic vectors. Keyword search runs locally against this database; semantic search uses vectors generated by an embedding server. To control how results from these two sources are blended, see the parameter reference.
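
One way to picture the blend is a weighted sum of the two score sets. The Python sketch below is only a mental model: this guide does not specify the actual ranking formula, so the weighted sum, the assumption that both score sets are normalized to the range 0 to 1, and the function name are all illustrative. The ratio is parsed from a string because that is how hybridSearchSemanticRatio is serialized.

```python
def blend(
    keyword_scores: dict[str, float],
    semantic_scores: dict[str, float],
    semantic_ratio: str,
) -> list[str]:
    """Order tools by a weighted sum of keyword and semantic scores."""
    ratio = float(semantic_ratio)
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    names = set(keyword_scores) | set(semantic_scores)
    combined = {
        name: (1.0 - ratio) * keyword_scores.get(name, 0.0)
        + ratio * semantic_scores.get(name, 0.0)
        for name in names
    }
    # Best combined score first; ties broken by name.
    return sorted(combined, key=lambda name: (-combined[name], name))


keyword = {"a": 1.0, "b": 0.2}
semantic = {"b": 0.9, "c": 0.8}
for ratio in ("0.0", "0.5", "1.0"):
    print(ratio, blend(keyword, semantic, ratio))
```

At "0.0" only the keyword scores matter, at "1.0" only the semantic scores, and values in between trade one off against the other.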

Tune the optimizer

To customize optimizer behavior, add the optimizer block under spec.config in your VirtualMCPServer resource:

VirtualMCPServer resource
spec:
  groupRef:
    name: my-group
  config:
    optimizer:
      embeddingServiceTimeout: 30s
      maxToolsToReturn: 8
      hybridSearchSemanticRatio: '0.5'
      semanticDistanceThreshold: '1.0'

Parameter reference

| Field | Type | Description |
| --- | --- | --- |
| `embeddingHeaders` | `map<string, string>` | EmbeddingHeaders holds additional HTTP headers sent with every embedding request. Only supported when EmbeddingProvider is "openai". Values are stored in plain text and must not contain secrets; Authorization (derived from OPENAI_API_KEY) and Content-Type cannot be set. |
| `embeddingModel` | `string` | EmbeddingModel is the model name requested from the embedding service (e.g. "text-embedding-3-small"). Required when EmbeddingProvider is "openai". Ignored for the "tei" provider, where the model is fixed by the running TEI container. The API key for an OpenAI-compatible service is not configured here: it is read from the OPENAI_API_KEY environment variable so the secret never lands in a CRD spec or ConfigMap. An empty key omits the Authorization header, which supports keyless in-cluster gateways. |
| `embeddingProvider` | `string` | EmbeddingProvider selects the wire protocol used to talk to the embedding service. "tei" speaks the HuggingFace Text Embeddings Inference API; "openai" speaks the OpenAI-compatible /embeddings API, which lets the optimizer use OpenAI, Azure OpenAI, or another OpenAI-compatible gateway. Defaults to "tei" when empty. The "openai" provider reads EmbeddingService directly and cannot be combined with EmbeddingServerRef, which provisions a managed TEI server; the operator rejects that combination at admission. default `"tei"` · enum: `tei` \| `openai` |
| `embeddingServiceTimeout` | `string` | EmbeddingServiceTimeout is the HTTP request timeout for calls to the embedding service. Defaults to 30s if not specified. default `"30s"` · pattern `^([0-9]+(\.[0-9]+)?(ns\|us\|µs\|ms\|s\|m\|h))+$` |
| `hybridSearchSemanticRatio` | `string` | HybridSearchSemanticRatio controls the balance between semantic (meaning-based) and keyword search results. 0.0 = all keyword, 1.0 = all semantic. Defaults to "0.5" if not specified or empty. Serialized as a string because CRDs do not support float types portably. pattern `^([0-9]*[.])?[0-9]+$` |
| `maxToolsToReturn` | `integer` | MaxToolsToReturn is the maximum number of tool results returned by a search query. Defaults to 8 if not specified or zero. min `1` · max `50` |
| `semanticDistanceThreshold` | `string` | SemanticDistanceThreshold is the maximum distance for semantic search results. Results exceeding this threshold are filtered out from semantic search. This threshold does not apply to keyword search. Range: 0 = identical, 2 = completely unrelated. Defaults to "1.0" if not specified or empty. Serialized as a string because CRDs do not support float types portably. pattern `^([0-9]*[.])?[0-9]+$` |
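
The patterns and bounds in the table can be checked locally before you apply a manifest. The following Python sketch is a hypothetical helper, not part of ToolHive: it takes the optimizer block as a plain dictionary and reuses the regular expressions from the table (with the table's pipe escaping removed). It checks format only and does not enforce the 0.0 to 1.0 meaning of the ratio.

```python
import re

# Patterns copied from the parameter reference.
DURATION = re.compile(r"([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+")
DECIMAL = re.compile(r"([0-9]*[.])?[0-9]+")


def check_optimizer(config: dict) -> list[str]:
    """Return format problems found in an optimizer config dictionary."""
    problems = []
    timeout = config.get("embeddingServiceTimeout", "30s")
    if not isinstance(timeout, str) or not DURATION.fullmatch(timeout):
        problems.append(f"embeddingServiceTimeout: bad duration {timeout!r}")
    max_tools = config.get("maxToolsToReturn", 8)
    if not isinstance(max_tools, int) or not 1 <= max_tools <= 50:
        problems.append(f"maxToolsToReturn: {max_tools!r} not in 1..50")
    for field in ("hybridSearchSemanticRatio", "semanticDistanceThreshold"):
        value = config.get(field)
        if value is None:
            continue  # Omitted fields fall back to their defaults.
        # These fields are strings in the CRD, so quote them in YAML.
        if not isinstance(value, str) or not DECIMAL.fullmatch(value):
            problems.append(f"{field}: {value!r} is not a decimal string")
    return problems


print(check_optimizer({"embeddingServiceTimeout": "15s", "maxToolsToReturn": 10}))
print(check_optimizer({"embeddingServiceTimeout": "30", "maxToolsToReturn": 51}))
```

A common slip this catches is writing hybridSearchSemanticRatio: 0.5 without quotes, which YAML parses as a float rather than the string the CRD expects.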

Kubernetes: EmbeddingServer is required for the default TEI provider

When using the Kubernetes operator with the default tei embedding provider, the optimizer requires a configured EmbeddingServer even if you set hybridSearchSemanticRatio to "0.0" (all keyword search). The EmbeddingServer isn't used at runtime when the semantic ratio is 0.0, but the configuration must be present because of how the operator wires the resources internally.

This restriction doesn't apply when you set optimizer.embeddingService directly, such as with the OpenAI-compatible provider; the operator only requires embeddingServerRef when no manual embedding service is configured.

This restriction also does not apply to local CLI mode. thv vmcp serve --optimizer runs keyword-only search with no EmbeddingServer and no container.

Tuning guidance

The defaults are well-tested and work for most use cases. If you do need to adjust them:

Lower semanticDistanceThreshold (for example, "0.6") for higher precision: only very close matches are returned
Raise semanticDistanceThreshold (for example, "1.4") for higher recall: broader matches are included
Increase maxToolsToReturn if the AI frequently cannot find the right tool; decrease it to save tokens
Adjust hybridSearchSemanticRatio toward "1.0" if tool names are not descriptive, or toward "0.0" if exact keyword matching is more useful
semanticDistanceThreshold filtering is applied before the maxToolsToReturn cap. A low threshold can filter out candidates before the cap takes effect, so you may need to raise the threshold if too few results are returned
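
The interaction between the threshold and the cap is easier to see with numbers. This Python sketch assumes a hypothetical set of semantic distances and applies the two steps in the order described above; the function name and data are illustrative, not taken from vMCP.

```python
def select(distances: dict[str, float], threshold: float, max_tools: int) -> list[str]:
    """Filter by distance threshold first, then apply the result cap."""
    kept = [(dist, name) for name, dist in distances.items() if dist <= threshold]
    kept.sort()  # Closest matches first.
    return [name for _, name in kept[:max_tools]]


# Hypothetical distances between a query and four tools.
distances = {"a": 0.3, "b": 0.7, "c": 0.9, "d": 1.2}

for threshold in (0.6, 1.0, 1.4):
    print(threshold, select(distances, threshold, max_tools=3))
```

With a threshold of 0.6 only one tool survives the filter, so raising maxToolsToReturn would not help; raising the threshold does.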

Complete example

This example shows a fuller configuration, including high availability for the embedding server, persistent model caching, and tuned optimizer parameters.

The EmbeddingServer runs two replicas with resource limits and a persistent volume for model caching, so restarts don't re-download the model:

embedding-server-full.yaml
apiVersion: toolhive.stacklok.dev/v1beta1
kind: EmbeddingServer
metadata:
  name: full-embedding
  namespace: toolhive-system
spec:
  replicas: 2
  resources:
    requests:
      cpu: '500m'
      memory: '512Mi'
    limits:
      cpu: '2'
      memory: '1Gi'
  modelCache:
    enabled: true
    size: 5Gi

The VirtualMCPServer uses a shorter embedding timeout (15s) because the EmbeddingServer is co-located and responds with low latency. Increase this value if the embedding service is remote or under high load:

vmcp-with-optimizer.yaml
apiVersion: toolhive.stacklok.dev/v1beta1
kind: VirtualMCPServer
metadata:
  name: full-vmcp
  namespace: toolhive-system
spec:
  groupRef:
    name: my-tools
  embeddingServerRef:
    name: full-embedding
  config:
    optimizer:
      embeddingServiceTimeout: 15s
      maxToolsToReturn: 10
      hybridSearchSemanticRatio: '0.6'
      semanticDistanceThreshold: '0.8'
  incomingAuth:
    type: oidc
    oidcConfigRef:
      name: my-oidc
      audience: vmcp-example

Next steps

Run tool calls server-side with code mode to collapse multi-tool workflows into one round-trip; code mode composes with the optimizer
Configure failure handling for circuit breakers and partial failure modes
Monitor vMCP activity with OpenTelemetry tracing and metrics

Related information

MCP Optimizer tutorial - end-to-end Kubernetes setup
Optimizing LLM context - background on tool filtering and context pollution
Configure vMCP servers
Run tool calls server-side with code mode
EmbeddingServer CRD specification
Virtual MCP Server overview - conceptual overview of vMCP
VirtualMCPServer CRD specification
