stacklok · aponcedeleonch · Mar 9, 2026 · Mar 6, 2026 · Mar 9, 2026
diff --git a/docs/toolhive/concepts/tool-optimization.mdx b/docs/toolhive/concepts/tool-optimization.mdx
@@ -257,3 +257,4 @@ how to configure them:
 - [Customize tools (Kubernetes)](../guides-k8s/customize-tools.mdx)
 - [MCPToolConfig CRD reference](../reference/crd-spec.md)
 - [Virtual MCP Server tool aggregation](../guides-vmcp/tool-aggregation.mdx)
+- [Optimize tool discovery in vMCP](../guides-vmcp/optimizer.mdx)
diff --git a/docs/toolhive/concepts/vmcp.mdx b/docs/toolhive/concepts/vmcp.mdx
@@ -31,6 +31,8 @@ vMCP delivers four key benefits:
 3. **Improve security**: Centralized authentication and authorization with a
    two-boundary model
 4. **Enable reusability**: Define workflows once, use them everywhere
+5. **Optimize tool discovery**: Reduce token usage by replacing all tool
+   definitions with two lightweight search-and-call primitives
 
 ## Key capabilities
 
@@ -165,4 +167,5 @@ teams managing multiple MCP servers.
 - [Configure authentication](../guides-vmcp/authentication.mdx)
 - [Tool aggregation and conflict resolution](../guides-vmcp/tool-aggregation.mdx)
 - [Composite tools and workflows](../guides-vmcp/composite-tools.mdx)
+- [Optimize tool discovery](../guides-vmcp/optimizer.mdx)
 - [Proxy remote MCP servers](../guides-k8s/remote-mcp-proxy.mdx)
diff --git a/docs/toolhive/guides-vmcp/configuration.mdx b/docs/toolhive/guides-vmcp/configuration.mdx
@@ -279,6 +279,8 @@ Backend discovery guide.
 
 ## Next steps
 
+- [Optimize tool discovery](./optimizer.mdx) by adding an `embeddingServerRef`
+  to reduce token usage across many backends
 - Review [scaling and performance guidance](./scaling-and-performance.mdx) for
   resource planning
 - Discover your deployed MCP servers automatically using the
@@ -292,6 +294,7 @@ Backend discovery guide.
 - [Scaling and Performance](./scaling-and-performance.mdx)
 - [Backend discovery modes](./backend-discovery.mdx)
 - [Tool aggregation](./tool-aggregation.mdx)
+- [Optimize tool discovery](./optimizer.mdx)
 - [Composite tools](./composite-tools.mdx)
 - [Authentication](./authentication.mdx)
 - [Proxy remote MCP servers](../guides-k8s/remote-mcp-proxy.mdx)
diff --git a/docs/toolhive/guides-vmcp/intro.mdx b/docs/toolhive/guides-vmcp/intro.mdx
@@ -36,6 +36,10 @@ for details on the current limitations.
 - **Centralized authentication**: Single sign-on with per-backend token exchange
 - **Composite workflows**: Multi-step operations across backend MCP servers with
   parallel execution, approval gates, and error handling
+- **Tool optimization**: Replace all individual tool definitions with two
+  lightweight primitives (`find_tool` and `call_tool`) to reduce token usage and
+  improve tool selection. See [Optimize tool discovery](./optimizer.mdx) and the
+  underlying [concepts](../concepts/tool-optimization.mdx)
 
 ## When to use vMCP
 
@@ -46,6 +50,8 @@ for details on the current limitations.
 - You have centralized authentication and authorization requirements
 - You need reusable workflow definitions
 - You want to aggregate external SaaS MCP servers with internal tools
+- You want to reduce token usage and improve tool selection accuracy across many
+  backends with the [optimizer](./optimizer.mdx)
 
 ### Not needed
 
@@ -87,9 +93,21 @@ flowchart TB
 5. Clients connect to the VirtualMCPServer endpoint and see a unified view of
    all tools from both local and remote backends
 
+## Optimize tool discovery
+
+As the number of aggregated backends grows, clients receive a large number of
+tool definitions that consume tokens and can degrade tool selection accuracy.
+The vMCP optimizer addresses this by replacing all individual tool definitions
+with two lightweight primitives (`find_tool` and `call_tool`) and using hybrid
+semantic and keyword search to surface only the most relevant tools per request.
+To enable the optimizer, add an `embeddingServerRef` to your VirtualMCPServer
+resource. See [Optimize tool discovery](./optimizer.mdx) for the full setup
+guide.
+
 ## Related information
 
 - [Quickstart: Virtual MCP Server](./quickstart.mdx)
 - [Understanding Virtual MCP Server](../concepts/vmcp.mdx)
+- [Optimize tool discovery](./optimizer.mdx)
 - [Scaling and Performance](./scaling-and-performance.mdx)
 - [Proxy remote MCP servers](../guides-k8s/remote-mcp-proxy.mdx)
diff --git a/docs/toolhive/guides-vmcp/optimizer.mdx b/docs/toolhive/guides-vmcp/optimizer.mdx
@@ -0,0 +1,317 @@
+---
+title: Optimize tool discovery
+description:
+  Enable the optimizer in vMCP to reduce token usage and improve tool selection
+  across aggregated backends.
+---
+
+When Virtual MCP Server (vMCP) aggregates many backend MCP servers, the total
+number of tools exposed to clients can grow quickly. The optimizer addresses
+this by filtering tools per request, reducing token usage and improving tool
+selection accuracy.
+
+For the desktop/CLI approach using the MCP Optimizer container, see the
+[MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx). This guide covers the
+Kubernetes operator approach using VirtualMCPServer and EmbeddingServer CRDs.
+
+## Benefits
+
+- **Reduced token usage**: Only relevant tools are included in context, not the
+  entire toolset
+- **Improved tool selection**: The right tools surface for each query. With
+  fewer tools to reason over, agents are more likely to choose correctly
+
+## How it works
+
+1. You send a prompt that requires tool assistance
+2. The AI calls `find_tool` with keywords extracted from the prompt
+3. vMCP performs hybrid semantic and keyword search across all backend tools
+4. Only the most relevant tools (up to 8 by default) are returned
+5. The AI calls `call_tool` to execute the selected tool, and vMCP routes the
+   request to the appropriate backend
+
+```mermaid
+flowchart TB
+  subgraph vmcpGroup["VirtualMCPServer"]
+    direction TB
+    vmcp["vMCP (optimizer enabled)"]
+  end
+  subgraph embedding["EmbeddingServer"]
+    direction TB
+    tei["Text Embeddings Inference"]
+  end
+  subgraph backends["MCPGroup backends"]
+    direction TB
+    mcp1["MCP server"]
+    mcp2["MCP server"]
+    mcp3["MCP server"]
+  end
+
+  client(["Client"]) <-- "find_tool / call_tool" --> vmcpGroup
+  vmcp <-. "semantic search" .-> embedding
+  vmcp <-. "discovers / routes" .-> backends
+```
+
+:::info[How search works internally]
+
+The optimizer uses an internal SQLite database for both keyword search (using
+full-text search) and storing semantic vectors. Keyword search runs locally
+against this database; semantic search uses vectors generated by an embedding
+server. You can control how results from these two sources are blended — see the
+[parameter reference](#parameter-reference) for details.
+
+:::
+
+## Quick start
+
+### Step 1: Create an EmbeddingServer
+
+Create an EmbeddingServer with default settings. This deploys a text embeddings
+inference (TEI) server using the `BAAI/bge-small-en-v1.5` model:
+
+```yaml title="embedding-server.yaml"
+apiVersion: toolhive.stacklok.dev/v1alpha1
+kind: EmbeddingServer
+metadata:
+  name: my-embedding
+  namespace: toolhive-system
+spec: {}
+```
+
+:::tip
+
+Wait for the EmbeddingServer to reach the `Running` phase before proceeding. The
+first startup may take a few minutes while the model downloads.
+
+```bash
+kubectl get embeddingserver my-embedding -n toolhive-system -w
+```
+
+:::
+
+### Step 2: Add the embedding reference to VirtualMCPServer
+
+Update your existing VirtualMCPServer to include `embeddingServerRef`. **This is
+the only change needed to enable the optimizer.** When you set
+`embeddingServerRef`, the operator automatically enables the optimizer with
+sensible defaults. You only need to add an explicit `optimizer` block if you
+want to [tune the parameters](#tune-the-optimizer).
+
+```yaml title="VirtualMCPServer resource"
+apiVersion: toolhive.stacklok.dev/v1alpha1
+kind: VirtualMCPServer
+metadata:
+  name: my-vmcp
+  namespace: toolhive-system
+spec:
+  # highlight-start
+  embeddingServerRef:
+    name: my-embedding
+  # highlight-end
+  config:
+    groupRef: my-group
+  incomingAuth:
+    type: anonymous
+```
+
+### Step 3: Verify
+
+Check that the VirtualMCPServer is ready:
+
+```bash
+kubectl get virtualmcpserver my-vmcp -n toolhive-system
+```
+
+Look for `READY: True` in the output. Once ready, clients connecting to the vMCP
+endpoint see only `find_tool` and `call_tool` instead of the full backend
+toolset.
+
+## EmbeddingServer resource
+
+The EmbeddingServer CRD manages the lifecycle of a TEI server. An empty
+`spec: {}` uses all defaults. The two most important fields you can customize
+are:
+
+- **`model`**: The Hugging Face embedding model to use. The default
+  (`BAAI/bge-small-en-v1.5`) is the tested and recommended model. You can
+  substitute any embedding model available on Hugging Face — see the
+  [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) to compare
+  options.
+- **`image`**: The container image for
+  [text-embeddings-inference](https://github.com/huggingface/text-embeddings-inference)
+  (TEI). The default is the CPU-only image
+  (`ghcr.io/huggingface/text-embeddings-inference:cpu-latest`). Swap this for a
+  CUDA-enabled image if you have GPU nodes available.
+
+For the complete field reference, see the
+[EmbeddingServer CRD specification](../reference/crd-spec.md#apiv1alpha1embeddingserver).
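+
+As a minimal sketch of overriding both fields, the following EmbeddingServer
+selects a different model and pins the TEI image tag. It assumes `model` sits
+directly under `spec`, like `image`. `BAAI/bge-base-en-v1.5` is only an
+illustrative choice and has not been validated with the optimizer; the
+`cpu-1.7` tag is the one used in the ARM64 workaround in this guide.
+
+```yaml title="embedding-server-custom.yaml"
+apiVersion: toolhive.stacklok.dev/v1alpha1
+kind: EmbeddingServer
+metadata:
+  name: my-embedding
+  namespace: toolhive-system
+spec:
+  # Illustrative alternative model (assumption: untested with the optimizer)
+  model: BAAI/bge-base-en-v1.5
+  # Pinned CPU image tag instead of the default cpu-latest
+  image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.7
+```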
+
+:::warning[ARM64 compatibility]
+
+The default TEI CPU images depend on Intel MKL, which is x86_64-only. No
+official ARM64 images exist yet. On ARM64 nodes (including Apple Silicon with
+kind), you can run the amd64 image under emulation as a workaround.
+
+First, pull the amd64 image and load it into your cluster:
+
+```bash
+docker pull --platform linux/amd64 \
+  ghcr.io/huggingface/text-embeddings-inference:cpu-1.7
+kind load docker-image \
+  ghcr.io/huggingface/text-embeddings-inference:cpu-1.7
+```
+
+The `kind load` command is specific to kind. For other cluster distributions,
+use the equivalent image-loading mechanism (for example, `ctr images import` for
+containerd, or push the image to a registry your cluster can pull from).
+
+Then, pin the image in your EmbeddingServer so the operator uses the pre-pulled
+tag instead of the default `cpu-latest`:
+
+```yaml title="embedding-server.yaml"
+apiVersion: toolhive.stacklok.dev/v1alpha1
+kind: EmbeddingServer
+metadata:
+  name: my-embedding
+  namespace: toolhive-system
+spec:
+  image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.7
+```
+
+Native ARM64 support is in progress upstream. Track the
+[TEI GitHub repository](https://github.com/huggingface/text-embeddings-inference)
+for updates.
+
+:::
+
+## Tune the optimizer
+
+To customize optimizer behavior, add the `optimizer` block under `spec.config`
+in your VirtualMCPServer resource:
+
+```yaml title="VirtualMCPServer resource"
+spec:
+  config:
+    groupRef: my-group
+    # highlight-start
+    optimizer:
+      embeddingServiceTimeout: 30s
+      maxToolsToReturn: 8
+      hybridSearchSemanticRatio: '0.5'
+      semanticDistanceThreshold: '1.0'
+    # highlight-end
+```
+
+### Parameter reference
+
+| Parameter                   | Description                                                                                                                                              | Default |
+| --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
+| `embeddingServiceTimeout`   | HTTP request timeout for calls to the embedding service                                                                                                  | `30s`   |
+| `maxToolsToReturn`          | Maximum number of tools returned per search (1-50)                                                                                                       | `8`     |
+| `hybridSearchSemanticRatio` | Balance between semantic and keyword search. `0.0` = all keyword, `1.0` = all semantic. Default gives equal weight to both.                              | `"0.5"` |
+| `semanticDistanceThreshold` | Maximum distance from the search term for semantic results. `0` = identical, `2` = completely unrelated. Results beyond this threshold are filtered out. | `"1.0"` |
+
+:::note
+
+`hybridSearchSemanticRatio` and `semanticDistanceThreshold` are string-encoded
+floats, so quote them (for example, `"0.5"`, not `0.5`). Kubernetes API
+conventions discourage float fields in CRDs because they are not portable.
+
+:::
+
+:::info[EmbeddingServer is always required]
+
+Even if you set `hybridSearchSemanticRatio` to `"0.0"` (all keyword search), the
+optimizer still requires a configured EmbeddingServer. The EmbeddingServer won't
+be used at runtime when the semantic ratio is `0.0`, but the configuration must
+be present due to how the optimizer is wired internally.
+
+:::
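+
+For example, a keyword-only setup still carries the embedding reference. This
+sketch reuses the `my-embedding` and `my-group` names from the quick start and
+leaves the other optimizer parameters at their documented defaults:
+
+```yaml title="VirtualMCPServer resource"
+spec:
+  # Still required, even though semantic search is not used at runtime
+  embeddingServerRef:
+    name: my-embedding
+  config:
+    groupRef: my-group
+    optimizer:
+      embeddingServiceTimeout: 30s
+      maxToolsToReturn: 8
+      # '0.0' = all keyword search
+      hybridSearchSemanticRatio: '0.0'
+      semanticDistanceThreshold: '1.0'
+```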
+
+:::tip[Tuning guidance]
+
+The defaults are well-tested and work for most use cases. If you do need to
+adjust them:
+
+- **Lower `semanticDistanceThreshold`** (for example, `"0.6"`) for higher
+  precision: only very close matches are returned
+- **Raise `semanticDistanceThreshold`** (for example, `"1.4"`) for higher
+  recall: broader matches are included
+- **Increase `maxToolsToReturn`** if the AI frequently cannot find the right
+  tool; decrease it to save tokens
+- **Adjust `hybridSearchSemanticRatio`** toward `"1.0"` if tool names are not
+  descriptive, or toward `"0.0"` if exact keyword matching is more useful
+- `semanticDistanceThreshold` filtering is applied before the `maxToolsToReturn`
+  cap. A low threshold can filter out candidates before the cap takes effect, so
+  you may need to raise the threshold if too few results are returned
+
+:::
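+
+As an illustration of this guidance, a precision-leaning configuration might
+lower both the distance threshold and the result cap. The values below are
+illustrative only, not tested recommendations. If too few tools come back,
+raise the threshold first, since it is applied before the cap:
+
+```yaml title="VirtualMCPServer resource"
+spec:
+  config:
+    groupRef: my-group
+    optimizer:
+      embeddingServiceTimeout: 30s
+      # Smaller cap: fewer tool definitions returned per search
+      maxToolsToReturn: 5
+      hybridSearchSemanticRatio: '0.5'
+      # Tighter threshold: only close semantic matches pass the filter
+      semanticDistanceThreshold: '0.6'
+```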
+
+## Complete example
+
+This example shows a more complete configuration, including high availability
+for the embedding server, persistent model caching, and tuned optimizer
+parameters.
+
+The EmbeddingServer runs two replicas with resource limits and a persistent
+volume for model caching, so restarts don't re-download the model:
+
+```yaml title="embedding-server-full.yaml"
+apiVersion: toolhive.stacklok.dev/v1alpha1
+kind: EmbeddingServer
+metadata:
+  name: full-embedding
+  namespace: toolhive-system
+spec:
+  replicas: 2
+  resources:
+    requests:
+      cpu: '500m'
+      memory: '512Mi'
+    limits:
+      cpu: '2'
+      memory: '1Gi'
+  modelCache:
+    enabled: true
+    storageSize: 5Gi
+```
+
+The VirtualMCPServer uses a shorter embedding timeout (15s) because the
+EmbeddingServer is co-located with vMCP, so request latency is low. Increase
+this value if the embedding service is remote or under high load:
+
+```yaml title="vmcp-with-optimizer.yaml"
+apiVersion: toolhive.stacklok.dev/v1alpha1
+kind: VirtualMCPServer
+metadata:
+  name: full-vmcp
+  namespace: toolhive-system
+spec:
+  embeddingServerRef:
+    name: full-embedding
+  config:
+    groupRef: my-tools
+    optimizer:
+      embeddingServiceTimeout: 15s
+      maxToolsToReturn: 10
+      hybridSearchSemanticRatio: '0.6'
+      semanticDistanceThreshold: '0.8'
+  incomingAuth:
+    type: oidc
+    oidcConfig:
+      type: inline
+      inline:
+        issuer: https://auth.example.com
+        audience: vmcp-example
+```
+
+## Related information
+
+- [MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx) — desktop/CLI setup
+- [Optimizing LLM context](../concepts/tool-optimization.mdx) — background on
+  tool filtering and context pollution
+- [Configure vMCP servers](./configuration.mdx)
+- [EmbeddingServer CRD specification](../reference/crd-spec.md#apiv1alpha1embeddingserver)
+- [Virtual MCP Server overview](../concepts/vmcp.mdx) — conceptual overview of
+  vMCP
+- [VirtualMCPServer CRD specification](../reference/crd-spec.md#apiv1alpha1virtualmcpserver)