1 change: 1 addition & 0 deletions docs/toolhive/concepts/tool-optimization.mdx
@@ -257,3 +257,4 @@ how to configure them:
- [Customize tools (Kubernetes)](../guides-k8s/customize-tools.mdx)
- [MCPToolConfig CRD reference](../reference/crd-spec.md)
- [Virtual MCP Server tool aggregation](../guides-vmcp/tool-aggregation.mdx)
- [Optimize tool discovery in vMCP](../guides-vmcp/optimizer.mdx)
3 changes: 3 additions & 0 deletions docs/toolhive/concepts/vmcp.mdx
@@ -31,6 +31,8 @@ vMCP delivers four key benefits:
3. **Improve security**: Centralized authentication and authorization with a
two-boundary model
4. **Enable reusability**: Define workflows once, use them everywhere
5. **Optimize tool discovery**: Reduce token usage by replacing all tool
definitions with two lightweight search-and-call primitives

## Key capabilities

@@ -165,4 +167,5 @@ teams managing multiple MCP servers.
- [Configure authentication](../guides-vmcp/authentication.mdx)
- [Tool aggregation and conflict resolution](../guides-vmcp/tool-aggregation.mdx)
- [Composite tools and workflows](../guides-vmcp/composite-tools.mdx)
- [Optimize tool discovery](../guides-vmcp/optimizer.mdx)
- [Proxy remote MCP servers](../guides-k8s/remote-mcp-proxy.mdx)
3 changes: 3 additions & 0 deletions docs/toolhive/guides-vmcp/configuration.mdx
@@ -279,6 +279,8 @@ Backend discovery guide.

## Next steps

- [Optimize tool discovery](./optimizer.mdx) by adding an `embeddingServerRef`
to reduce token usage across many backends
- Review [scaling and performance guidance](./scaling-and-performance.mdx) for
resource planning
- Discover your deployed MCP servers automatically using the
@@ -292,6 +294,7 @@ Backend discovery guide.
- [Scaling and Performance](./scaling-and-performance.mdx)
- [Backend discovery modes](./backend-discovery.mdx)
- [Tool aggregation](./tool-aggregation.mdx)
- [Optimize tool discovery](./optimizer.mdx)
- [Composite tools](./composite-tools.mdx)
- [Authentication](./authentication.mdx)
- [Proxy remote MCP servers](../guides-k8s/remote-mcp-proxy.mdx)
18 changes: 18 additions & 0 deletions docs/toolhive/guides-vmcp/intro.mdx
@@ -36,6 +36,10 @@ for details on the current limitations.
- **Centralized authentication**: Single sign-on with per-backend token exchange
- **Composite workflows**: Multi-step operations across backend MCP servers with
parallel execution, approval gates, and error handling
- **Tool optimization**: Replace all individual tool definitions with two
lightweight primitives (`find_tool` and `call_tool`) to reduce token usage and
improve tool selection. See [Optimize tool discovery](./optimizer.mdx) and the
underlying [concepts](../concepts/tool-optimization.mdx)

## When to use vMCP

@@ -46,6 +50,8 @@ for details on the current limitations.
- You have centralized authentication and authorization requirements
- You need reusable workflow definitions
- You want to aggregate external SaaS MCP servers with internal tools
- You want to reduce token usage and improve tool selection accuracy across many
backends with the [optimizer](./optimizer.mdx)

### Not needed

@@ -87,9 +93,21 @@ flowchart TB
5. Clients connect to the VirtualMCPServer endpoint and see a unified view of
all tools from both local and remote backends

## Optimize tool discovery

As the number of aggregated backends grows, clients receive a large number of
tool definitions that consume tokens and can degrade tool selection accuracy.
The vMCP optimizer addresses this by replacing all individual tool definitions
with two lightweight primitives (`find_tool` and `call_tool`) and using hybrid
semantic and keyword search to surface only the most relevant tools per request.
To enable the optimizer, add an `embeddingServerRef` to your VirtualMCPServer
resource. See [Optimize tool discovery](./optimizer.mdx) for the full setup
guide.

## Related information

- [Quickstart: Virtual MCP Server](./quickstart.mdx)
- [Understanding Virtual MCP Server](../concepts/vmcp.mdx)
- [Optimize tool discovery](./optimizer.mdx)
- [Scaling and Performance](./scaling-and-performance.mdx)
- [Proxy remote MCP servers](../guides-k8s/remote-mcp-proxy.mdx)
317 changes: 317 additions & 0 deletions docs/toolhive/guides-vmcp/optimizer.mdx
@@ -0,0 +1,317 @@
---
title: Optimize tool discovery
description:
Enable the optimizer in vMCP to reduce token usage and improve tool selection
across aggregated backends.
---

When Virtual MCP Server (vMCP) aggregates many backend MCP servers, the total
number of tools exposed to clients can grow quickly. The optimizer addresses
this by filtering tools per request, reducing token usage and improving tool
selection accuracy.

This guide covers the Kubernetes operator approach using the VirtualMCPServer
and EmbeddingServer CRDs. For the desktop/CLI approach using the MCP Optimizer
container, see the [MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx).

## Benefits

- **Reduced token usage**: Only relevant tools are included in context, not the
entire toolset
- **Improved tool selection**: The right tools surface for each query. With
fewer tools to reason over, agents are more likely to choose correctly

## How it works

1. You send a prompt that requires tool assistance
2. The AI calls `find_tool` with keywords extracted from the prompt
3. vMCP performs hybrid semantic and keyword search across all backend tools
4. Only the most relevant tools (up to 8 by default) are returned
5. The AI calls `call_tool` to execute the selected tool, and vMCP routes the
request to the appropriate backend

```mermaid
flowchart TB
subgraph vmcpGroup["VirtualMCPServer"]
direction TB
vmcp["vMCP (optimizer enabled)"]
end
subgraph embedding["EmbeddingServer"]
direction TB
tei["Text Embeddings Inference"]
end
subgraph backends["MCPGroup backends"]
direction TB
mcp1["MCP server"]
mcp2["MCP server"]
mcp3["MCP server"]
end

client(["Client"]) <-- "find_tool / call_tool" --> vmcpGroup
vmcp <-. "semantic search" .-> embedding
vmcp <-. "discovers / routes" .-> backends
```
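On the wire, the two primitives travel over standard MCP `tools/call` requests.
The sketch below is illustrative only: the exact argument schemas of `find_tool`
and `call_tool` are not documented here, so the field names shown (`query`) are
assumptions, not the actual API.

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "find_tool",
    "arguments": { "query": "create an issue in the tracker" }
  }
}
```

The response lists the most relevant backend tools (up to `maxToolsToReturn`);
the client then issues a second `tools/call` for `call_tool` with the chosen
tool's name and arguments, and vMCP routes it to the owning backend.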

:::info[How search works internally]

The optimizer uses an internal SQLite database for both keyword search (using
full-text search) and storing semantic vectors. Keyword search runs locally
against this database; semantic search uses vectors generated by an embedding
server. You can control how results from these two sources are blended — see the
[parameter reference](#parameter-reference) for details.

:::

## Quick start

### Step 1: Create an EmbeddingServer

Create an EmbeddingServer with default settings. This deploys a text embeddings
inference (TEI) server using the `BAAI/bge-small-en-v1.5` model:

```yaml title="embedding-server.yaml"
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: EmbeddingServer
metadata:
name: my-embedding
namespace: toolhive-system
spec: {}
```

:::tip

Wait for the EmbeddingServer to reach the `Running` phase before proceeding. The
first startup may take a few minutes while the model downloads.

```bash
kubectl get embeddingserver my-embedding -n toolhive-system -w
```

:::

### Step 2: Add the embedding reference to VirtualMCPServer

Update your existing VirtualMCPServer to include `embeddingServerRef`. **This is
the only change needed to enable the optimizer.** When you set
`embeddingServerRef`, the operator automatically enables the optimizer with
sensible defaults. You only need to add an explicit `optimizer` block if you
want to [tune the parameters](#tune-the-optimizer).

```yaml title="VirtualMCPServer resource"
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
name: my-vmcp
namespace: toolhive-system
spec:
# highlight-start
embeddingServerRef:
name: my-embedding
# highlight-end
config:
groupRef: my-group
incomingAuth:
type: anonymous
```

### Step 3: Verify

Check that the VirtualMCPServer is ready:

```bash
kubectl get virtualmcpserver my-vmcp -n toolhive-system
```

Look for `READY: True` in the output. Once ready, clients connecting to the vMCP
endpoint see only `find_tool` and `call_tool` instead of the full backend
toolset.

## EmbeddingServer resource

The EmbeddingServer CRD manages the lifecycle of a TEI server. An empty
`spec: {}` uses all defaults. The two most important fields you can customize
are:

- **`model`**: The Hugging Face embedding model to use. The default
(`BAAI/bge-small-en-v1.5`) is the tested and recommended model. You can
substitute any embedding model available on Hugging Face — see the
[MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) to compare
options.
- **`image`**: The container image for
[text-embeddings-inference](https://github.com/huggingface/text-embeddings-inference)
(TEI). The default is the CPU-only image
(`ghcr.io/huggingface/text-embeddings-inference:cpu-latest`). Swap this for a
CUDA-enabled image if you have GPU nodes available.
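As a sketch, the two fields look like this when set explicitly. The values shown
are the defaults described above, so this resource behaves the same as
`spec: {}`:

```yaml title="embedding-server-explicit.yaml"
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: EmbeddingServer
metadata:
  name: my-embedding
  namespace: toolhive-system
spec:
  # Hugging Face embedding model (default shown)
  model: BAAI/bge-small-en-v1.5
  # TEI container image; swap for a CUDA-enabled build on GPU nodes
  image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
```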

For the complete field reference, see the
[EmbeddingServer CRD specification](../reference/crd-spec.md#apiv1alpha1embeddingserver).

:::warning[ARM64 compatibility]

The default TEI CPU images depend on Intel MKL, which is x86_64-only. No
official ARM64 images exist yet. On ARM64 nodes (including Apple Silicon with
kind), you can run the amd64 image under emulation as a workaround.

First, pull the amd64 image and load it into your cluster:

```bash
docker pull --platform linux/amd64 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.7
kind load docker-image \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.7
```

The `kind load` command is specific to kind. For other cluster distributions,
use the equivalent image-loading mechanism (for example, `ctr images import` for
containerd, or push the image to a registry your cluster can pull from).

Then, pin the image in your EmbeddingServer so the operator uses the pre-pulled
tag instead of the default `cpu-latest`:

```yaml title="embedding-server.yaml"
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: EmbeddingServer
metadata:
name: my-embedding
namespace: toolhive-system
spec:
image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.7
```

Native ARM64 support is in progress upstream. Track the
[TEI GitHub repository](https://github.com/huggingface/text-embeddings-inference)
for updates.

:::

## Tune the optimizer

To customize optimizer behavior, add the `optimizer` block under `spec.config`
in your VirtualMCPServer resource:

```yaml title="VirtualMCPServer resource"
spec:
config:
groupRef: my-group
# highlight-start
optimizer:
embeddingServiceTimeout: 30s
maxToolsToReturn: 8
hybridSearchSemanticRatio: '0.5'
semanticDistanceThreshold: '1.0'
# highlight-end
```

### Parameter reference

| Parameter | Description | Default |
| --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `embeddingServiceTimeout` | HTTP request timeout for calls to the embedding service | `30s` |
| `maxToolsToReturn` | Maximum number of tools returned per search (1-50) | `8` |
| `hybridSearchSemanticRatio` | Balance between semantic and keyword search. `0.0` = all keyword, `1.0` = all semantic. Default gives equal weight to both. | `"0.5"` |
| `semanticDistanceThreshold` | Maximum distance from the search term for semantic results. `0` = identical, `2` = completely unrelated. Results beyond this threshold are filtered out. | `"1.0"` |

:::note

`hybridSearchSemanticRatio` and `semanticDistanceThreshold` are string-encoded
floats (for example, `"0.5"` not `0.5`). This is a Kubernetes CRD limitation, as
CRDs do not support float types portably.

:::

:::info[EmbeddingServer is always required]

Even if you set `hybridSearchSemanticRatio` to `"0.0"` (all keyword search), the
optimizer still requires a configured EmbeddingServer. The EmbeddingServer won't
be used at runtime when the semantic ratio is `0.0`, but the configuration must
be present due to how the optimizer is wired internally.

:::
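For example, a keyword-only configuration still carries the reference. This
sketch reuses the resource names from the quick start:

```yaml title="VirtualMCPServer resource (keyword-only search)"
spec:
  embeddingServerRef:
    name: my-embedding # still required, even though unused at ratio "0.0"
  config:
    groupRef: my-group
    optimizer:
      hybridSearchSemanticRatio: '0.0' # all keyword, no semantic search
```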

:::tip[Tuning guidance]

The defaults are well-tested and work for most use cases. If you do need to
adjust them:

- **Lower `semanticDistanceThreshold`** (for example, `"0.6"`) for higher
precision: only very close matches are returned
- **Raise `semanticDistanceThreshold`** (for example, `"1.4"`) for higher
recall: broader matches are included
- **Increase `maxToolsToReturn`** if the AI frequently cannot find the right
tool; decrease it to save tokens
- **Adjust `hybridSearchSemanticRatio`** toward `"1.0"` if tool names are not
descriptive, or toward `"0.0"` if exact keyword matching is more useful
- `semanticDistanceThreshold` filtering is applied before the `maxToolsToReturn`
cap. A low threshold can filter out candidates before the cap takes effect, so
you may need to raise the threshold if too few results are returned

:::
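Putting the first and third bullets together, a precision-oriented,
token-conscious configuration might look like this (the values are illustrative
starting points, not recommendations):

```yaml title="VirtualMCPServer resource (precision-tuned)"
spec:
  config:
    groupRef: my-group
    optimizer:
      # tighter threshold: only very close semantic matches survive
      semanticDistanceThreshold: '0.6'
      # fewer tools returned per search to save tokens
      maxToolsToReturn: 5
```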

## Complete example

This example shows a full configuration with all available options, including
high availability for the embedding server, persistent model caching, and tuned
optimizer parameters.

The EmbeddingServer runs two replicas with resource limits and a persistent
volume for model caching, so restarts don't re-download the model:

```yaml title="embedding-server-full.yaml"
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: EmbeddingServer
metadata:
name: full-embedding
namespace: toolhive-system
spec:
replicas: 2
resources:
requests:
cpu: '500m'
memory: '512Mi'
limits:
cpu: '2'
memory: '1Gi'
modelCache:
enabled: true
storageSize: 5Gi
```

The VirtualMCPServer uses a shorter embedding timeout (15s) because the
EmbeddingServer is co-located in the same cluster, so calls to it are
low-latency. Increase this value if the embedding service is remote or under
high load:

```yaml title="vmcp-with-optimizer.yaml"
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
name: full-vmcp
namespace: toolhive-system
spec:
embeddingServerRef:
name: full-embedding
config:
groupRef: my-tools
optimizer:
embeddingServiceTimeout: 15s
maxToolsToReturn: 10
hybridSearchSemanticRatio: '0.6'
semanticDistanceThreshold: '0.8'
incomingAuth:
type: oidc
oidcConfig:
type: inline
inline:
issuer: https://auth.example.com
audience: vmcp-example
```

## Related information

- [MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx) — desktop/CLI setup
- [Optimizing LLM context](../concepts/tool-optimization.mdx) — background on
tool filtering and context pollution
- [Configure vMCP servers](./configuration.mdx)
- [EmbeddingServer CRD specification](../reference/crd-spec.md#apiv1alpha1embeddingserver)
- [Virtual MCP Server overview](../concepts/vmcp.mdx) — conceptual overview of
vMCP
- [VirtualMCPServer CRD specification](../reference/crd-spec.md#apiv1alpha1virtualmcpserver)