Add vMCP optimizer guide for Kubernetes #588
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,317 @@ | ||
| --- | ||
| title: Optimize tool discovery | ||
| description: | ||
| Enable the optimizer in vMCP to reduce token usage and improve tool selection | ||
| across aggregated backends. | ||
| --- | ||
|
|
||
| When Virtual MCP Server (vMCP) aggregates many backend MCP servers, the total | ||
| number of tools exposed to clients can grow quickly. The optimizer addresses | ||
| this by filtering tools per request, reducing token usage and improving tool | ||
| selection accuracy. | ||
|
|
||
| For the desktop/CLI approach using the MCP Optimizer container, see the | ||
| [MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx). This guide covers the | ||
| Kubernetes operator approach using VirtualMCPServer and EmbeddingServer CRDs. | ||
|
|
||
| ## Benefits | ||
|
|
||
| - **Reduced token usage**: Only relevant tools are included in context, not the | ||
| entire toolset | ||
| - **Improved tool selection**: The right tools surface for each query. With | ||
| fewer tools to reason over, agents are more likely to choose correctly | ||
|
|
||
| ## How it works | ||
|
|
||
| 1. You send a prompt that requires tool assistance | ||
| 2. The AI calls `find_tool` with keywords extracted from the prompt | ||
| 3. vMCP performs hybrid semantic and keyword search across all backend tools | ||
| 4. Only the most relevant tools (up to 8 by default) are returned | ||
| 5. The AI calls `call_tool` to execute the selected tool, and vMCP routes the | ||
| request to the appropriate backend | ||
|
|
||
| ```mermaid | ||
| flowchart TB | ||
| subgraph vmcpGroup["VirtualMCPServer"] | ||
| direction TB | ||
| vmcp["vMCP (optimizer enabled)"] | ||
| end | ||
| subgraph embedding["EmbeddingServer"] | ||
| direction TB | ||
| tei["Text Embeddings Inference"] | ||
| end | ||
| subgraph backends["MCPGroup backends"] | ||
| direction TB | ||
| mcp1["MCP server"] | ||
| mcp2["MCP server"] | ||
| mcp3["MCP server"] | ||
| end | ||
|
|
||
| client(["Client"]) <-- "find_tool / call_tool" --> vmcpGroup | ||
| vmcp <-. "semantic search" .-> embedding | ||
| vmcp <-. "discovers / routes" .-> backends | ||
| ``` | ||
|
|
||
| :::info[How search works internally] | ||
|
|
||
| The optimizer uses an internal SQLite database for both keyword search (using | ||
| full-text search) and storing semantic vectors. Keyword search runs locally | ||
| against this database; semantic search uses vectors generated by an embedding | ||
| server. You can control how results from these two sources are blended — see the | ||
| [parameter reference](#parameter-reference) for details. | ||
|
|
||
| ::: | ||
|
|
||
| ## Quick start | ||
|
|
||
| ### Step 1: Create an EmbeddingServer | ||
|
|
||
| Create an EmbeddingServer with default settings. This deploys a text embeddings | ||
| inference (TEI) server using the `BAAI/bge-small-en-v1.5` model: | ||
|
|
||
| ```yaml title="embedding-server.yaml" | ||
| apiVersion: toolhive.stacklok.dev/v1alpha1 | ||
| kind: EmbeddingServer | ||
| metadata: | ||
|   name: my-embedding | ||
|   namespace: toolhive-system | ||
| spec: {} | ||
| ``` | ||
|
|
||
| :::tip | ||
|
|
||
| Wait for the EmbeddingServer to reach the `Running` phase before proceeding. The | ||
| first startup may take a few minutes while the model downloads. | ||
|
|
||
| ```bash | ||
| kubectl get embeddingserver my-embedding -n toolhive-system -w | ||
| ``` | ||
|
|
||
| ::: | ||
|
|
||
| ### Step 2: Add the embedding reference to VirtualMCPServer | ||
|
|
||
| Update your existing VirtualMCPServer to include `embeddingServerRef`. **This is | ||
| the only change needed to enable the optimizer.** When you set | ||
| `embeddingServerRef`, the operator automatically enables the optimizer with | ||
| sensible defaults. You only need to add an explicit `optimizer` block if you | ||
| want to [tune the parameters](#tune-the-optimizer). | ||
|
|
||
| ```yaml title="VirtualMCPServer resource" | ||
| apiVersion: toolhive.stacklok.dev/v1alpha1 | ||
| kind: VirtualMCPServer | ||
| metadata: | ||
|   name: my-vmcp | ||
|   namespace: toolhive-system | ||
| spec: | ||
|   # highlight-start | ||
|   embeddingServerRef: | ||
|     name: my-embedding | ||
|   # highlight-end | ||
|   config: | ||
|     groupRef: my-group | ||
|   incomingAuth: | ||
|     type: anonymous | ||
| ``` | ||
|
|
||
| ### Step 3: Verify | ||
|
|
||
| Check that the VirtualMCPServer is ready: | ||
|
|
||
| ```bash | ||
| kubectl get virtualmcpserver my-vmcp -n toolhive-system | ||
| ``` | ||
|
|
||
| Look for `READY: True` in the output. Once ready, clients connecting to the vMCP | ||
| endpoint see only `find_tool` and `call_tool` instead of the full backend | ||
| toolset. | ||
|
|
||
| ## EmbeddingServer resource | ||
|
|
||
| The EmbeddingServer CRD manages the lifecycle of a TEI server. An empty | ||
| `spec: {}` uses all defaults. The two most important fields you can customize | ||
| are: | ||
|
|
||
| - **`model`**: The Hugging Face embedding model to use. The default | ||
| (`BAAI/bge-small-en-v1.5`) is the tested and recommended model. You can | ||
| substitute any embedding model available on Hugging Face — see the | ||
| [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) to compare | ||
| options. | ||
| - **`image`**: The container image for | ||
| [text-embeddings-inference](https://github.com/huggingface/text-embeddings-inference) | ||
| (TEI). The default is the CPU-only image | ||
| (`ghcr.io/huggingface/text-embeddings-inference:cpu-latest`). Swap this for a | ||
| CUDA-enabled image if you have GPU nodes available. | ||
|
|
||
| For the complete field reference, see the | ||
| [EmbeddingServer CRD specification](../reference/crd-spec.md#apiv1alpha1embeddingserver). | ||
|
|
||
| :::warning[ARM64 compatibility] | ||
|
|
||
| The default TEI CPU images depend on Intel MKL, which is x86_64-only. No | ||
| official ARM64 images exist yet. On ARM64 nodes (including Apple Silicon with | ||
| kind), you can run the amd64 image under emulation as a workaround. | ||
|
|
||
| First, pull the amd64 image and load it into your cluster: | ||
|
|
||
| ```bash | ||
| docker pull --platform linux/amd64 \ | ||
| ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 | ||
| kind load docker-image \ | ||
| ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 | ||
| ``` | ||
|
|
||
| The `kind load` command is specific to kind. For other cluster distributions, | ||
| use the equivalent image-loading mechanism (for example, `ctr images import` for | ||
| containerd, or push the image to a registry your cluster can pull from). | ||
|
|
||
| Then, pin the image in your EmbeddingServer so the operator uses the pre-pulled | ||
| tag instead of the default `cpu-latest`: | ||
|
|
||
| ```yaml title="embedding-server.yaml" | ||
| apiVersion: toolhive.stacklok.dev/v1alpha1 | ||
| kind: EmbeddingServer | ||
| metadata: | ||
|   name: my-embedding | ||
|   namespace: toolhive-system | ||
| spec: | ||
|   image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 | ||
| ``` | ||
|
|
||
| Native ARM64 support is in progress upstream. Track the | ||
| [TEI GitHub repository](https://github.com/huggingface/text-embeddings-inference) | ||
| for updates. | ||
|
|
||
| ::: | ||
|
|
||
| ## Tune the optimizer | ||
|
|
||
| To customize optimizer behavior, add the `optimizer` block under `spec.config` | ||
| in your VirtualMCPServer resource: | ||
|
|
||
| ```yaml title="VirtualMCPServer resource" | ||
| spec: | ||
|   config: | ||
|     groupRef: my-group | ||
|     # highlight-start | ||
|     optimizer: | ||
|       embeddingServiceTimeout: 30s | ||
|       maxToolsToReturn: 8 | ||
|       hybridSearchSemanticRatio: '0.5' | ||
|       semanticDistanceThreshold: '1.0' | ||
|     # highlight-end | ||
| ``` | ||
|
|
||
| ### Parameter reference | ||
|
|
||
| | Parameter | Description | Default | | ||
| | --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | | ||
| | `embeddingServiceTimeout` | HTTP request timeout for calls to the embedding service | `30s` | | ||
| | `maxToolsToReturn` | Maximum number of tools returned per search (1-50) | `8` | | ||
| | `hybridSearchSemanticRatio` | Balance between semantic and keyword search. `0.0` = all keyword, `1.0` = all semantic. Default gives equal weight to both. | `"0.5"` | | ||
| | `semanticDistanceThreshold` | Maximum distance from the search term for semantic results. `0` = identical, `2` = completely unrelated. Results beyond this threshold are filtered out. | `"1.0"` | | ||
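
The `0` to `2` range described for `semanticDistanceThreshold` matches the range of cosine distance (one minus cosine similarity). The guide does not state which distance metric the optimizer uses, so treat the following Python sketch as an illustration under that assumption only. The function name and the toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity, which ranges from 0 to 2."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D vectors standing in for real embeddings (assumption for illustration).
same = cosine_distance([1.0, 0.0], [1.0, 0.0])        # vectors point the same way
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # vectors share no direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # vectors point opposite ways
```

Under this assumption, the default threshold of `"1.0"` would keep results whose embeddings are at least not pointing away from the query embedding.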
|
|
||
| :::note | ||
|
|
||
| `hybridSearchSemanticRatio` and `semanticDistanceThreshold` are string-encoded | ||
| floats (for example, `"0.5"` not `0.5`). This is a Kubernetes CRD limitation, as | ||
| CRDs do not support float types portably. | ||
|
|
||
| ::: | ||
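
The constraints in the parameter table can be checked before a manifest is applied. The following Python sketch is not part of ToolHive: `validate_optimizer` and `parse_encoded_float` are hypothetical helpers, and the accepted ranges for the two float fields (0.0 to 1.0 and 0 to 2) are inferred from the descriptions above rather than taken from the operator's validation rules.

```python
def parse_encoded_float(name, value, low, high):
    """Parse a string-encoded float and check that it lies in [low, high]."""
    if not isinstance(value, str):
        raise TypeError(f"{name} must be a quoted string such as '0.5', got {value!r}")
    number = float(value)  # raises ValueError for non-numeric strings
    if not low <= number <= high:
        raise ValueError(f"{name}={number} is outside [{low}, {high}]")
    return number


def validate_optimizer(block):
    """Hypothetical helper: validate a dict shaped like the `optimizer` block."""
    max_tools = block.get("maxToolsToReturn", 8)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_tools, bool) or not isinstance(max_tools, int):
        raise TypeError(f"maxToolsToReturn must be an integer, got {max_tools!r}")
    if not 1 <= max_tools <= 50:
        raise ValueError(f"maxToolsToReturn={max_tools} is outside 1-50")
    return {
        "maxToolsToReturn": max_tools,
        "hybridSearchSemanticRatio": parse_encoded_float(
            "hybridSearchSemanticRatio",
            block.get("hybridSearchSemanticRatio", "0.5"), 0.0, 1.0),
        "semanticDistanceThreshold": parse_encoded_float(
            "semanticDistanceThreshold",
            block.get("semanticDistanceThreshold", "1.0"), 0.0, 2.0),
    }
```

Passing an unquoted `0.5` raises a `TypeError` here, which mirrors the mistake the note above warns about.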
|
|
||
| :::info[EmbeddingServer is always required] | ||
|
|
||
| Even if you set `hybridSearchSemanticRatio` to `"0.0"` (all keyword search), the | ||
| optimizer still requires a configured EmbeddingServer. The EmbeddingServer won't | ||
| be used at runtime when the semantic ratio is `0.0`, but the configuration must | ||
| be present due to how the optimizer is wired internally. | ||
|
|
||
| ::: | ||
|
|
||
| :::tip[Tuning guidance] | ||
|
|
||
| The defaults are well-tested and work for most use cases. If you do need to | ||
| adjust them: | ||
|
|
||
| - **Lower `semanticDistanceThreshold`** (for example, `"0.6"`) for higher | ||
| precision: only very close matches are returned | ||
| - **Raise `semanticDistanceThreshold`** (for example, `"1.4"`) for higher | ||
| recall: broader matches are included | ||
| - **Increase `maxToolsToReturn`** if the AI frequently cannot find the right | ||
| tool; decrease it to save tokens | ||
| - **Adjust `hybridSearchSemanticRatio`** toward `"1.0"` if tool names are not | ||
| descriptive, or toward `"0.0"` if exact keyword matching is more useful | ||
| - `semanticDistanceThreshold` filtering is applied before the `maxToolsToReturn` | ||
| cap. A low threshold can filter out candidates before the cap takes effect, so | ||
| you may need to raise the threshold if too few results are returned | ||
|
|
||
| ::: | ||
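
The interaction between the threshold, the semantic ratio, and the result cap can be sketched in a few lines of Python. This is a simplified model and not the optimizer's actual implementation: the linear blend of scores, the normalized `keyword_score`, the `Candidate` type, and the tool names are all assumptions made for illustration. The only behavior taken from this guide is that threshold filtering happens before the `maxToolsToReturn` cap.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    keyword_score: float      # assumed scale: 0.0 (no match) to 1.0 (best match)
    semantic_distance: float  # 0 = identical, 2 = completely unrelated


def rank_tools(candidates, semantic_ratio=0.5, distance_threshold=1.0, max_tools=8):
    """Illustrative hybrid ranking: filter by threshold, blend scores, then cap."""
    scored = []
    for c in candidates:
        within = c.semantic_distance <= distance_threshold
        # Map distance [0, 2] to a score [1, 0]; filtered semantic hits score 0.
        semantic_score = 1.0 - c.semantic_distance / 2.0 if within else 0.0
        if not within and c.keyword_score == 0.0:
            continue  # neither search produced a usable hit for this tool
        score = semantic_ratio * semantic_score + (1.0 - semantic_ratio) * c.keyword_score
        scored.append((score, c.name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # The cap is applied last, after threshold filtering.
    return [name for _, name in scored[:max_tools]]


# Hypothetical candidates for a query such as "create an issue".
tools = [
    Candidate("github_create_issue", keyword_score=0.9, semantic_distance=0.4),
    Candidate("jira_create_ticket", keyword_score=0.0, semantic_distance=0.7),
    Candidate("slack_post_message", keyword_score=0.0, semantic_distance=1.3),
    Candidate("fetch_url", keyword_score=0.2, semantic_distance=1.6),
]

default_result = rank_tools(tools)
strict_result = rank_tools(tools, distance_threshold=0.6)
loose_result = rank_tools(tools, distance_threshold=1.4)
```

In this model, the strict threshold leaves only two of the four candidates even though the cap allows eight, which is the situation the last bullet in the tuning guidance describes.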
|
|
||
| ## Complete example | ||
|
|
||
| This example shows a fuller configuration that combines high availability for | ||
| the embedding server, persistent model caching, and tuned optimizer | ||
| parameters. | ||
|
|
||
| The EmbeddingServer runs two replicas with resource limits and a persistent | ||
| volume for model caching, so restarts don't re-download the model: | ||
|
|
||
| ```yaml title="embedding-server-full.yaml" | ||
| apiVersion: toolhive.stacklok.dev/v1alpha1 | ||
| kind: EmbeddingServer | ||
| metadata: | ||
|   name: full-embedding | ||
|   namespace: toolhive-system | ||
| spec: | ||
|   replicas: 2 | ||
|   resources: | ||
|     requests: | ||
|       cpu: '500m' | ||
|       memory: '512Mi' | ||
|     limits: | ||
|       cpu: '2' | ||
|       memory: '1Gi' | ||
|   modelCache: | ||
|     enabled: true | ||
|     storageSize: 5Gi | ||
| ``` | ||
|
|
||
| The VirtualMCPServer uses a shorter embedding timeout (15s) because the | ||
| EmbeddingServer is co-located, so requests to it have low latency. Increase | ||
| this value if the embedding service is remote or under high load: | ||
|
|
||
| ```yaml title="vmcp-with-optimizer.yaml" | ||
| apiVersion: toolhive.stacklok.dev/v1alpha1 | ||
| kind: VirtualMCPServer | ||
| metadata: | ||
|   name: full-vmcp | ||
|   namespace: toolhive-system | ||
| spec: | ||
|   embeddingServerRef: | ||
|     name: full-embedding | ||
|   config: | ||
|     groupRef: my-tools | ||
|     optimizer: | ||
|       embeddingServiceTimeout: 15s | ||
|       maxToolsToReturn: 10 | ||
|       hybridSearchSemanticRatio: '0.6' | ||
|       semanticDistanceThreshold: '0.8' | ||
|   incomingAuth: | ||
|     type: oidc | ||
|     oidcConfig: | ||
|       type: inline | ||
|       inline: | ||
|         issuer: https://auth.example.com | ||
|         audience: vmcp-example | ||
| ``` | ||
|
|
||
| ## Related information | ||
|
|
||
| - [MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx) — desktop/CLI setup | ||
| - [Optimizing LLM context](../concepts/tool-optimization.mdx) — background on | ||
| tool filtering and context pollution | ||
| - [Configure vMCP servers](./configuration.mdx) | ||
| - [EmbeddingServer CRD specification](../reference/crd-spec.md#apiv1alpha1embeddingserver) | ||
| - [Virtual MCP Server overview](../concepts/vmcp.mdx) — conceptual overview of | ||
| vMCP | ||
| - [VirtualMCPServer CRD specification](../reference/crd-spec.md#apiv1alpha1virtualmcpserver) | ||