Production-grade MCP tool retrieval and routing for large, dynamic tool ecosystems
VOTR is the system described in the paper [VOTR: Vector Orchestrated Tool Retrieval for Scalable Multi-Agent Systems](paper): a FastAPI service that retrieves and ranks MCP tools before model invocation, then returns a compact candidate set for schema injection. It is designed to preserve retrieval quality while reducing prompt overhead and supporting live MCP registry updates.
- Tool retrieval accuracy: Retrieves tools from a dataset of 309 servers / 2,806 tools at 96.4% recall, with smaller suites reaching 100% recall.
- Paper-aligned system: Implements the VOTR retrieval stack and evaluation workflow from the manuscript.
- Hybrid retrieval core: Dense similarity + BM25 + SPLADE-lite fused with weighted Reciprocal Rank Fusion.
- Field-aware reranking: Structured overlap scoring across server/tool name, description, and parameter signals.
- Confidence-gated handoff: Dynamic k ∈ {1, 3, 5} selection calibrated from non-conformity-style thresholds, plus an optional abstention protocol.
- Registry built for live MCP: Runtime discovery and hot registration through stdio and HTTP/SSE pathways.
- Robustness features: Overlap-aware disambiguation, abstention/null-route guards, and regression-oriented suites.
- Token efficiency: Compact schema lines (≈99% prompt reduction versus injecting the full schema corpus in general MCP use cases).
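The hybrid core above fuses the three retrievers' ranked lists with weighted Reciprocal Rank Fusion. A minimal sketch of that fusion step (tool IDs, weights, and the `k=60` constant are illustrative, not the repository's calibrated values):

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, k=60):
    """Weighted RRF: score(d) = sum_i w_i / (k + rank_i(d)).

    `rankings` holds one ranked candidate-id list per retriever
    (dense, BM25, SPLADE-lite); `weights` gives each retriever's
    fusion weight. k=60 is the common RRF smoothing constant.
    """
    scores = defaultdict(float)
    for ranked, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-retriever rankings for one query:
dense  = ["github.search_repos", "gitlab.search", "github.get_repo"]
bm25   = ["github.search_repos", "github.get_repo", "jira.search"]
splade = ["gitlab.search", "github.search_repos", "jira.search"]

fused = weighted_rrf([dense, bm25, splade], weights=[0.5, 0.3, 0.2])
# "github.search_repos" ranks highly in all three lists, so it fuses first
```

Because RRF operates on ranks rather than raw scores, the three retrievers' incomparable score scales never need to be normalized against each other.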
In large MCP deployments, injecting every tool schema is not viable. The paper motivates this with a 309-server / 2,806-tool setting, where full schema injection can exceed practical prompt budgets and degrade both accuracy and latency. VOTR treats tool selection as a retrieval and ranking problem with uncertainty-aware candidate sizing, instead of static top-k injection.
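Uncertainty-aware candidate sizing can be sketched as a margin-gated choice of k ∈ {1, 3, 5}. The thresholds below are illustrative stand-ins for the paper's calibrated non-conformity quantiles, not its actual values:

```python
def select_k(top_scores, thresholds=(0.15, 0.05)):
    """Pick k in {1, 3, 5} from the margin between the top two candidates.

    A large rank-1 vs rank-2 margin means high confidence -> inject k=1;
    a moderate margin -> k=3; otherwise fall back to k=5. The thresholds
    here are illustrative, standing in for calibrated non-conformity
    quantiles learned from held-out routing data.
    """
    margin = top_scores[0] - top_scores[1]
    if margin >= thresholds[0]:
        return 1
    if margin >= thresholds[1]:
        return 3
    return 5

select_k([0.92, 0.40])  # confident: margin 0.52 -> k=1
select_k([0.55, 0.53])  # ambiguous: margin 0.02 -> k=5
```

The same margin signal can drive an abstention guard: when even the k=5 gate is not met, the router can decline to hand off any tool rather than inject a low-confidence candidate set.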
This repository is the core VOTR router implementation.
- `src/mcp_router/`: retrieval engine, reranking, confidence policy, registry, API
- `benchmarks/`: functional correctness, ablations, efficiency, confidence, robustness
- `evaluation/`: reporting and external benchmark adapters (including LiveMCPBench tooling)
- `scripts/`: index/data preparation and result table generation
- `docs/`: implementation notes and policy documentation
Companion integration loop (optional, separate repo):
VOTR-Orchestrator is used for E2E integration testing around the router. In practice, it acts as the execution harness that sends routed tool candidates into a production-style multi-step agent loop, then validates tool-calling behavior across full conversations and chained tasks.
Prebuilt embeddings derived and expanded from MCP-Zero embeddings and index artifacts are published here:
- Dataset/artifacts page: https://huggingface.co/datasets/a13awd/VOTR
Current published payload includes approximately 623 MB of data uploaded from the MCP-Router/data tree, covering precomputed routing artifacts for reproducibility.
Hosted artifact types include:
- Precomputed tool/server embedding shards (`.npy`)
- Index metadata (`meta.json`, registry export, schema docs)
- Benchmark-ready subset indexes (small/medium/full, LiveMCPBench variant)
- Versioned checksum manifest for reproducibility
- 3,072-dimensional embeddings of 309 MCP servers + 2,806 tools, generated with text-embedding-3-large (OpenAI), ≈340 MB
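The dense leg of the hybrid core consumes embeddings like these. A minimal similarity-search sketch, with synthetic vectors standing in for the published 3,072-d shards:

```python
import numpy as np

# Synthetic stand-in for the published tool-embedding shards
# (2,806 tools x 3,072 dims from text-embedding-3-large).
rng = np.random.default_rng(0)
tool_embeddings = rng.normal(size=(2806, 3072)).astype(np.float32)
tool_embeddings /= np.linalg.norm(tool_embeddings, axis=1, keepdims=True)

def dense_top_k(query_vec, k=5):
    """Cosine similarity reduces to a dot product on L2-normalized vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = tool_embeddings @ q
    return np.argsort(-sims)[:k]

hits = dense_top_k(tool_embeddings[42])  # a tool should best match itself
```

In the real pipeline the query vector comes from embedding the routing intent at request time (hence the `OPENAI_API_KEY` requirement below), and the resulting ranking is one of the lists fed into fusion.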
- Python 3.10+
- `OPENAI_API_KEY` (for query-time embedding in the default configuration)
```
python -m pip install -e .
```

Optional extras (for future development and integration):

```
python -m pip install -e ".[dev,eval]"
python -m pip install -e ".[qdrant]"
```

Build the index:

```
python scripts/build_index.py \
  --input "../MCP-Zero/MCP-tools/mcp_tools_with_embedding.json" \
  --output data/index
```

Small dev build:

```
python scripts/build_index.py \
  --input "../MCP-Zero/MCP-tools/mcp_tools_with_embedding.json" \
  --output data/index \
  --max-servers 20
```

Set the API key (PowerShell):

```
$env:OPENAI_API_KEY="sk-..."
```

Start the router:

```
python -m uvicorn mcp_router.router:app --host 0.0.0.0 --port 8765
```

Common endpoints:
- `POST /route` - retrieve and rank candidate tools for a request
- `POST /register` - register a server/tools payload directly
- `POST /register/discover` - discover/register from a stdio MCP server
- `POST /register/discover/sse` - discover/register from an HTTP endpoint
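A hypothetical stdlib client call against a locally running router (host/port follow the uvicorn command above; the request is constructed but not sent here):

```python
import json
import urllib.request

# Request body mirrors the minimal /route example in this README.
payload = {
    "server_intent": "GitHub repositories and API",
    "tool_intent": "search repositories by query",
    "session_id": "ss2013",
}

req = urllib.request.Request(
    "http://localhost:8765/route",  # assumes the router is running locally
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# response = urllib.request.urlopen(req)        # requires a running router
# candidates = json.load(response)              # ranked candidate tool set
```

The response payload shape is defined by the router's API; inspect the live service (or the `src/mcp_router/` API code) for the exact schema before depending on specific fields.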
Minimal POST /route body:

```json
{
  "server_intent": "GitHub repositories and API",
  "tool_intent": "search repositories by query",
  "session_id": "ss2013"
}
```

Paper evaluation covers:
- Single-tool routing across small / medium / large suites
- Multi-hop and multi-tool scaled suites across small / medium / large settings (including long-hop stress runs)
- Ablations (dense-only, BM25-only, dense+BM25, fullstack)
- Confidence calibration and handoff behavior
- Latency and token-efficiency measurements
- Out-of-distribution stress test on LiveMCPBench
Runbook:
Common commands:
```
python benchmarks/functional_correctness/_run_all_suites.py
python benchmarks/efficiency/run_latency.py
python benchmarks/baselines_ablations/run_profiles.py
python benchmarks/efficiency/build_paper_comparison.py
python scripts/generate_results_tables.py
```

- Use `config.yaml` for baseline settings; put machine-specific overrides in `config.local.yaml`.
- Generated runtime artifacts and benchmark outputs are intentionally ignored by `.gitignore`.
- For paper-consistent runs, keep the index build source/version and benchmark suite versions fixed.
- TODO: create citation