A lightweight, self-hosted LLM Proxy with Smart Features
Unified API for cloud and local LLMs with cost tracking, intelligent routing, semantic caching, web augmentation, and document RAG.
A single self-hosted proxy that puts all your LLM providers behind one API:
- One endpoint, all models — Claude, GPT, Gemini, Llama, and 700+ others
- Accurate cost tracking — Token-level tracking with cache and reasoning tokens
- Flexible attribution — Tag requests by user, project, or team
- Works with any client — Ollama and OpenAI API compatible
- Smart Features — Intelligent routing, semantic caching, web search augmentation, and document RAG
git clone https://github.com/benhumphry/llm-relay.git
cd llm-relay
cp .env.example .env
# Add your API keys to .env
docker compose up -d

That's it. Your proxy is running:
- API: http://localhost:11434
- Admin UI: http://localhost:8080
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="my-team")
# Use any provider through one client
client.chat.completions.create(model="claude-sonnet", messages=[...])
client.chat.completions.create(model="gpt-4o", messages=[...])
client.chat.completions.create(model="gemini-2.5-pro", messages=[...])

Or with curl:
curl http://localhost:11434/api/chat \
  -d '{"model": "claude-sonnet", "messages": [{"role": "user", "content": "Hello!"}]}'

Works with Open WebUI, Cursor, Continue, and any Ollama or OpenAI-compatible client.
15 built-in providers, 700+ models:
| Provider | Set Environment Variable |
|---|---|
| Anthropic | ANTHROPIC_API_KEY |
| OpenAI | OPENAI_API_KEY |
| Google Gemini | GOOGLE_API_KEY |
| DeepSeek | DEEPSEEK_API_KEY |
| Mistral | MISTRAL_API_KEY |
| Groq | GROQ_API_KEY |
| xAI | XAI_API_KEY |
| Perplexity | PERPLEXITY_API_KEY |
| OpenRouter | OPENROUTER_API_KEY |
| Fireworks | FIREWORKS_API_KEY |
| Together AI | TOGETHER_API_KEY |
| DeepInfra | DEEPINFRA_API_KEY |
| Cerebras | CEREBRAS_API_KEY |
| SambaNova | SAMBANOVA_API_KEY |
| Cohere | COHERE_API_KEY |
Plus: connect local Ollama instances and add custom OpenAI-compatible providers through the Admin UI.
Every request is logged with input tokens, output tokens, reasoning tokens, cache tokens, and calculated cost. View breakdowns by provider, model, tag, or client in the Admin UI.
Pricing syncs from LiteLLM and handles provider quirks automatically (Gemini tiered pricing, Perplexity per-request fees, Anthropic cache multipliers, etc.).
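As a rough illustration of how a per-request cost could be derived from the logged token counts, here is a minimal sketch. The `Usage` and `Pricing` types, the prices, the cache multiplier, and the choice to bill reasoning tokens at the output rate are all assumptions made for the example; they are not the proxy's actual schema or pricing data.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int             # assumed to include cached input tokens
    output_tokens: int
    cached_input_tokens: int = 0  # input tokens served from a provider prompt cache
    reasoning_tokens: int = 0     # assumed here to be billed at the output rate


@dataclass
class Pricing:
    # Hypothetical USD prices per million tokens.
    input_per_mtok: float
    output_per_mtok: float
    cache_read_multiplier: float = 0.1  # assumed discount applied to cached input
    per_request_fee: float = 0.0        # flat fee, for providers that charge one


def estimate_cost(usage: Usage, pricing: Pricing) -> float:
    """Return an estimated request cost in USD."""
    uncached = usage.input_tokens - usage.cached_input_tokens
    input_cost = uncached * pricing.input_per_mtok / 1_000_000
    cache_cost = (
        usage.cached_input_tokens
        * pricing.input_per_mtok
        * pricing.cache_read_multiplier
        / 1_000_000
    )
    billed_output = usage.output_tokens + usage.reasoning_tokens
    output_cost = billed_output * pricing.output_per_mtok / 1_000_000
    return input_cost + cache_cost + output_cost + pricing.per_request_fee
```

Keeping the multiplier and the flat fee as pricing fields means a provider-specific quirk becomes a data change instead of a code change.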
Attribute costs to users, projects, or teams:
# Via bearer token
curl -H "Authorization: Bearer alice,project-x" ...
# Via header
curl -H "X-Proxy-Tag: alice,project-x" ...
# Via model suffix
curl -d '{"model": "claude-sonnet@alice"}' ...

Create friendly names for models:
| Alias | Target |
|---|---|
| claude | anthropic/claude-sonnet-4-20250514 |
| gpt | openai/gpt-4o |
| fast | groq/llama-3.3-70b-versatile |
Transparent model name mappings with wildcard support. Unlike aliases, redirects are checked first in resolution and don't appear in logs:
| Source Pattern | Target |
|---|---|
| gpt-4 | gpt-4o |
| openrouter/anthropic/* | anthropic/* |
Use cases: seamless model upgrades, provider switching without client changes.
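To make the wildcard behavior concrete, here is a minimal sketch of how a redirect table like the one above could be applied. The `apply_redirect` function is hypothetical, and it assumes that only a trailing `*` is supported and that the first matching rule wins; the proxy's real matching rules may differ.

```python
def apply_redirect(model: str, redirects: dict[str, str]) -> str:
    """Return the redirected model name, or the input unchanged if no rule matches."""
    for pattern, target in redirects.items():
        if pattern.endswith("*"):
            prefix = pattern[:-1]
            if model.startswith(prefix):
                # Carry the part matched by "*" over to the target pattern.
                remainder = model[len(prefix):]
                if target.endswith("*"):
                    return target[:-1] + remainder
                return target
        elif model == pattern:
            return target
    return model


redirects = {
    "gpt-4": "gpt-4o",
    "openrouter/anthropic/*": "anthropic/*",
}
print(apply_redirect("openrouter/anthropic/claude-sonnet-4", redirects))
```

Note that the exact rule `gpt-4` does not touch `gpt-4-turbo` in this sketch, since only patterns ending in `*` match by prefix.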
Let an LLM pick the best model for each request. Configure candidate models, and a fast designator model routes requests based on query content.
Model Intelligence (optional): Enable web-gathered comparative assessments so the designator knows each model's relative strengths and weaknesses. The system searches for model reviews and direct comparisons, then summarizes into actionable routing guidance.
Semantic response caching using ChromaDB. Returns cached responses for semantically similar queries, reducing token usage and costs.
- Configurable similarity threshold (default 95%)
- TTL-based expiration
- Token length filters (skip caching short responses)
- Option to match only last message (ignores conversation history)
Requires ChromaDB (CHROMA_URL environment variable).
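The idea behind threshold-based semantic caching can be sketched in a few lines. This example is an in-memory stand-in, not the ChromaDB-backed implementation: the `TinySemanticCache` class is hypothetical, embeddings are passed in as plain lists, and the length filter counts whitespace-separated words as a crude substitute for tokens.

```python
import math
import time


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinySemanticCache:
    def __init__(self, threshold=0.95, ttl_seconds=3600.0, min_response_words=0):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.min_response_words = min_response_words
        self._entries = []  # list of (embedding, response, expires_at)

    def put(self, embedding, response, now=None):
        # Skip responses that are too short to be worth caching.
        if len(response.split()) < self.min_response_words:
            return
        now = time.time() if now is None else now
        self._entries.append((embedding, response, now + self.ttl_seconds))

    def get(self, embedding, now=None):
        now = time.time() if now is None else now
        # Drop expired entries before searching.
        self._entries = [e for e in self._entries if e[2] > now]
        best_response, best_score = None, 0.0
        for cached_embedding, response, _ in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_response, best_score = response, score
        return best_response if best_score >= self.threshold else None
```

A higher threshold trades hit rate for safety: fewer requests are served from cache, but a cached answer is less likely to be returned for a query that only looks similar.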
Context augmentation via web search and URL scraping. Every request is automatically augmented:
- A designator LLM generates an optimized search query
- Web search is performed (SearXNG, Perplexity, or Jina)
- URLs are reranked by relevance using a cross-encoder
- Top results are scraped for full content (built-in or Jina Reader)
- Combined context is injected into the request
Requires a search provider (SearXNG, Perplexity, or Jina).
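The steps above can be read as a simple pipeline. The sketch below wires them together with caller-supplied functions so that it stays independent of any particular search or scraping backend. The `augment_messages` function, its parameters, and the choice to inject the context as a leading system message are assumptions for illustration, not the proxy's actual internals.

```python
from typing import Callable, Sequence


def augment_messages(
    messages: Sequence[dict],
    make_query: Callable[[str], str],            # designator LLM: user text -> search query
    search: Callable[[str], list],               # search provider: query -> URLs
    rerank: Callable[[str, list], list],         # cross-encoder: (query, URLs) -> ordered URLs
    scrape: Callable[[str], str],                # scraper: URL -> page text
    top_k: int = 3,
) -> list:
    """Return a new message list with scraped web context prepended."""
    user_text = messages[-1]["content"]
    query = make_query(user_text)
    ranked_urls = rerank(query, search(query))
    pages = [scrape(url) for url in ranked_urls[:top_k]]
    context = "\n\n".join(pages)
    system_message = {"role": "system", "content": f"Web context:\n{context}"}
    # Build a new list so the caller's messages are left untouched.
    return [system_message, *messages]
```

Passing each stage in as a function makes it easy to swap one search provider or scraper for another, and to test the pipeline with stubs.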
Document-based context augmentation using RAG (Retrieval-Augmented Generation). Index local document folders and automatically retrieve relevant context for each query.
- Multiple formats — PDF, DOCX, PPTX, HTML, Markdown, images (via Docling)
- Flexible embeddings — Local (bundled), Ollama, or any configured provider
- Vision model offloading — Offload PDF parsing to Ollama or cloud vision models (e.g., granite3.2-vision)
- Semantic search — ChromaDB vector storage with configurable similarity threshold
- Cross-encoder reranking — Improves retrieval quality with always-on reranking
- Scheduled indexing — Cron-based re-indexing for updated documents
Mount your document folders into the container, create a Smart RAG pointing to the path, and requests to that model name automatically include relevant document context.
Requires ChromaDB (CHROMA_URL environment variable).
Clean web interface for:
- Provider and model management
- Ollama instance management (pull/delete models)
- Usage analytics with charts and filters
- Settings, pricing sync, data export
services:
  llm-relay:
    image: ghcr.io/benhumphry/llm-relay:latest
    ports:
      - "11434:11434"
      - "8080:8080"
    volumes:
      - llm-relay-data:/data
    env_file:
      - .env

volumes:
  llm-relay-data:

For production deployments, PostgreSQL is recommended. For Smart Cache, Smart Augmentor, and Model Intelligence features, ChromaDB is required.
See INSTALLATION.md for:
- PostgreSQL setup
- ChromaDB integration (vector storage for smart features)
- SearXNG integration (web search for Smart Augmentor)
- Full environment variable reference
- Docker Swarm deployment
- Troubleshooting guide
| Variable | Default | Description |
|---|---|---|
| PORT | 11434 | API server port |
| ADMIN_PORT | 8080 | Admin UI port |
| ADMIN_PASSWORD | (random) | Admin UI password |
| DATABASE_URL | SQLite | PostgreSQL URL for production |
| CHROMA_URL | (none) | ChromaDB URL (enables Smart Cache, Model Intelligence) |
| SEARXNG_URL | (none) | SearXNG URL (enables Smart Augmentor search) |
| JINA_API_KEY | (none) | Jina API key (enables Jina Search, Reranker) |
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python proxy.py

Ollama-compatible endpoints:
- GET /api/tags - List models
- POST /api/chat - Chat completion
- POST /api/generate - Text generation

OpenAI-compatible endpoints:
- GET /v1/models - List models
- POST /v1/chat/completions - Chat completion
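For clients that do not use an SDK, a chat completion request can be built with the standard library alone. This is a minimal sketch: `build_chat_request` is a hypothetical helper, the base URL assumes the default port, and the optional `X-Proxy-Tag` header follows the tagging convention described earlier.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, tag=None):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if tag:
        headers["X-Proxy-Tag"] = tag  # attribute the request to a user or project
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers=headers,
        method="POST",
    )


req = build_chat_request(
    "http://localhost:11434", "claude-sonnet", "Hello!", tag="alice,project-x"
)
# With the proxy running, send the request like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```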
- Getting Started - First-time setup walkthrough
- Smart Routers - Intelligent model routing
- Smart Caches - Semantic response caching
- Smart Augmentors - Web search augmentation
- Smart RAGs - Document RAG setup
MIT



