vllm-proxy-rs

Rust proxy for vLLM/sglang inference engines running in GPU TEE environments. Adds Intel TDX + NVIDIA GPU attestation and cryptographic signing (ECDSA secp256k1 + Ed25519) to standard OpenAI-compatible API endpoints.

Rewrite of nearai/vllm-proxy (Python).

Features

Dual signing — every response is signed with both ECDSA (EIP-191, secp256k1) and Ed25519. Signatures are cached and retrievable per chat ID.
TEE attestation — generates Intel TDX quotes via dstack-sdk and NVIDIA GPU evidence via Python subprocess.
Backend-agnostic — works with any OpenAI-compatible backend (vLLM, sglang, etc.).
Streaming support — SSE streams are hashed incrementally and signed on completion.
In-memory cache — moka-based TTL cache for signatures (no Redis dependency).
Fusion orchestration — optional server-side multi-model deliberation for /v1/chat/completions, gated by FUSION_ENABLED.

Endpoints

Method	Path	Auth	Description
GET	`/`	No	Health check
GET	`/version`	No	Proxy version
GET	`/v1/metrics`	No	Backend metrics passthrough
GET	`/v1/models`	No	Backend models passthrough
POST	`/v1/chat/completions`	Yes	Chat completions (streaming + non-streaming)
POST	`/v1/completions`	Yes	Text completions
POST	`/v1/embeddings`	Yes	Embeddings
POST	`/v1/tokenize`	Yes	Tokenization (no signing)
POST	`/v1/rerank`	Yes	Reranking
POST	`/v1/score`	Yes	Scoring
POST	`/v1/images/generations`	Yes	Image generation
POST	`/v1/images/edits`	Yes	Image editing (multipart)
POST	`/v1/audio/transcriptions`	Yes	Audio transcription (multipart)
GET	`/v1/signature/{chat_id}`	Yes	Retrieve cached signature
GET	`/v1/attestation/report`	Yes	TEE attestation report

Error Handling

All error responses use the OpenAI-compatible JSON format:

{"error": {"message": "...", "type": "...", "param": null, "code": null}}

Proxy-generated errors

Status	Type	When
400	`bad_request`	Invalid JSON, bad parameters
401	`unauthorized`	Invalid or missing Bearer token
404	`not_found`	Signature chat ID not found
413	`payload_too_large`	Request body exceeds size limit
429	`rate_limited`	Per-IP rate limit exceeded
500	`server_error`	Internal proxy error (details hidden from client)

Upstream errors (vLLM/sglang)

Named routes (/v1/chat/completions, /v1/completions, etc.) pass through the backend error body verbatim, preserving the original status code. The catch-all route (arbitrary paths) parses the backend error and re-wraps it in the OpenAI format above.

Common upstream errors:

Status	Type	Example message
400	`BadRequestError`	`"This model's maximum context length is 2048 tokens. However, you requested 4374 tokens"`
400	`BadRequestError`	`"temperature must be non-negative, got -0.5"`
400	`BadRequestError`	`"Stream options can only be defined when 'stream=True'"`
400	`BadRequestError`	`"please provide at least one prompt"`
400	`BadRequestError`	`"auto tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set"`
404	`Not Found`	`"The model 'gpt-5' does not exist"`
422	`Bad Request`	Pydantic validation details (field type mismatches)
500	`InternalServerError`	`"Internal server error"` (GPU OOM, engine crash)
501	`NotImplementedError`	`"Tool usage is only supported for Chat Completions API"`

Logging and privacy

All upstream errors are logged with structured fields for diagnostics:

WARN request{request_id=abc-123 method=POST path=/v1/chat/completions}:
  Backend returned non-success status
  upstream_status=400 upstream_url=http://vllm:8000/v1/chat/completions
  error_message="This model's maximum context length is 2048 tokens..."
  error_type=BadRequestError

What is logged: HTTP status codes, backend URLs, error messages (token counts, parameter names), error types, request IDs.

What is never logged: Request bodies, response bodies, prompt content, user messages, completion text.

Configuration

All configuration is via environment variables:

Variable	Required	Default	Description
`MODEL_NAME`	Yes	—	Model name for cache key namespacing
`TOKEN`	Yes	—	Bearer token for API authentication
`VLLM_BASE_URL`	No	`http://localhost:8000`	Backend base URL
`DEV`	No	`false`	Dev mode (random signing keys instead of KMS)
`GPU_NO_HW_MODE`	No	`false`	Use canned GPU evidence
`CHAT_CACHE_EXPIRATION`	No	`1200`	Signature cache TTL in seconds
`VLLM_PROXY_MAX_REQUEST_SIZE`	No	`10485760`	Max JSON request body (bytes)
`VLLM_PROXY_MAX_IMAGE_REQUEST_SIZE`	No	`52428800`	Max image request body (bytes)
`VLLM_PROXY_MAX_AUDIO_REQUEST_SIZE`	No	`104857600`	Max audio request body (bytes)
`VLLM_PROXY_IMAGE_VALIDATION_DISABLED`	No	`false`	Disable pre-dispatch image URL/data validation
`VLLM_PROXY_IMAGE_VALIDATION_TIMEOUT_SECS`	No	`5`	Per-fetch timeout and semaphore acquire deadline for validation
`VLLM_PROXY_IMAGE_VALIDATION_MAX_BYTES`	No	`8192`	Max fetched/decode-head bytes used for image sniffing
`VLLM_PROXY_IMAGE_VALIDATION_MAX_CONCURRENCY`	No	`8`	Global concurrent outbound image-validation fetches
`VLLM_PROXY_IMAGE_VALIDATION_ALLOW_PRIVATE_HOSTS`	No	`false`	Permit private/loopback image hosts for trusted deployments/tests
`VLLM_PROXY_IMAGE_VALIDATION_ALLOWED_DOMAINS`	No	empty; Gemma-4 defaults to `prod-files-secure.s3.us-west-2.amazonaws.com`	Exact remote `image_url` host allowlist enforced before fetch and on every redirect hop. When unset, falls back to `VLLM_ALLOWED_MEDIA_DOMAINS`; set explicitly to an empty string to disable the proxy-side domain restriction
`VLLM_ALLOWED_MEDIA_DOMAINS`	No	empty	vLLM-compatible media-domain allowlist used by the proxy only when `VLLM_PROXY_IMAGE_VALIDATION_ALLOWED_DOMAINS` is unset
`VLLM_PROXY_IMAGE_VALIDATION_REJECT_NON_RGB`	No	`false` (`1` forces strict mode)	Gemma-4 defaults to rejecting observed one-channel PNG/JPEG crash inputs; set `1` to reject broader non-RGB PNG/JPEG classes
`VLLM_PROXY_MAX_KEEPALIVE`	No	`100`	Connection pool max idle per host
`LISTEN_PORT`	No	`8000`	Server listen port
`VLLM_IMAGES_URL`	No	`{base}/v1/images/generations`	Override images endpoint
`VLLM_IMAGES_EDITS_URL`	No	`{base}/v1/images/edits`	Override image edits endpoint
`VLLM_TRANSCRIPTIONS_URL`	No	`{base}/v1/audio/transcriptions`	Override transcriptions endpoint
`VLLM_RERANK_URL`	No	`{base}/v1/rerank`	Override rerank endpoint
`VLLM_SCORE_URL`	No	`{base}/v1/score`	Override score endpoint

Fusion

Fusion is disabled by default. When enabled, /v1/chat/completions intercepts OpenRouter-compatible openrouter:fusion server tools, NEAR nearai:fusion tools, and OpenRouter plugin entries with {"id":"fusion"}; all other routes and non-Fusion chat requests keep the normal proxy behavior. Cloud API remains a pass-through: billing observes the single final response, whose usage contains the aggregate token usage from panel, judge, and synthesis calls.

Supported request shapes:

{"tools":[{"type":"openrouter:fusion","parameters":{"analysis_models":["model-a"],"model":"judge-model"}}]}

{"plugins":[{"id":"fusion","analysis_models":["model-a"],"model":"judge-model"}]}

Legacy flat tool fields continue to work for nearai:fusion and existing clients. plugins[].enabled=false is not treated as a Fusion invocation. The openrouter/fusion model alias is not resolved in inference-proxy because cloud-api routes by model name before pass-through.

Variable	Required	Default	Description
`FUSION_ENABLED`	No	`false`	Enables server-side Fusion orchestration
`FUSION_INTERNAL_BEARER_TOKEN`	Yes, when enabled	—	Bearer token used for internal direct completions calls
`FUSION_ENDPOINTS_URL`	No	`https://completions.near.ai/endpoints`	Endpoint discovery source
`FUSION_ENDPOINTS_TTL_SECS`	No	`300`	Discovery cache TTL
`FUSION_DEFAULT_ANALYSIS_MODELS`	No	—	Comma-separated fallback panel models
`FUSION_MAX_PANEL_MODELS`	No	`8`	Hard cap on panel fan-out
`FUSION_MAX_DEPTH`	No	`1`	Recursion guard for Fusion-to-Fusion calls
`FUSION_PANEL_TIMEOUT_SECS`	No	`120`	Timeout for Fusion panel, judge, and synthesis chat calls
`FUSION_MAX_RESPONSE_BYTES`	No	`10485760`	Max bytes buffered from Fusion endpoint discovery and internal model responses
`FUSION_INTERNAL_MAX_ATTEMPTS`	No	`2`	Attempts for transient Fusion direct model HTTP calls; `1` disables retries; max `5`
`FUSION_INTERNAL_RETRY_INITIAL_BACKOFF_MS`	No	`250`	Initial backoff for Fusion direct model retries; doubles per attempt with full jitter
`AGENT_LOOP_MAX_ITERATIONS`	No	`5`	Also caps Fusion `web_context_search` tool calls
`WEB_CONTEXT_SEARCH_URL`	If Fusion web search is used	—	Brave LLM Context endpoint
`WEB_CONTEXT_SEARCH_API_KEY`	If Fusion web search is used	—	Brave LLM Context API key
`BRAVE_LLM_CONTEXT_API_KEY`	No	—	Alias for `WEB_CONTEXT_SEARCH_API_KEY`

Production launch checklist:

Keep FUSION_ENABLED=false until the direct completions token and Brave key are provisioned in the deployment secret store.
Set FUSION_INTERNAL_BEARER_TOKEN to a token accepted by the direct model proxies listed by FUSION_ENDPOINTS_URL.
Treat FUSION_ENDPOINTS_URL as a trust anchor. Every returned panel or judge domain receives FUSION_INTERNAL_BEARER_TOKEN; do not point it at user-controlled endpoint lists or allow per-request endpoint overrides.
V1 does not SSRF-filter discovered panel domains beyond trusting FUSION_ENDPOINTS_URL. Keep the endpoint list operator-controlled and use network policy if a deployment needs additional egress restrictions.
V1 attestation covers the synthesis proxy response. Panel attestation is an informational /attestation/report liveness check and is not cryptographically bound into the final response signature.
For Fusion requests that include {"type":"web_context_search"}, set WEB_CONTEXT_SEARCH_URL=https://api.search.brave.com/res/v1/llm/context and either WEB_CONTEXT_SEARCH_API_KEY or BRAVE_LLM_CONTEXT_API_KEY.
Run the local/live smoke test before flipping the feature flag:

FUSION_INTERNAL_BEARER_TOKEN=... \
WEB_CONTEXT_SEARCH_API_KEY=... \
scripts/fusion_e2e.py --real-brave

The smoke test starts only local helper processes, calls live direct model proxies, verifies non-streaming and streaming Fusion, checks aggregate usage, and retrieves the final response signature. It does not deploy production.

Running

# Dev mode (random signing keys, no TEE required)
DEV=1 MODEL_NAME=my-model TOKEN=secret cargo run

# Production (requires dstack TEE environment)
MODEL_NAME=my-model TOKEN=secret cargo run --release

The server listens on 0.0.0.0:8000 by default (configurable via LISTEN_PORT).

Building

cargo build --release

Testing

cargo test

The suite includes unit tests for signing, cache, config, errors, attestation helpers, SSE parsing, Fusion orchestration, and integration tests with wiremock mock backends for cryptographic signature verification, multipart endpoints, streaming completions, E2EE, web tools, and Fusion.

Project Structure

src/
  lib.rs              # Public module exports, AppState, request ID middleware
  main.rs             # Entry point, server startup, graceful shutdown
  config.rs           # Env var configuration
  error.rs            # AppError -> OpenAI-style JSON error responses
  types.rs            # SignedChat, AttestationReport, SignatureResponse
  signing.rs          # ECDSA (secp256k1) + Ed25519 signing, key derivation
  attestation.rs      # TDX + GPU attestation report generation
  cache.rs            # moka in-memory cache with TTL
  proxy.rs            # Generic proxy helpers (JSON, streaming SSE, multipart)
  auth.rs             # Bearer token auth extractor
  routes/
    mod.rs            # Router assembly
    health.rs         # GET /, GET /version
    chat.rs           # POST /v1/chat/completions
    completions.rs    # POST /v1/completions
    passthrough.rs    # embeddings, rerank, score, images, audio, tokenize
    signature.rs      # GET /v1/signature/{chat_id}
    attestation.rs    # GET /v1/attestation/report
    metrics.rs        # GET /v1/metrics, GET /v1/models
tests/
  integration.rs      # Integration tests with wiremock

Signing

Every signed response produces a SignedChat cached by response ID:

{
  "text": "{sha256_request}:{sha256_response}",
  "signature_ecdsa": "0x{r}{s}{v}",
  "signing_address_ecdsa": "0x{ethereum_address}",
  "signature_ed25519": "{hex_signature}",
  "signing_address_ed25519": "{hex_public_key}"
}

ECDSA: EIP-191 personal_sign format, recoverable secp256k1 signature
Ed25519: Direct message signing, 64-byte signature

In production, signing keys are derived from dstack KMS (DstackClient::get_key). In dev mode (DEV=1), random keys are generated at startup.

Name		Name	Last commit message	Last commit date
Latest commit History 310 Commits
.cargo		.cargo
.github		.github
benches		benches
docs		docs
scripts		scripts
src		src
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
Makefile		Makefile
README.md		README.md
attestation-constraints.txt		attestation-constraints.txt
build-image.sh		build-image.sh
deny.toml		deny.toml
gpu_evidence_worker.py		gpu_evidence_worker.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

vllm-proxy-rs

Features

Endpoints

Error Handling

Proxy-generated errors

Upstream errors (vLLM/sglang)

Logging and privacy

Configuration

Fusion

Running

Building

Testing

Project Structure

Signing

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

vllm-proxy-rs

Features

Endpoints

Error Handling

Proxy-generated errors

Upstream errors (vLLM/sglang)

Logging and privacy

Configuration

Fusion

Running

Building

Testing

Project Structure

Signing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages