This document provides a detailed description of the InferMesh architecture, its components, and the interfaces used across the system.
InferMesh is a GPU- and network-aware mesh that provides routing, observability, and control for both inference and training at scale.
It is designed for heterogeneous and distributed environments where nodes may differ in GPU capability, interconnect type, network latency, and bandwidth.
The architecture is organized into two hierarchical layers with three cross-cutting planes:
-
Infrastructure Layer (Underlay)
Provides the distributed mesh fabric: gossip, consensus, topology service, secure networking, node identity, and lifecycle management. -
Workload Overlay Layer
Provides connectors and integrations for inference runtimes, training frameworks, and GPUs. Hosts routers, schedulers, adapters, and the open strategy interface for routing and topology decisions. -
Planes
- Communication/Data Plane: inference traffic and training collective operations.
- Signal Plane: runtime, GPU, and network telemetry flowing into fused state.
- Control Plane: policies, job specs, placements, quotas, events, and administration.
flowchart TB
classDef underlay fill:#eef6ff,stroke:#84a9ff,stroke-width:1px,color:#0b2e66;
classDef overlay fill:#eefcf4,stroke:#8fd19e,stroke-width:1px,color:#145c2e;
classDef edge fill:#f7ecff,stroke:#c69cf2,stroke-width:1px,color:#442266;
classDef ext fill:#fff7e6,stroke:#f0b429,stroke-width:1px,color:#6a3d00;
%% Overlay Layer
subgraph Overlay["Workload Overlay Layer"]
direction TB
R["Inference Router
HTTP/2 gRPC, HTTP, WS/SSE"]:::overlay
SCHED["Training Scheduler
Job API (gRPC/HTTP)"]:::overlay
STRAT["Strategy Host (Inference + Training)
plugins: cdylib/wasm"]:::overlay
subgraph Adapters["Adapters & Connectors"]
direction LR
RTAD["Runtime Adapter
Triton/vLLM/TGI/TorchServe/TF/OVMS"]:::overlay
COMMAD["Comm Adapter
NCCL/UCC telemetry"]:::overlay
GPUAD["GPU Adapter
DCGM/NVML/ROCm-SMI"]:::overlay
IOAD["IO Adapter
Datasets/Checkpoint stores"]:::overlay
end
end
%% Infrastructure Layer
subgraph Underlay["Infrastructure Layer (Mesh Fabric)"]
AG["Mesh Agent (meshd)
state fusion + scoring API"]:::underlay
G["Membership & Gossip (SWIM)"]:::underlay
RF["Consensus (Raft groups)"]:::underlay
TOPO["Topology Service
link coords, RTT, WAN-awareness"]:::underlay
NET["Secure Networking
mTLS, QUIC, pools"]:::underlay
NM["Node Manager
identity, certs, drain/cordon"]:::underlay
POL["Policy Store
placements, quotas, SLOs"]:::underlay
end
%% Execution / External
subgraph Exec["Execution & External Systems"]
TRT["Inference Runtimes
Triton / vLLM / TGI / TorchServe / TF Serving"]:::edge
TRN["Training Runtimes
PyTorch DDP / DeepSpeed / Megatron"]:::edge
NCCL[("Collectives: NCCL/UCC/UCP")]:::edge
GPU["GPUs + MIG/MPS"]:::edge
DS[("Datasets / Feature Store")]:::ext
CKPT[("Checkpoint Store: S3/GCS/NFS")]:::ext
OBS[("Prometheus / Grafana")]:::ext
OTel[("OpenTelemetry Collector")]:::ext
end
%% Overlay <-> Underlay
R -->|ScoreTargets/Admit| AG
SCHED -->|Plan/Place/Elastic| AG
STRAT --> AG
RTAD -->|ModelStateDelta| AG
COMMAD -->|CommStatsDelta| AG
GPUAD -->|GpuStateDelta| AG
IOAD -->|Dataset/Checkpoint metrics| AG
AG --> G
AG --> RF
AG --> TOPO
AG --> POL
AG --> NM
AG --> NET
RTAD --> TRT
COMMAD --> NCCL
TRN --> NCCL
GPUAD --> GPU
IOAD --> DS
TRN --> CKPT
R -->|/metrics| OBS
SCHED -->|/metrics| OBS
AG -->|/metrics| OBS
RTAD -->|/metrics| OBS
COMMAD -->|/metrics| OBS
GPUAD -->|/metrics| OBS
R -->|OTLP traces| OTel
AG -->|OTLP traces| OTel
Every node runs a mesh agent (meshd). Nodes can be configured with specific roles:
-
Router Nodes
- Accept inference requests from clients.
- Invoke routing strategies through the Strategy Host.
- Apply admission control and hedged requests.
-
Trainer Nodes
- Run training runtimes (e.g., PyTorch DDP, DeepSpeed, Megatron).
- Surface collective timings and communication stats through the Comm Adapter.
- Use strategies to adjust topology (ring/tree/hierarchical) or elastic scale.
-
Coordinator Nodes
- Host the Training Scheduler or Router roles.
- Participate in Raft for consistent job and policy state.
-
GPU Nodes
- Run inference runtimes or participate in training.
- Always include Runtime + GPU Adapters for telemetry.
-
Edge Nodes
- Optional ingress close to users.
- Typically run Router + Agent roles to reduce WAN latency.
- Inference: Routers forward requests, apply routing decisions, cancel hedges, stream responses.
- Training: Trainers execute collective ops (all-reduce, all-gather); topology hints (ring/tree/hierarchical) may change mid-flight.
- Interfaces:
- Client ↔ Router: HTTP/1.1, HTTP/2 (gRPC), WebSockets.
- Router ↔ Runtime: gRPC/HTTP with mTLS.
- Scheduler ↔ Trainers: Job API with placement/topology hints.
- Standards: protobuf/OpenAPI schemas, W3C TraceContext.
- Components: Adapters (runtime, GPU, comm), mesh agent.
- Function: Gather metrics (queue depth, tokens/sec, gradient timings, GPU VRAM/SM%, ECC, link RTT). Fuse into local scoring APIs.
- Interfaces:
- Adapters ↔ Agent: gRPC streaming (
ModelStateDelta,GpuStateDelta,CommStatsDelta). - Metrics Export: Prometheus
/metrics, optional OTLP traces.
- Adapters ↔ Agent: gRPC streaming (
- Components: Agent, Scheduler, Router, Raft group.
- Function: Manage policies, job specs, quotas, tenancy, model pinning, elastic scaling.
- Interfaces:
- Admin ↔ Agent: gRPC/HTTP API with RBAC.
- Agent ↔ Agent: Gossip (SWIM) for membership; Raft for strongly consistent writes.
- Events: Server-streaming gRPC (
SubscribeEvents).
InferMesh provides a pluggable strategy interface for both inference routing and training topology decisions.
- Inputs: runtime stats, GPU telemetry, network topology, policies, SLOs, historical performance.
- Outputs: ranked targets (inference) or placement/topology plans (training).
- Built-ins:
baseline_rr,least_queue,hybrid-mesh(inference);ring,tree,hierarchical(training). - Custom strategies: users can provide their own via Rust plugins (
cdylib) or sandboxed WASM (wasm32-wasi).
- Unified interface: built-ins and user-defined strategies follow the same trait definitions.
- Safety: host enforces decision latency budgets (µs for inference, ms for training).
- Observability: decision latency, timeouts, win-rate vs baseline, hedge waste % all exported as metrics.
- Experimentation: A/B testing and shadow mode supported for new strategies.
| Interface | Protocol | Standards |
|---|---|---|
| Inference APIs | HTTP/2 gRPC, HTTP/1.1, WebSocket | protobuf, OpenAPI, W3C TraceContext |
| Training job control | gRPC + JSON | protobuf, OpenAPI |
| Collectives | In-proc + RDMA fabric | NCCL, UCC/UCP, SHARP, RoCE/IB, TCP |
| Runtime metrics | HTTP /metrics |
Prometheus, OpenMetrics |
| GPU metrics | HTTP /metrics |
Prometheus, DCGM schema |
| Tracing | OTLP/gRPC, OTLP/HTTP | OpenTelemetry, W3C TraceContext |
| Membership | Gossip (UDP/TCP) | SWIM-style protocol |
| Consensus | gRPC over TCP | Raft |
Label schema (for all metrics/traces):
model, revision, quant, runtime, node, gpu_uuid, mig_profile, job_id, step, bucket_id, tenant, zone
- mTLS everywhere (Router ↔ Agent ↔ GPU/Trainer nodes).
- Identity: X.509 certs (SPIFFE-compatible CA).
- RBAC: enforced on APIs using JWT/OIDC claims.
- Isolation: MIG/MPS enforced at GPU level; quotas and tenancy enforced in control plane.
- Prometheus: metrics from routers, schedulers, agents, adapters.
- OpenTelemetry: traces for inference requests and training steps/buckets.
- Logs: structured JSON with correlation IDs.
- Dashboards: Grafana dashboards covering inference latency/utilization and training comm efficiency.
- Backpressure: 429 or admission rejections when queues exceed thresholds.
- Quarantine: automatic node quarantine on ECC/thermal/network degradation.
- Partition tolerance: routers and schedulers degrade to local-only strategies when gossip fails.
- Consensus loss: existing configs persist; new writes blocked until quorum restored.
- v0.1: Local-only inference routing, mock training telemetry, Prometheus metrics.
- v0.2: Gossip membership, scoring API, strategy plugin host, Prometheus/OTel integration.
- v0.3: Raft policies, control-plane API, router hedging, training topology hints.
- v0.4: Runtime adapters (Triton, vLLM, TGI) and training comm adapters (NCCL/UCC).
- v1.0: Multi-region WAN-aware routing, MIG-aware scheduling, full strategy SDK (Rust + WASM).