This project provides a reference implementation showing how to build an inference stack on top of Kaito.
- Istio Gateway — Entry point for all inference requests. Routes client requests (e.g., `GET /completions`) through the stack.
- Body-based Routing — Parses the request body to extract the model name and injects the `x-gateway-model-name` header, enabling model-level routing.
- GAIE EPP (Gateway API Inference Extension Endpoint Picker) — Performs KV-cache-aware routing by injecting the `x-gateway-destination-endpoint` header, directing each request to the optimal inference pod.
- Kaito InferenceSet — Manages groups of vLLM inference pods. Multiple InferenceSets (e.g., Model-A, Model-B) can run different models simultaneously.
- vLLM Inference Pods — Serve model inference requests using vLLM.
- Kaito-Keda-Scaler — Metric-based autoscaler built on KEDA that scales vLLM inference pods up and down based on workload metrics.
- Mocked GPU Nodes / CPU Nodes — Infrastructure layer providing compute resources for inference workloads.
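
The body-based routing step above can be sketched as follows. This is a minimal, self-contained Python illustration (not the actual component, which runs inside the gateway's request path): it parses a JSON request body, extracts the `model` field, and injects the `x-gateway-model-name` header. The function name and the assumption that the body carries a `model` field are illustrative.

```python
import json


def inject_model_header(headers: dict, body: bytes) -> dict:
    """Sketch of body-based routing: parse the JSON request body,
    extract the model name, and inject it as the x-gateway-model-name
    header so downstream components can route at the model level."""
    payload = json.loads(body)
    model = payload.get("model")
    routed = dict(headers)  # copy so the original headers stay untouched
    if model:
        routed["x-gateway-model-name"] = model
    return routed


# Example: a completions request targeting Model-A
request_body = json.dumps({"model": "model-a", "prompt": "Hello"}).encode()
headers = inject_model_header({"content-type": "application/json"}, request_body)
print(headers["x-gateway-model-name"])  # model-a
```

Downstream, the GAIE EPP performs the analogous step for endpoints, adding `x-gateway-destination-endpoint` based on KV-cache locality rather than the body contents.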
