This project provides a reference implementation showing how to build an inference stack on top of Kaito.
- Istio Gateway — Entry point for all inference requests. Routes client requests (e.g., `GET /completions`) through the stack.
- Body-based Routing — Parses the request body to extract the model name and injects the `x-gateway-model-name` header, enabling model-level routing.
- GAIE EPP (Gateway API Inference Extension Endpoint Picker) — Performs KV-cache-aware routing by injecting the `x-gateway-destination-endpoint` header, directing each request to the optimal inference pod.
- Kaito InferenceSet — Manages groups of vLLM inference pods. Multiple InferenceSets (e.g., Model-A, Model-B) can run different models simultaneously.
- vLLM Inference Pods — Serve model inference requests using vLLM.
- Kaito-Keda-Scaler — Metric-based autoscaler built on KEDA that scales vLLM inference pods up and down based on workload metrics.
- Mocked GPU Nodes / CPU Nodes — Infrastructure layer providing compute resources for inference workloads.
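
The body-based routing step above can be sketched as follows. This is a minimal, self-contained Python illustration (not the actual component, which runs inside the gateway's request path): it parses a JSON request body, extracts the `model` field, and injects the `x-gateway-model-name` header. The function name and the assumption that the body carries a `model` field are illustrative.

```python
import json


def inject_model_header(headers: dict, body: bytes) -> dict:
    """Sketch of body-based routing: parse the JSON request body,
    extract the model name, and inject it as the x-gateway-model-name
    header so downstream components can route at the model level."""
    payload = json.loads(body)
    model = payload.get("model")
    routed = dict(headers)  # copy so the original headers stay untouched
    if model:
        routed["x-gateway-model-name"] = model
    return routed


# Example: a completions request targeting Model-A
request_body = json.dumps({"model": "model-a", "prompt": "Hello"}).encode()
headers = inject_model_header({"content-type": "application/json"}, request_body)
print(headers["x-gateway-model-name"])  # model-a
```

Downstream, the GAIE EPP performs the analogous step for endpoints, adding `x-gateway-destination-endpoint` based on KV-cache locality rather than the body contents.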
