kaito-project/production-stack

Production Stack

This project provides a reference implementation of an inference stack built on top of Kaito.

Architecture

*(Production Stack Architecture diagram)*

Components

  • Istio Gateway — Entry point for all inference requests. Routes client requests (e.g., POST /completions) through the stack.
  • Body-based Routing — Parses request body to extract the model name and injects the x-gateway-model-name header, enabling model-level routing.
  • GAIE EPP (Gateway API Inference Extension Endpoint Picker) — Performs KV-cache aware routing by injecting the x-gateway-destination-endpoint header, directing requests to the optimal inference pod.
  • Kaito InferenceSet — Manages groups of vLLM inference pods. Multiple InferenceSets (e.g., Model-A, Model-B) can run different models simultaneously.
  • vLLM Inference Pods — Serve model inference requests using vLLM.
  • Kaito-Keda-Scaler — Metric-based autoscaler built on KEDA that scales vLLM inference pods up and down based on workload metrics.
  • Mocked GPU Nodes / CPU Nodes — Infrastructure layer providing compute resources for inference workloads.
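The header-injection flow described above can be sketched as follows. This is a minimal, hypothetical Python illustration of what the body-based routing and EPP stages do conceptually (parse the body, inject `x-gateway-model-name`, then pick a destination and inject `x-gateway-destination-endpoint`); it is not the actual filter implementation, and the function names and endpoint map are assumptions for illustration.

```python
import json

def inject_model_header(body: bytes, headers: dict) -> dict:
    """Body-based routing step (sketch): extract the model name from the
    JSON request body and inject it as the x-gateway-model-name header."""
    payload = json.loads(body)
    model = payload.get("model")
    if model is None:
        raise ValueError("request body has no 'model' field")
    updated = dict(headers)
    updated["x-gateway-model-name"] = model
    return updated

def pick_endpoint(headers: dict, endpoints_by_model: dict) -> dict:
    """Endpoint-picker step (sketch): choose a pod for the model and inject
    it as the x-gateway-destination-endpoint header. The real GAIE EPP uses
    KV-cache aware scoring; here we just take the first known endpoint."""
    model = headers["x-gateway-model-name"]
    candidates = endpoints_by_model.get(model, [])
    if not candidates:
        raise ValueError(f"no inference pods registered for model {model!r}")
    updated = dict(headers)
    updated["x-gateway-destination-endpoint"] = candidates[0]
    return updated

# Hypothetical request and pod registry, for illustration only.
body = b'{"model": "model-a", "prompt": "Hello"}'
headers = inject_model_header(body, {"content-type": "application/json"})
headers = pick_endpoint(headers, {"model-a": ["10.0.0.5:8000", "10.0.0.6:8000"]})
print(headers["x-gateway-model-name"])          # model-a
print(headers["x-gateway-destination-endpoint"])  # 10.0.0.5:8000
```

In the real stack these headers are added by gateway filters rather than application code, so inference pods and routing policy stay decoupled from clients.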
