Operate shared GPU clusters, LLM training and serving workloads, developer environments, and data/model assets across research, education, and enterprise teams.
English · 简体中文
Documentation · Helm Chart · Backend · Frontend · CLI
🏢 Multi-Tenant Governance · ⚙️ Policy-Aware Scheduling · 🚀 LLM Training & Serving · 🧩 Heterogeneous Accelerators · 🤖 AI-Assisted Operations
Crater is a Kubernetes-native platform for operating shared AI computing clusters. It helps organizations manage heterogeneous compute resources, submit and govern AI workloads, quickly deploy large-model training and inference environments, and observe cluster health from a unified web console, CLI, and AI-assisted operations interface.
Crater is designed for environments where different teams and workloads share the same GPU cluster: long-running training jobs, bursty lab workloads, interactive notebooks, online AI services, LLM inference services, and offline data processing pipelines. It builds an operational control plane on top of Kubernetes and Volcano, connecting users, accounts, queues, quotas, images, datasets, models, jobs, services, and observability into one workflow.
Kubernetes and Volcano provide powerful low-level scheduling, but operating a shared GPU cluster for many teams still requires a lot of glue. Crater fills that gap:
| Without a control plane | With Crater |
|---|---|
| Raw `kubectl` / YAML access, easy to misuse | Web console, CLI, and APIs with role-based, multi-tenant access |
| GPU usage is hard to attribute and bound | Accounts, queues, quotas, approvals, and cost visibility |
| Everyone rebuilds training/serving manifests by hand | Reusable job templates and one-click LLM deployment |
| Datasets, models, and images scattered across nodes | Managed datasets, models, images, and shared storage |
| Operators and users debug from different tools | Unified metrics, logs, GPU analysis, and AI-assisted operations |
Crater fits shared AI computing environments in universities, research institutes, enterprise AI teams, and internal platform teams.
| Scenario | Typical workloads | What Crater provides |
|---|---|---|
| Research & engineering | Model fine-tuning, simulation, scientific computing, large experiments | Long-running GPU jobs, reusable environments, data/model mounting, logs, monitoring, and lifecycle controls |
| Teaching & training | Course labs, student projects, virtual experiments, workshops | Account and quota management, job templates, burst handling, fair access, and simple web-based submission |
| LLM training & serving | Fine-tuning, evaluation, inference endpoints, model demos, mixed training/serving clusters | Fast deployment templates, GPU-aware placement, data/model assets, service access, and train/serve resource governance |
| Enterprise AI services | Internal assistants, document intelligence, multimodal services, inference backends | Managed runtime environments, service access, operational visibility, and resource governance |
| Data processing | Dataset preparation, image analysis, batch pipelines, offline preprocessing | Storage integration, dataset/model management, schedulable batch jobs, and observability |

- Manage users, accounts, queues, quotas, approvals, and billing-oriented resource visibility. Crater turns a raw GPU cluster into an accountable shared service for teams and projects.
- Build on Kubernetes and Volcano to support queue-based admission, priority-aware execution, prequeue policies, and workload placement across heterogeneous resources, including mixed training and serving workloads.
- Submit, clone, monitor, stop, and inspect AI workloads through Kubernetes-native jobs and reusable templates, from interactive sessions and LLM fine-tuning to long-running batch jobs.
- Launch containerized Jupyter, WebIDE, web terminals, SSH access, and custom environments without manual cluster setup, giving users a reproducible workspace close to the data and GPUs.
- Organize datasets, models, shared files, custom images, registry entries, and platform-side model or dataset downloads so workloads can reuse managed artifacts.
- Represent GPUs and accelerator models as schedulable resources, supporting NVIDIA GPUs, domestic accelerator cards, vGPU-style resources, and DRA/CDI-based device integration.
- Troubleshoot with metrics, logs, Grafana dashboards, node status, operation logs, GPU analysis, and runtime inspection, reducing the gap between platform operators and workload owners.
- Operate Crater through a web console, command-line interface, HTTP APIs, and agent-oriented command skills for automation, scripted workflows, and AI-assisted operations.
- Support large-model quick deployment, LLM training and inference, inference gateways, model-serving integrations, trusted service integrations, and platform-managed runtime templates.
- Deploy with Helm and integrate with Kubernetes, Volcano, Prometheus/Grafana, persistent storage, and cluster add-ons while keeping workloads portable.

Crater is organized around four layers:
- User interfaces: web console, CLI, HTTP APIs, and agent-friendly command skills.
- Control plane: authentication, accounts, quotas, scheduling policies, jobs, services, templates, images, datasets, models, approvals, and operations.
- Execution layer: Kubernetes workloads, Volcano scheduling, accelerator resources, Pods, Services, PVCs, and external access rules for training, serving, and interactive environments.
- Observability and AI operations layer: metrics, logs, Grafana dashboards, operation records, runtime diagnostics, AI assistant workflows, and admin-side intelligent operations.
Crater runs on Kubernetes, so you need a cluster before installing. Pick the option that fits your scenario:
| Option | Best for | Reference |
|---|---|---|
| 🐳 Kind | Local clusters in Docker | kind.sigs.k8s.io |
| 🧱 Minikube | Single-node local dev & testing | minikube.sigs.k8s.io |
| ☁️ Production K8s | Production or large-scale deployments | kubernetes.io/docs/setup |
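
Before installing, it can help to confirm that the client tools are on your `PATH`. The sketch below assumes `helm` and `kubectl` are the tools you need for your setup; `missing_tools` is a hypothetical helper written for this example and is not part of Crater.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper (not shipped with Crater):
# it prints each named command that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only if the tool resolves to something runnable.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Example (assumed tool list; adjust to your environment):
#   missing_tools helm kubectl
```

Empty output means every listed tool was found; any name printed still needs to be installed.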
```shell
helm registry login ghcr.io
helm install crater oci://ghcr.io/raids-lab/crater --version <chart-version>
```

💡 The chart version is in `charts/crater/Chart.yaml` (field `version`) or in the GitHub releases.
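
If you work from a checkout of the repository, the `version` field can be read from a script instead of copied by hand. This is a minimal sketch, assuming the field appears as an unindented `version:` line in `Chart.yaml`; `chart_version` is a hypothetical helper, not something provided by Helm or Crater.

```shell
#!/bin/sh
# chart_version is a hypothetical helper (not part of Helm or Crater):
# it prints the top-level `version:` value from a Chart.yaml file.
chart_version() {
  # Match only an unindented "version:" key (so "appVersion:" is skipped),
  # take the first hit, then strip any surrounding quotes.
  awk '/^version:/ { print $2; exit }' "$1" | tr -d "\"'"
}

# Example, run from the repository root (not executed here):
#   helm install crater oci://ghcr.io/raids-lab/crater \
#     --version "$(chart_version charts/crater/Chart.yaml)"
```

Parsing YAML with `awk` is a shortcut that only suits this simple one-line case; a YAML-aware tool is the safer choice if the file layout differs.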
Deployment guides:
- 📄 Minimal Deployment (Kind) — quickly spin up a basic Crater
- 📄 Cluster Deployment Guide — deploy a full Crater on a cluster
- 📘 Admin guide (English): https://raids-lab.github.io/crater/en/docs/admin/
- 📗 Admin guide (中文): https://raids-lab.github.io/crater/zh/docs/admin/
| Path | Description |
|---|---|
| `backend/` | Backend services |
| `frontend/` | Web UI |
| `cli/` | Command-line interface |
| `charts/` | Helm charts for deploying Crater |
| `website/` | Documentation website source |
| `grafana-dashboards/` | Grafana dashboards used by Crater |
| `docs/` | Documentation entrypoints and localization resources |
| `hack/` | Developer tooling and scripts |
- 🐛 Issues — report bugs or request features: GitHub Issues
- 💡 Discussions — ask questions and share ideas: GitHub Discussions
- 📚 Docs — admin and user guides: raids-lab.github.io/crater
- ⭐ Star the project if you find Crater useful — it helps others discover it.
We welcome community contributions! The complete development and contribution specification lives in CONTRIBUTING.md: global rules, environment setup (fork, hooks, unified config), workflow, commit convention, PR description template, and per-module entry points.
Per-module specs:
- Backend — backend/CONTRIBUTING.md
- Frontend — frontend/CONTRIBUTING.md
- Website / Docs — website/CONTRIBUTING.md
- CLI — cli/CONTRIBUTING.md
Crater is licensed under the Apache License 2.0. See LICENSE.




