Crater

A Kubernetes-native control plane for shared AI computing clusters

Operate shared GPU clusters, LLM training and serving workloads, developer environments, and data/model assets across research, education, and enterprise teams.

English · 简体中文

Documentation · Helm Chart · Backend · Frontend · CLI

🏢 Multi-Tenant Governance · ⚙️ Policy-Aware Scheduling · 🚀 LLM Training & Serving · 🧩 Heterogeneous Accelerators · 🤖 AI-Assisted Operations

📖 Table of Contents

Overview
Why Crater
Designed For
Features
Architecture
Getting Started
Documentation
Repository Structure
Community & Support
Contributing
License

✨ Overview

Crater is a Kubernetes-native platform for operating shared AI computing clusters. It helps organizations manage heterogeneous compute resources, submit and govern AI workloads, quickly deploy large-model training and inference environments, and observe cluster health from a unified web console, CLI, and AI-assisted operations interface.

Crater is designed for environments where different teams and workloads share the same GPU cluster: long-running training jobs, bursty lab workloads, interactive notebooks, online AI services, LLM inference services, and offline data processing pipelines. It builds an operational control plane on top of Kubernetes and Volcano, connecting users, accounts, queues, quotas, images, datasets, models, jobs, services, and observability into one workflow.


🧪 Interactive Development — Jupyter, WebIDE, terminals, and external access	🚀 AI Workloads — training, serving, templates, and batch jobs
📈 Monitoring — real-time metrics & logs	📦 Models & Datasets — manage assets in one place

💡 Why Crater

Kubernetes and Volcano provide powerful low-level scheduling, but operating a shared GPU cluster for many teams still requires a lot of glue. Crater fills that gap:

Without a control plane	With Crater
Raw `kubectl` / YAML access, easy to misuse	Web console, CLI, and APIs with role-based, multi-tenant access
GPU usage is hard to attribute and bound	Accounts, queues, quotas, approvals, and cost visibility
Everyone rebuilds training/serving manifests by hand	Reusable job templates and one-click LLM deployment
Datasets, models, and images scattered across nodes	Managed datasets, models, images, and shared storage
Operators and users debug from different tools	Unified metrics, logs, GPU analysis, and AI-assisted operations

🌐 Designed For

Crater fits shared AI computing environments in universities, research institutes, enterprise AI teams, and internal platform teams.

Scenario	Typical workloads	What Crater provides
Research & engineering	Model fine-tuning, simulation, scientific computing, large experiments	Long-running GPU jobs, reusable environments, data/model mounting, logs, monitoring, and lifecycle controls
Teaching & training	Course labs, student projects, virtual experiments, workshops	Account and quota management, job templates, burst handling, fair access, and simple web-based submission
LLM training & serving	Fine-tuning, evaluation, inference endpoints, model demos, mixed training/serving clusters	Fast deployment templates, GPU-aware placement, data/model assets, service access, and train/serve resource governance
Enterprise AI services	Internal assistants, document intelligence, multimodal services, inference backends	Managed runtime environments, service access, operational visibility, and resource governance
Data processing	Dataset preparation, image analysis, batch pipelines, offline preprocessing	Storage integration, dataset/model management, schedulable batch jobs, and observability

🎯 Features

🏢 Multi-Tenant Governance Manage users, accounts, queues, quotas, approvals, and billing-oriented resource visibility. Crater turns a raw GPU cluster into an accountable shared service for teams and projects.	⚙️ Policy-Aware Scheduling Build on Kubernetes and Volcano to support queue-based admission, priority-aware execution, prequeue policies, and workload placement across heterogeneous resources, including mixed training and serving workloads.
🚀 Workload Lifecycle Submit, clone, monitor, stop, and inspect AI workloads through Kubernetes-native jobs and reusable templates, from interactive sessions and LLM fine-tuning to long-running batch jobs.	🧪 Interactive Development Launch containerized Jupyter, WebIDE, web terminals, SSH access, and custom environments without manual cluster setup, giving users a reproducible workspace close to the data and GPUs.
📦 Data, Model & Image Assets Organize datasets, models, shared files, custom images, registry entries, and platform-side model or dataset downloads so workloads can reuse managed artifacts.	🧩 Heterogeneous Accelerators Represent GPUs and accelerator models as schedulable resources, supporting NVIDIA GPUs, domestic accelerator cards, vGPU-style resources, and DRA/CDI-based device integration.
📈 Observability & Operations Troubleshoot with metrics, logs, Grafana dashboards, node status, operation logs, GPU analysis, and runtime inspection, reducing the gap between platform operators and workload owners.	⌨️ Web, CLI & Agent Interfaces Operate Crater through a web console, command-line interface, HTTP APIs, and agent-oriented command skills for automation, scripted workflows, and AI-assisted operations.
🤖 LLM & AI Service Platform Support large-model quick deployment, LLM training and inference, inference gateways, model-serving integrations, trusted service integrations, and platform-managed runtime templates.	☸️ Kubernetes-Native Deployment Deploy with Helm and integrate with Kubernetes, Volcano, Prometheus/Grafana, persistent storage, and cluster add-ons while keeping workloads portable.

🏗️ Architecture

_{High-level architecture of Crater and its major components.}

Crater is organized around four layers:

User interfaces: web console, CLI, HTTP APIs, and agent-friendly command skills.
Control plane: authentication, accounts, quotas, scheduling policies, jobs, services, templates, images, datasets, models, approvals, and operations.
Execution layer: Kubernetes workloads, Volcano scheduling, accelerator resources, Pods, Services, PVCs, and external access rules for training, serving, and interactive environments.
Observability and AI operations layer: metrics, logs, Grafana dashboards, operation records, runtime diagnostics, AI assistant workflows, and admin-side intelligent operations.

🚀 Getting Started

1. Prerequisites

A running Kubernetes cluster
kubectl
Helm v3

2. Set up a cluster

Pick the option that fits your scenario:

Option	Best for	Reference
🐳 Kind	Local clusters in Docker	kind.sigs.k8s.io
🧱 Minikube	Single-node local dev & testing	minikube.sigs.k8s.io
☁️ Production K8s	Production or large-scale deployments	kubernetes.io/docs/setup

3. Install via Helm (OCI)

helm registry login ghcr.io
helm install crater oci://ghcr.io/raids-lab/crater --version <chart-version>

💡 The chart version is in charts/crater/Chart.yaml (field version) or in the GitHub releases.

Deployment guides:

📄 Minimal Deployment (Kind) — quickly spin up a basic Crater
📄 Cluster Deployment Guide — deploy a full Crater on a cluster

📚 Documentation

📘 Admin guide (English): https://raids-lab.github.io/crater/en/docs/admin/
📗 Admin guide (中文): https://raids-lab.github.io/crater/zh/docs/admin/

📁 Repository Structure

Path	Description
`backend/`	Backend services
`frontend/`	Web UI
`cli/`	Command-line interface
`charts/`	Helm charts for deploying Crater
`website/`	Documentation website source
`grafana-dashboards/`	Grafana dashboards used by Crater
`docs/`	Documentation entrypoints and localization resources
`hack/`	Developer tooling and scripts

💬 Community & Support

🐛 Issues — report bugs or request features: GitHub Issues
💡 Discussions — ask questions and share ideas: GitHub Discussions
📚 Docs — admin and user guides: raids-lab.github.io/crater
⭐ Star the project if you find Crater useful — it helps others discover it.

🤝 Contributing

We welcome community contributions! The complete development and contribution specification lives in CONTRIBUTING.md: global rules, environment setup (fork, hooks, unified config), workflow, commit convention, PR description template, and per-module entry points.

Per-module specs:

Backend — backend/CONTRIBUTING.md
Frontend — frontend/CONTRIBUTING.md
Website / Docs — website/CONTRIBUTING.md
CLI — cli/CONTRIBUTING.md

📝 License

Crater is licensed under the Apache License 2.0. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Crater

A Kubernetes-native control plane for shared AI computing clusters

✨ Overview

💡 Why Crater

🌐 Designed For

🎯 Features

🏢 Multi-Tenant Governance

⚙️ Policy-Aware Scheduling

🚀 Workload Lifecycle

🧪 Interactive Development

📦 Data, Model & Image Assets

🧩 Heterogeneous Accelerators

📈 Observability & Operations

⌨️ Web, CLI & Agent Interfaces

🤖 LLM & AI Service Platform

☸️ Kubernetes-Native Deployment

🏗️ Architecture

🚀 Getting Started

1. Prerequisites

2. Set up a cluster

3. Install via Helm (OCI)

📚 Documentation

📁 Repository Structure

💬 Community & Support

🤝 Contributing

📝 License

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2,710 Commits
.githook		.githook
.github		.github
.vscode		.vscode
backend		backend
charts		charts
cli		cli
docs		docs
frontend		frontend
grafana-dashboards		grafana-dashboards
hack		hack
skills		skills
website		website
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
NOTICE		NOTICE
README.md		README.md
package-lock.json		package-lock.json

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Crater

A Kubernetes-native control plane for shared AI computing clusters

✨ Overview

💡 Why Crater

🌐 Designed For

🎯 Features

🏢 Multi-Tenant Governance

⚙️ Policy-Aware Scheduling

🚀 Workload Lifecycle

🧪 Interactive Development

📦 Data, Model & Image Assets

🧩 Heterogeneous Accelerators

📈 Observability & Operations

⌨️ Web, CLI & Agent Interfaces

🤖 LLM & AI Service Platform

☸️ Kubernetes-Native Deployment

🏗️ Architecture

🚀 Getting Started

1. Prerequisites

2. Set up a cluster

3. Install via Helm (OCI)

📚 Documentation

📁 Repository Structure

💬 Community & Support

🤝 Contributing

📝 License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages