
raids-lab/crater


Crater

A Kubernetes-native control plane for shared AI computing clusters

Operate shared GPU clusters, LLM training and serving workloads, developer environments, and data/model assets across research, education, and enterprise teams.





🏢 Multi-Tenant Governance  ·  ⚙️ Policy-Aware Scheduling  ·  🚀 LLM Training & Serving  ·  🧩 Heterogeneous Accelerators  ·  🤖 AI-Assisted Operations



✨ Overview

Crater is a Kubernetes-native platform for operating shared AI computing clusters. It helps organizations manage heterogeneous compute resources, submit and govern AI workloads, quickly deploy large-model training and inference environments, and observe cluster health from a unified web console, CLI, and AI-assisted operations interface.

Crater is designed for environments where different teams and workloads share the same GPU cluster: long-running training jobs, bursty lab workloads, interactive notebooks, online AI services, LLM inference services, and offline data processing pipelines. It builds an operational control plane on top of Kubernetes and Volcano, connecting users, accounts, queues, quotas, images, datasets, models, jobs, services, and observability into one workflow.

  • 🧪 Interactive Development: Jupyter, WebIDE, terminals, and external access
  • 🚀 AI Workloads: training, serving, templates, and batch jobs
  • 📈 Monitoring: real-time metrics and logs
  • 📦 Models & Datasets: manage assets in one place

💡 Why Crater

Kubernetes and Volcano provide powerful low-level scheduling, but operating a shared GPU cluster for many teams still requires a lot of glue. Crater fills that gap:

| Without a control plane | With Crater |
| --- | --- |
| Raw kubectl / YAML access, easy to misuse | Web console, CLI, and APIs with role-based, multi-tenant access |
| GPU usage is hard to attribute and bound | Accounts, queues, quotas, approvals, and cost visibility |
| Everyone rebuilds training/serving manifests by hand | Reusable job templates and one-click LLM deployment |
| Datasets, models, and images scattered across nodes | Managed datasets, models, images, and shared storage |
| Operators and users debug from different tools | Unified metrics, logs, GPU analysis, and AI-assisted operations |

🌐 Designed For

Crater fits shared AI computing environments in universities, research institutes, enterprise AI teams, and internal platform teams.

| Scenario | Typical workloads | What Crater provides |
| --- | --- | --- |
| Research & engineering | Model fine-tuning, simulation, scientific computing, large experiments | Long-running GPU jobs, reusable environments, data/model mounting, logs, monitoring, and lifecycle controls |
| Teaching & training | Course labs, student projects, virtual experiments, workshops | Account and quota management, job templates, burst handling, fair access, and simple web-based submission |
| LLM training & serving | Fine-tuning, evaluation, inference endpoints, model demos, mixed training/serving clusters | Fast deployment templates, GPU-aware placement, data/model assets, service access, and train/serve resource governance |
| Enterprise AI services | Internal assistants, document intelligence, multimodal services, inference backends | Managed runtime environments, service access, operational visibility, and resource governance |
| Data processing | Dataset preparation, image analysis, batch pipelines, offline preprocessing | Storage integration, dataset/model management, schedulable batch jobs, and observability |

🎯 Features

🏢 Multi-Tenant Governance

Manage users, accounts, queues, quotas, approvals, and billing-oriented resource visibility. Crater turns a raw GPU cluster into an accountable shared service for teams and projects.

⚙️ Policy-Aware Scheduling

Build on Kubernetes and Volcano to support queue-based admission, priority-aware execution, prequeue policies, and workload placement across heterogeneous resources, including mixed training and serving workloads.
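
To make queue-based admission concrete, here is a minimal sketch of the underlying Volcano primitive: a Queue object with a scheduling weight and a GPU cap. The function name render_queue, the queue name, and the numbers are hypothetical. Crater presumably manages queues through its own control plane, so treat this only as an illustration of what Volcano itself exposes, not as Crater's workflow.

```shell
# Render a Volcano Queue manifest to stdout (illustrative values only).
#   $1 = queue name, $2 = scheduling weight, $3 = GPU cap for the queue
render_queue() {
  cat <<EOF
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: $1
spec:
  weight: $2
  reclaimable: true
  capability:
    nvidia.com/gpu: "$3"
EOF
}

# Example (assumes kubectl access to a cluster with Volcano installed):
#   render_queue team-a 2 8 | kubectl apply -f -
```

The weight sets the queue's relative share when the cluster is contended, and capability bounds the total resources its jobs may hold.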

🚀 Workload Lifecycle

Submit, clone, monitor, stop, and inspect AI workloads through Kubernetes-native jobs and reusable templates, from interactive sessions and LLM fine-tuning to long-running batch jobs.

🧪 Interactive Development

Launch containerized Jupyter, WebIDE, web terminals, SSH access, and custom environments without manual cluster setup, giving users a reproducible workspace close to the data and GPUs.

📦 Data, Model & Image Assets

Organize datasets, models, shared files, custom images, registry entries, and platform-side model or dataset downloads so workloads can reuse managed artifacts.

🧩 Heterogeneous Accelerators

Represent GPUs and accelerator models as schedulable resources, supporting NVIDIA GPUs, domestic accelerator cards, vGPU-style resources, and DRA/CDI-based device integration.

📈 Observability & Operations

Troubleshoot with metrics, logs, Grafana dashboards, node status, operation logs, GPU analysis, and runtime inspection, reducing the gap between platform operators and workload owners.

⌨️ Web, CLI & Agent Interfaces

Operate Crater through a web console, command-line interface, HTTP APIs, and agent-oriented command skills for automation, scripted workflows, and AI-assisted operations.

🤖 LLM & AI Service Platform

Support large-model quick deployment, LLM training and inference, inference gateways, model-serving integrations, trusted service integrations, and platform-managed runtime templates.

☸️ Kubernetes-Native Deployment

Deploy with Helm and integrate with Kubernetes, Volcano, Prometheus/Grafana, persistent storage, and cluster add-ons while keeping workloads portable.

🏗️ Architecture

High-level architecture of Crater and its major components.

Crater is organized around four layers:

  • User interfaces: web console, CLI, HTTP APIs, and agent-friendly command skills.
  • Control plane: authentication, accounts, quotas, scheduling policies, jobs, services, templates, images, datasets, models, approvals, and operations.
  • Execution layer: Kubernetes workloads, Volcano scheduling, accelerator resources, Pods, Services, PVCs, and external access rules for training, serving, and interactive environments.
  • Observability and AI operations layer: metrics, logs, Grafana dashboards, operation records, runtime diagnostics, AI assistant workflows, and admin-side intelligent operations.

🚀 Getting Started

1. Prerequisites

2. Set up a cluster

Pick the option that fits your scenario:

| Option | Best for | Reference |
| --- | --- | --- |
| 🐳 Kind | Local clusters in Docker | kind.sigs.k8s.io |
| 🧱 Minikube | Single-node local dev & testing | minikube.sigs.k8s.io |
| ☁️ Production K8s | Production or large-scale deployments | kubernetes.io/docs/setup |
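
Whichever option you pick, the matching command-line tools need to be on your PATH. As a rough sketch, the helper below reports which tools are missing. The function name missing_tools and the tool list in the example are assumptions for illustration, not part of Crater.

```shell
# Print the name of every given command-line tool that is not on PATH,
# one per line. Prints nothing when all tools are available.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Example (assumed tool list; adjust to the option you picked):
#   missing_tools kubectl helm kind
```

An empty result means everything in the list was found.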

3. Install via Helm (OCI)

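# Log in to the GitHub Container Registry, which hosts the chart
helm registry login ghcr.io
# Install the chart as release "crater"; replace <chart-version> with a published version
helm install crater oci://ghcr.io/raids-lab/crater --version <chart-version>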

💡 The chart version is in charts/crater/Chart.yaml (field version) or in the GitHub releases.
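
If you are working from a checkout of the repository, the version field can be read with a short script. This is a minimal sketch that assumes the field sits at the top level of Chart.yaml as a plain `version: x.y.z` line. The helper name chart_version is hypothetical.

```shell
# Print the top-level "version" field of a Helm Chart.yaml.
# Only lines that begin with "version:" match, so "appVersion:" and
# indented versions (for example under "dependencies:") are ignored.
chart_version() {
  awk '/^version:/ { v = $2; gsub(/"/, "", v); print v; exit }' "$1"
}

# Example, run from the repository root:
#   helm install crater oci://ghcr.io/raids-lab/crater \
#     --version "$(chart_version charts/crater/Chart.yaml)"
```

Note that the version in a development checkout may not yet be published to the registry. In that case, use a version from the GitHub releases.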

Deployment guides:

📚 Documentation

📁 Repository Structure

| Path | Description |
| --- | --- |
| backend/ | Backend services |
| frontend/ | Web UI |
| cli/ | Command-line interface |
| charts/ | Helm charts for deploying Crater |
| website/ | Documentation website source |
| grafana-dashboards/ | Grafana dashboards used by Crater |
| docs/ | Documentation entrypoints and localization resources |
| hack/ | Developer tooling and scripts |

💬 Community & Support

🤝 Contributing

We welcome community contributions! The complete development and contribution specification lives in CONTRIBUTING.md: global rules, environment setup (fork, hooks, unified config), workflow, commit convention, PR description template, and per-module entry points.

Per-module specs:

📝 License

Crater is licensed under the Apache License 2.0. See LICENSE.

Copyright 2023-2026 The Crater Project Team, RAIDS-Lab.