Operator-facing documentation for the KRG infrastructure. Architecture and build/deploy basics live in the top-level README, nix/README, ansible/README, and CLAUDE.md; the runbooks and references here are the "how do I actually operate / recover this" layer.

| Doc | When you need it |
|---|---|
| disaster-recovery.md | Rebuild a host (or the whole fleet) from bare metal; what's reproducible vs. what must be restored from backup |
| joining-a-host-to-the-domain.md | One-time AD domain join (NixOS member / Debian / the DC) |
| creating-a-user.md | Create a KRG.LOCAL account and grant it login / GPU access |
| openbao-bringup.md | Day-0 init / unseal / structure of OpenBao on krg-vault (prerequisite for the deploy + Garage runbooks) |
| krg-deploy-ansible-setup.md | Wire krg-deploy's unattended CD to pull NAS secrets from OpenBao via AppRole at apply time |
| garage-ui-bringup.md | One-time first deploy of the Garage S3 admin UI on e4e-nas from krg-deploy |
| e4e-nas-dsm.md | e4e-nas break-glass + one-time migration sheet — the DSM settings with no API surface |
| kerberos-long-jobs.md | Keep a long job / SMB mount authenticated past the Kerberos ticket lifetime (krenew, no keytab) |
| working-remotely.md | SSH from outside UCSD: strict-tier (UCSD + ops) vs compute-tier (global, CrowdSec-gated), and getting unbanned |
| troubleshooting.md | Symptom-first recovery for the gotchas this fleet has hit (boot freeze, AD/login, scratch, ZFS) |

| Doc | What it covers |
|---|---|
| fleet-inventory.md | Every host — IP, role, VMID, hypervisor — plus the Prometheus monitoring map |
| waiter-topology.md | waiter storage (ZFS/impermanence) + network diagrams |
| scratch-greenfield.md | waiter /scratch ZFS-native design (replaced autotier): pools/vdevs, the NFS overflow + scratch-restore, how to operate it |
| fabricant-topology.md | fabricant (Proxmox) storage + NFS + firewall diagrams |
| krg-ldap-topology.md | krg-ldap (AD DC) storage + network diagrams |
| krg-prod-iac.md | How the krg-prod + e4e-nas IaC maps onto this repo's nix/ansible/terraform layers, and the NAS standup plan |
| e4e-prod-tenant-platform.md | e4e-prod multi-tenant platform for student-built projects: sealed microVM per tenant, edge TLS + OpenBao PKI, the krg.tenants interface, onboarding |

Topology and monitoring diagrams are Mermaid and render inline on GitHub.

Decision/analysis records that aren't operate-or-recover runbooks:

| Doc | What it covers |
|---|---|
| scratch-architecture-options.md | Historical: the /scratch redesign options (autotier symptoms, rejected brownfield, cost analysis); as-built is scratch-greenfield.md |
| e4e-nas-crowdsec-evaluation.md | Why CrowdSec is not added to the NAS: DSM-native AutoBlock + Firewall + GeoIP are used instead |

Immutable decision records live under adr/. All are Accepted.

| ADR | Decision |
|---|---|
| 0001 | Git is the single source of truth for krg-prod + e4e-nas; UI/by-hand changes are drift to be reconciled, not blessed |
| 0002 | Use Garage (not MinIO) for S3 object storage |
| 0003 | Garage runs on the NAS (dedicated storage), not the IO-budgeted krg-prod VM — deployment mechanism later amended by 0007 |
| 0004 | The krg-prod VM operates under a disk-IO budget — only low-IO workloads belong on it |
| 0005 | krg-prod IaC integrates into this repo; OpenTofu over Terraform; krg-deploy is the control node |
| 0006 | No Qualys/Trellix (OEC) on DSM — DSM-native Security Advisor replaces it |
| 0007 | The DSM tofu/ansible split follows API surface (real provider resources vs CLI-only), not "appliance-ness" |
| 0008 | e4e-prod is a multi-tenant platform for student-built projects — sealed microVM per tenant, repo-owned deploys, LE-terminate-then-re-encrypt edge |
| 0009 | Lab-internal PKI is a private OpenBao CA hooked into AD — machines issue via AppRole, humans via AD-group-gated LDAP; CA trusted fleet-wide; separate from public Let's Encrypt |
| 0010 | KRG.LOCAL AD structure (groups, service accounts, ACLs, password policy) is IaC in spec/krg-ad + ansible/krg-ad; apply is non-authoritative (adds, never deletes); humans come from roster |
| 0011 | Cross-layer deploy ordering: a phased pipeline (foundation → converge → verify), not a single linear order — back-edges between Ansible/NixOS/Tofu can't be solved by reordering |
| 0012 | Endpoint device management is a lab-owned control plane, split Fleet (MDM for Windows/macOS/iOS/Android) + the flake (NixOS) — not solely campus Intune; FDE escrow + remote wipe on the lab's own timeline |
| 0013 | SSO via Authentik is the front door, AD is the identity source — services federate (Authentik first, direct-AD fallback) rather than ship local accounts; lab members are AD users, web-only collaborators are minimal-trust Authentik-local accounts |
| 0014 | Proxmox VE authenticates via the Authentik LDAP outpost (not OIDC — Android app can't redirect; not raw AD — PVE won't expand nested groups / would clutter AD); the outpost flattens groups, PVE keys off proxmox-admins |