Audience: operators running crate on a shared host (multiple operators on one machine) and contributors extending the privileged surface.
Applies to: 1.1.16 (rootless model + per-tenant authz series 1.1.12 →
1.1.15 covering every privops verb that carries an operator-controlled
ownership signal). For the ≤ 0.9.x setuid model and the migration, see
rootless-migration.md.
This document states, explicitly, where cross-tenant isolation is enforced and where it is not — so deployments size their trust boundaries correctly and future work doesn't silently regress them.
Crate has two planes, with two different trust models:
- The privileged execution plane (Plane 1): `crated`'s privops surface. Since 1.0.0 `crate(1)` is no longer setuid (Makefile, `-m 0755`); it is an unprivileged client. Every root operation (jail create/destroy, ZFS attach, mount, RCTL, interface/firewall config, signal) is performed by `crated` (which runs as root) when asked over privops: either `POST /api/v1/privops/<verb>` (admin-only) or the libnv `AF_UNIX` privops socket (group-gated). This plane does not arbitrate between operators. Anyone who can reach privops has root-equivalent control over every jail on the host. It is a single trust domain, the same property the old setuid `crate(1)` had, relocated into the daemon.
- The pooled observability / control plane (Plane 2). This is where per-tenant isolation is enforced, via pool ACLs + token scope, on the entry points that carry caller identity: dedicated control sockets (`getpeereid` gid + pool ACL), remote bearer tokens (scope + pool ACL), and the ws-console (bearer + pool ACL). Surface: `list` / `get` / `stats` / `logs` / `start` / `stop` / `restart` / `PATCH resources` / interactive console.
The invariant: multi-tenant isolation between mutually-distrusting
operators lives only on the pooled plane's identity-carrying entry
points (control-socket pools, token pools + scope), enforced
daemon-side per request and keyed on the target jail's pool. The
privops plane, and local Unix-socket access to the main HTTP API, are
single trust domains. A hostile-multi-tenant deployment must therefore
hand untrusted operators a pool-scoped control socket or bearer
token only — never privops-socket access, never an admin token, and
never the main API's Unix socket.
In 0.7.19 the single-trust-domain privileged plane was the setuid-root
crate(1) binary. The rootless track (0.9.0–1.0.0, see
rootless-migration.md) moved every
privileged operation out of crate(1) and into crated's privops
surface; 1.0.0 removed the setuid bit (Makefile, comment at the
crate install line: "setuid bit removed. crate(1) runs as the
operator and delegates privileged operations to crated(8)").
The single-trust-domain property did not disappear — it relocated.
Reasoning about isolation on 1.1.16 means reasoning about who can reach
privops, not who can run crate(1).
crate(1) is installed -m 0755 (Makefile; setuid removed 1.0.0).
crated runs as root (rc.d) and is the only privileged binary. It
performs root operations only in response to a closed set of privops
verbs (create_jail, destroy_jail, attach_zfs, set_rctl,
configure_iface, add_pf_rule, signal_jail, apply_devfs_ruleset,
… — lib/privops_pure.h). Two transports reach it:
- HTTP: `POST /api/v1/privops/<verb>`. Gated admin-only: `isAuthorized(req, config, "admin")` (`daemon/routes.cpp:1007`). The handler comment states the design intent explicitly: "Privops touch host-wide state … so per-container scope from the F2 surface doesn't apply" (`daemon/routes.cpp:997-999`). On this path the operator uid stays `0` (cpp-httplib doesn't expose the connection fd for `getpeereid`), so per-user audit is a no-op (`daemon/routes.cpp:1013-1019`).
- libnv `AF_UNIX` socket (0.9.14). Group-gated: the listener `chmod`s the socket to its mode and `chown`s it `root:<group>` (`daemon/privops_listener.cpp:179,188`; default mode `0660`, `daemon/config.h:117-119`). `getpeereid(2)` extracts the peer uid (`daemon/privops_listener.cpp:90`) and feeds both the per-user audit / namespacing hook and the authorize-before-dispatch gate below.
Verb dispatch is parse → validate → handle (dispatchPrivOp /
dispatchPrivOpFromMap, daemon/privops_handlers.cpp). Authorization
differs by transport:
- HTTP: no per-resource check. Admin-only and host-wide by design (`daemon/routes.cpp:997-999`).
- libnv (real peer uid): as of 1.1.12 an authorize-before-dispatch gate (`dispatchPrivOpFromMap` → `PrivOpsAuthzPure::authorize`, `lib/privops_authz_pure.cpp`) runs ahead of the handler. A foreign target is denied `403` before the handler runs (fail closed). The gate grew over four releases:
  - 1.1.12 enforces per-user ownership for the verbs that carry a robust ownership signal, keyed on the caller's `composeForUid` env: `attach_zfs` / `detach_zfs` (the `dataset` must lie within the caller's ZFS prefix `<master>/<uid>`) and `set_loginclass_rctl` / `clear_loginclass_rctl` (the `loginclass` must be the caller's `crate-<uid>`).
  - 1.1.13 extends the same gate to jid- and name-scoped verbs: `set_rctl`, `clear_rctl`, `set_jail_cpuset`, `query_jail_rctl`, `signal_jail`, `destroy_jail`. A daemon-owned jid→owner registry (`lib/jid_owner_registry.*`) records the operator uid at `create_jail` time; subsequent verbs from a different operator are denied `403` against the same registry.
  - 1.1.14 extends the same registry with a longest-prefix `byPath` lookup that gates the path-scoped verbs: `mount_nullfs` / `unmount_nullfs` (by `target`) and `apply_devfs_ruleset` / `add_devfs_unhide_rule` (by `mount_path`). A path that lies inside another operator's registered jail is denied `403` (`DenyForeignPath`).
  - 1.1.15 closes the last narrow item: `create_jail`'s brand-new `path` argument is matched against the caller's per-user `pathPrefix` (composed from `path_master_prefix:` in crated.conf); a foreign target is denied `403` (`DenyForeignCreatePath`).
The remaining verbs still pass the gate: host-global verbs (iface/pf/ipfw/nat/epair) cannot be pool-scoped and stay host-wide by design. With 1.1.12+1.1.13+1.1.14+1.1.15 the per-tenant gate covers every privops verb that carries an operator-controlled ownership signal in its request. The per-verb handlers remain uid-blind; the gate runs ahead of them.
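The prefix checks described above (a `dataset` inside `<master>/<uid>`, a `create_jail` `path` inside the caller's `pathPrefix`) come down to a slash-anchored prefix match. The following is a minimal sketch of that idea, not the code in `lib/privops_authz_pure.cpp`; the names `composePrefix` and `ownedByPrefix` are hypothetical, and the sketch assumes its inputs are already normalized.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: builds the per-operator prefix "<master>/<uid>".
std::string composePrefix(const std::string& master, unsigned uid) {
    return master + "/" + std::to_string(uid);
}

// Hypothetical helper: true when `target` is `prefix` itself or lies beneath
// it at a '/' boundary. A plain starts-with test is not enough, because
// "tank/crate/10012" starts with "tank/crate/1001" yet belongs to another uid.
bool ownedByPrefix(std::string_view target, std::string_view prefix) {
    // In this sketch an empty prefix owns nothing. Any "empty means legacy
    // Allow" policy is left to the caller.
    if (prefix.empty() || target.size() < prefix.size()) return false;
    if (target.compare(0, prefix.size(), prefix) != 0) return false;
    // A prefix that already ends in '/' ends at a component boundary.
    if (prefix.back() == '/') return true;
    // Otherwise the match must end exactly at the prefix or at a '/'.
    return target.size() == prefix.size() || target[prefix.size()] == '/';
}
```

With `composePrefix("tank/crate", 1001)` as the caller's prefix, `tank/crate/1001/web` is owned while `tank/crate/10012/web` and `tank/crate/1002/web` are not. The sketch does no path normalization, so a filesystem path containing `..` components would have to be rejected or resolved before this check.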
Consequence: whoever can reach privops (an `admin` bearer token, or membership in the privops socket's group) still has host-wide control over the un-gated surface (firewall, interfaces, other host-global verbs that touch shared state). The 1.1.12 + 1.1.13 + 1.1.14 + 1.1.15 gates close cross-tenant ZFS-dataset, RCTL-umbrella, jid-, name-, path-scoped, and create-jail-path access on the libnv path; every verb that carries an operator-controlled ownership signal in its request is now gated. The shared host-global verbs remain, by design, a single trust domain: handing an operator privops access is still close to handing them the old setuid `crate(1)` for those.
Rootless mode derives per-operator paths, ZFS prefixes, network
sub-CIDRs and an RCTL umbrella class from the connecting operator's uid
(lib/per_user_*, lib/runtime_paths_pure.*; see
rootless-migration.md). This cleanly
separates honest operators — alice's and bob's web jails land in
different ZFS subtrees and CIDRs.
On its own it is not an adversarial boundary on the privops plane. The
per-uid prefix is computed client-side (in crate(1)) and passed in the
request, so a hostile privops-group member can craft a raw nvlist request
naming another operator's prefix. Before 1.1.12 crated did not re-derive
or validate the request's jid / dataset / path against the peer uid and
would act on such a request. From 1.1.12 through 1.1.15 the libnv
authorize-before-dispatch gate validates those arguments daemon-side; the
HTTP transport and the host-global verbs remain unchecked. So "bob can't
run a jail in alice's ZFS prefix" (rootless-migration.md) is
daemon-enforced only where that gate applies; elsewhere it holds for
honest clients going through an honest crate(1). Enforced adversarial
isolation for untrusted operators lives on Plane 2.
crated's non-privileged surface. Per-tenant isolation is enforced on
the entry points that carry caller identity.
Per-group AF_UNIX sockets under /var/run/crate/control/, with three
layers of defense:
- Filesystem perms (kernel). Socket `chmod`'d to the spec mode and `chown`'d `root:<group>` so only group members can `connect(2)` (`daemon/control_socket.cpp:582,586`).
- `getpeereid(2)` gid re-check. Even if the mode is loosened, the peer's gid must equal the socket's expected gid: `daemon/control_socket.cpp:395` feeding `ControlSocketPure::authorize` (`daemon/control_socket_pure.cpp:277`, `Decision::DenyGidMismatch`).
- Pool ACL. For per-container actions the container's pool (`PoolPure::inferPool`, `daemon/control_socket_pure.cpp:292`) must be visible on the socket's `pools` list: `poolVisibleOnSocket` (`:293`, defined `:406`). Mutating actions (`PATCH resources`, `POST start/stop/restart`) additionally require the `admin` role (`:282`, 0.8.13).
Result: alice's socket (pools: ["alice"]) cannot observe, patch, or
start/stop a jail in bob's pool. This is the mechanism a multi-tenant
deployment relies on. Note these sockets cover the control surface
only — they do not expose create_jail / attach_zfs, which are
privops (Plane 1).
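The two daemon-side layers above (gid re-check, then pool ACL, with an `admin` role requirement for mutating actions) can be sketched as a single pure decision function. This is an illustration under assumptions, not `ControlSocketPure::authorize` itself: only `Decision::DenyGidMismatch` is a name taken from this document, while `SocketSpec`, `ControlRequest`, `authorizeSketch`, and the other two deny values are hypothetical. The container's pool is taken as an input because how `PoolPure::inferPool` derives it is not described here.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// DenyGidMismatch is named in the document; the other deny values are
// hypothetical names chosen for this sketch.
enum class Decision { Allow, DenyGidMismatch, DenyNeedsAdmin, DenyPoolNotVisible };

// Hypothetical view of one control socket's configuration.
struct SocketSpec {
    unsigned expectedGid;            // gid the socket file is chown'd to
    bool adminRole;                  // whether this socket carries the admin role
    std::vector<std::string> pools;  // pools visible on this socket
};

// Hypothetical view of one incoming request.
struct ControlRequest {
    unsigned peerGid;           // gid reported for the connected peer
    bool mutating;              // PATCH resources, POST start/stop/restart
    std::string containerPool;  // pool of the target container
};

Decision authorizeSketch(const SocketSpec& sock, const ControlRequest& req) {
    // Layer 2: the peer gid must match even if the file mode was loosened.
    if (req.peerGid != sock.expectedGid) return Decision::DenyGidMismatch;
    // Mutating actions additionally need the admin role.
    if (req.mutating && !sock.adminRole) return Decision::DenyNeedsAdmin;
    // Layer 3: the target container's pool must be listed on the socket.
    const bool visible = std::find(sock.pools.begin(), sock.pools.end(),
                                   req.containerPool) != sock.pools.end();
    return visible ? Decision::Allow : Decision::DenyPoolNotVisible;
}
```

For a socket with `pools: ["alice"]`, a read of a jail in pool `bob` yields `DenyPoolNotVisible` regardless of role, which is the cross-tenant property the paragraph above relies on.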
HTTP API clients authenticate with a bearer token carrying expiry
(daemon/config.h:21; 0 == never), scope path-globs
(daemon/config.h:26; matched by AuthPure::pathInScope,
lib/auth_pure.cpp:81), a role, and a pool ACL (daemon/config.h:31).
checkBearerAuthFull gates expiry + scope + role
(lib/auth_pure.cpp:101); the per-container pool gate is
isAuthorizedForContainer → PoolPure::tokenAllowsContainer
(daemon/auth.cpp:83-84).
Caveat: an `admin` bearer token also unlocks the privops HTTP plane (2a above is role-gated, privops is `admin`-gated). An admin token is therefore host-wide, not pool-confined. Only `viewer` / pool-scoped tokens are isolation-bearing.
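The token checks described above (expiry, scope path-globs, role, pool ACL) can be sketched as one predicate. This is a rough illustration only: the `Token` layout, `globMatch`, `tokenAllows`, the `*`-only glob semantics, the "admin satisfies viewer" role ordering, and the `/api/v1/containers/...` paths are all assumptions made for the example, not the behavior of `AuthPure::pathInScope` or `checkBearerAuthFull`. Only `0 == never` for expiry comes from this document.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

enum class Role { Viewer = 0, Admin = 1 };  // ordering is an assumption

// Hypothetical token layout for this sketch.
struct Token {
    std::int64_t expiresAt = 0;       // 0 == never expires
    std::vector<std::string> scopes;  // path globs
    Role role = Role::Viewer;
    std::vector<std::string> pools;   // pool ACL
};

// Minimal glob: '*' matches any run of characters, everything else is literal.
bool globMatch(std::string_view pat, std::string_view s) {
    std::size_t p = 0, i = 0, star = std::string_view::npos, mark = 0;
    while (i < s.size()) {
        if (p < pat.size() && pat[p] == '*') { star = p++; mark = i; }
        else if (p < pat.size() && pat[p] == s[i]) { ++p; ++i; }
        else if (star != std::string_view::npos) { p = star + 1; i = ++mark; }
        else return false;
    }
    while (p < pat.size() && pat[p] == '*') ++p;
    return p == pat.size();
}

bool tokenAllows(const Token& t, std::int64_t now, std::string_view path,
                 Role required, std::string_view containerPool) {
    // Expiry: 0 means the token never expires.
    if (t.expiresAt != 0 && now >= t.expiresAt) return false;
    // Scope: at least one glob must match; an empty list denies in this sketch.
    bool inScope = false;
    for (const auto& g : t.scopes) {
        if (globMatch(g, path)) { inScope = true; break; }
    }
    if (!inScope) return false;
    // Role: the token's role must be at least the required one.
    if (static_cast<int>(t.role) < static_cast<int>(required)) return false;
    // Pool ACL: the target container's pool must be listed on the token.
    for (const auto& p : t.pools) {
        if (std::string_view(p) == containerPool) return true;
    }
    return false;
}
```

Under these assumptions, a `viewer` token scoped to `/api/v1/containers/*` with `pools: ["alice"]` can read alice's containers, is refused for a container in pool `bob`, and is refused on any path under `/api/v1/privops/`.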
The websocket console grants an interactive jexec shell inside a
jail, so its gate is load-bearing. It requires an admin bearer token
(daemon/ws_console.cpp:231) and that the jail's pool be allowed by
the token (PoolPure::inferPool / tokenAllowsContainer,
daemon/ws_console.cpp:254-255).
The main crated HTTP API is also reachable over a local Unix socket.
On that path cpp-httplib does not expose the peer fd, so getpeereid(2)
is not wired in. Unix-socket peers are trusted wholesale:
- `isAuthorized` returns `true` immediately for Unix peers, bypassing bearer auth (`daemon/auth.cpp:48-49`).
- `isAuthorizedForContainer` likewise bypasses the pool ACL for Unix peers (`daemon/auth.cpp:74-75`).
The socket file mode (default 0660 root:wheel) is the only gate —
daemon/auth.cpp:36-40. So local access to the main API is, like
privops, a single trust domain. getpeereid-based auth here is
still future work (roadmap §5.3). Only the dedicated control sockets
(2a) carry per-pool identity locally.
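The effect of that early return is easiest to see side by side with the remote path. The sketch below only models the shape of the decision as described in this section; `PeerSketch` and both function names are hypothetical, and the real functions in `daemon/auth.cpp` take different arguments.

```cpp
// Hypothetical summary of what the daemon knows about a caller.
struct PeerSketch {
    bool viaUnixSocket;  // connected over the main API's local Unix socket
    bool bearerOk;       // bearer token passed expiry + scope + role
    bool poolAllowed;    // token's pool ACL covers the target container
};

// Models the general gate: Unix peers skip bearer auth entirely.
bool isAuthorizedSketch(const PeerSketch& p) {
    if (p.viaUnixSocket) return true;
    return p.bearerOk;
}

// Models the per-container gate: Unix peers also skip the pool ACL.
bool isAuthorizedForContainerSketch(const PeerSketch& p) {
    if (p.viaUnixSocket) return true;
    return p.bearerOk && p.poolAllowed;
}
```

A Unix-socket peer with no token and no pool membership passes both checks, which is why the socket's file mode is the only thing standing between a local user and every pool.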
Cross-tenant isolation = pool ACLs (control-socket `pools`, token `pools`) + token scope, enforced per request, keyed on the target jail's pool. It exists only on entry points that carry caller identity (Plane 2a/2b/2c). The privops plane and the main API's Unix socket are single trust domains.
This is the contract any new privileged surface inherits.
1.1.12 began closing this gap: the libnv path now authorizes
attach_zfs/detach_zfs (by ZFS prefix) and the loginclass-RCTL verbs
(by crate-<uid>) before dispatch. The rest of Plane 1 is still a
single trust domain — host-wide admin token on HTTP, host-wide group
membership on the libnv socket for every un-gated verb. That is fine
as long as privops access is treated as equivalent to handing out the
old setuid crate(1) — i.e. only ever given to fully-trusted
operators.
The unresolved tension: the rootless model requires crate(1) to
reach privops to create a jail at all, while the per-user namespacing
(above) is marketed as multi-tenant isolation. To make privops itself
safe for mutually-distrusting operators, the following MUST hold for
every verb — or the per-user split is honest-operator hygiene, not a
security boundary:
1. Authorize before dispatch. `getpeereid` uid/gid identifies the caller; it does not authorize the operation. Every privileged verb must run an ownership check (the same shape as `poolVisibleOnSocket` / `tokenAllowsContainer`) keyed on the target jail/pool/dataset, before the operation runs.
   - Done (1.1.12): dataset and loginclass verbs (`lib/privops_authz_pure.cpp`).
   - Done (1.1.13): the jid- and name-scoped verbs: `set_rctl`, `clear_rctl`, `set_jail_cpuset`, `query_jail_rctl`, `signal_jail` (gated by `jid`) and `destroy_jail` (gated by jail `name`). A daemon-owned jid→owner registry (`lib/jid_owner_registry.*`, persisted at `/var/db/crate/jid_owners.tsv`) records the operator uid at `create_jail` time; subsequent jid/name-scoped verbs from a different operator are denied 403 before the handler runs. Jails that pre-date 1.1.13 are not in the registry; the gate's bootstrap concession allows them through to preserve the upgrade path.
   - Done (1.1.14): the path-scoped verbs: `mount_nullfs` / `unmount_nullfs` (gated by `target`) and `apply_devfs_ruleset` / `add_devfs_unhide_rule` (gated by `mount_path`). The registry now exposes a longest-prefix `byPath` lookup; a path inside a tracked jail owned by another uid is denied 403 (`DenyForeignPath`); paths outside every registered jail fall through under the same bootstrap concession.
   - Done (1.1.15): the last narrow item, the `create_jail` `path` argument. `PerUserEnvPure::Config` gains `pathMasterPrefix` (configured via crated.conf `path_master_prefix:`); `composeForUid()` derives `env.pathPrefix = <master>/<uid>`. The gate runs `PrivOpsAuthzPure::pathOwned(req.path, env.pathPrefix)`, a slash-anchored prefix match identical in shape to `datasetOwned` for ZFS. A foreign target is denied `403` (`DenyForeignCreatePath`). Empty `pathMasterPrefix` keeps the legacy shape (`Allow`), so existing deployments don't need to reconfigure on upgrade.
2. Per-operator namespacing is convenience, not a boundary. Any `path` / `jid` / `dataset` argument crossing the privops socket must be re-derived or validated daemon-side against the caller's uid-prefix, never taken at face value. Done for `dataset` (1.1.12), for `jid` / jail `name` (1.1.13, via the registry), for path-scoped runtime verbs (1.1.14, longest-prefix `byPath` on the same registry), and for `create_jail`'s brand-new path (1.1.15, slash-anchored prefix against `env.pathPrefix`).
3. Fail closed on identity loss. On any path that authorizes, a `getpeereid` failure must deny. It may degrade to a no-op only for identity-tagged side effects that are not access decisions (e.g. the audit tail, which is its current behavior).
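The three requirements above combine naturally in a small owner registry: record the owner at create time, look it up by jid or by longest matching jail path, let unregistered targets through (the bootstrap concession), and deny outright when the peer uid is unknown. The sketch below illustrates that shape under assumptions. It is not `lib/jid_owner_registry.*`; `OwnerRegistrySketch`, `gateJid`, `gatePath`, and every `Verdict` value except `DenyForeignPath` are hypothetical names, and persistence to the TSV file is left out.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <string_view>

using Uid = unsigned;

// DenyForeignPath is named in the document; the other values are hypothetical.
enum class Verdict { Allow, DenyNoIdentity, DenyForeignJid, DenyForeignPath };

class OwnerRegistrySketch {
public:
    // Called at create_jail time with the creating operator's uid.
    void record(int jid, const std::string& root, Uid owner) {
        byJid_[jid] = owner;
        byRoot_[root] = owner;
    }
    std::optional<Uid> ownerOfJid(int jid) const {
        auto it = byJid_.find(jid);
        if (it == byJid_.end()) return std::nullopt;
        return it->second;
    }
    // Longest-prefix lookup: the deepest registered jail root containing `path`.
    std::optional<Uid> ownerOfPath(std::string_view path) const {
        std::optional<Uid> best;
        std::size_t bestLen = 0;
        for (const auto& [root, owner] : byRoot_) {
            if (root.empty() || path.size() < root.size()) continue;
            if (path.compare(0, root.size(), root) != 0) continue;
            // Slash-anchored: "/jails/a/webx" is not inside "/jails/a/web".
            const bool atBoundary =
                path.size() == root.size() || path[root.size()] == '/';
            if (atBoundary && root.size() > bestLen) {
                best = owner;
                bestLen = root.size();
            }
        }
        return best;
    }
private:
    std::map<int, Uid> byJid_;
    std::map<std::string, Uid> byRoot_;
};

// `peer` is empty when the peer uid could not be obtained: fail closed.
Verdict gateJid(const OwnerRegistrySketch& reg, std::optional<Uid> peer, int jid) {
    if (!peer) return Verdict::DenyNoIdentity;
    const auto owner = reg.ownerOfJid(jid);
    if (!owner) return Verdict::Allow;  // bootstrap concession: untracked jail
    return *owner == *peer ? Verdict::Allow : Verdict::DenyForeignJid;
}

Verdict gatePath(const OwnerRegistrySketch& reg, std::optional<Uid> peer,
                 std::string_view path) {
    if (!peer) return Verdict::DenyNoIdentity;
    const auto owner = reg.ownerOfPath(path);
    if (!owner) return Verdict::Allow;  // outside every registered jail
    return *owner == *peer ? Verdict::Allow : Verdict::DenyForeignPath;
}
```

The design choice worth noting is that the missing-identity check comes first: an untracked jail is allowed only for a caller whose uid is known, so the bootstrap concession never doubles as an identity bypass in this sketch.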
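The host-global verbs (iface/pf/ipfw/nat/epair) remain host-wide by design, and the gate's bootstrap concessions still let unregistered jails and paths through. A multi-tenant deployment that exposes privops verbs to operators directly therefore still needs a trusted broker rather than handing operators raw privops-socket access.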
- Give untrusted operators: a pool-scoped dedicated control socket (2a) and/or a `viewer` / pool-scoped bearer token (2b) only.
- Do not give untrusted operators: privops-socket (group) access, an `admin` bearer token, the main API's Unix socket, or a `crate(1)` that must reach privops to create jails.
- Honest-operator separation (per-user paths/datasets/CIDRs) and enforced adversarial isolation (pool-ACL'd Plane 2) are different guarantees; don't conflate them.
Running a Wayland compositor inside a jail can expose host device nodes that the default devfs ruleset hides:
- `gui.backend: headless` (default): renders offscreen and is surfaced over VNC (wayvnc). It exposes no input devices; it touches `/dev/dri/*` only when a render node is present (GPU acceleration). Safe to hand to a semi-trusted workload.
- `gui.backend: drm`: the jail drives the physical GPU and input directly. crate unhides `/dev/dri/*` and `/dev/input/*` in the jail's devfs view and binds the host `seatd` socket. This is a real privilege surface: a process in such a jail can read all input events host-wide (every keystroke, every pointer event) and talk to the GPU at the KMS level. Treat a `drm`-backend jail as part of the host's trusted display domain; do not grant it to a mutually-distrusting tenant. It is opt-in precisely because the default (`headless`) avoids this exposure.
- `rootless-migration.md`: the 0.9.0–1.0.0 setuid → privops migration and the per-user model.
- `security-command-paths.md`: absolute command paths / env-sanitization (CWE-426). Now most relevant to `crated`, which is the process that execs host tools as root.
- `implementation-roadmap.md` §5.1 (high-level REST write endpoints), §5.3 (`getpeereid` auth on the main API).