Replace bwrap+socat with vendored srt-launcher on Linux #272

Open
dylan-conway wants to merge 14 commits into main from remove-bwrap-socat

@dylan-conway

Collaborator

The Linux sandbox no longer depends on bubblewrap or socat. They're replaced by srt-launcher, a single statically-linked Rust binary vendored under vendor/srt-launcher-rs/ and shipped prebuilt under vendor/srt-launcher/{x64,arm64}/.

Two commits:

  1. vendor: add srt-launcher Rust crate — the binary itself. run (namespace + mount + relay-fork + seccomp + exec), relay (host-side Unix→TCP bridge for the external-proxy case only), connect (ssh ProxyCommand HTTP CONNECT helper). The mount module is written against raw libc so it reads side-by-side with bubblewrap's bind-mount.c/bubblewrap.c for review.
  2. linux: replace bwrap+socat with vendored srt-launcher — TS wiring, tests, CI, README.

Architecture

Linux, internal proxy (default):
  tool → 127.0.0.1:3128/1080  (relay forked by srt-launcher inside the netns)
       → unix socket          (bind-mounted, in a private 0700 dir)
       → proxy server         (host node process, listening on the unix socket)

Linux, external proxy (network.httpProxyPort/socksProxyPort set):
  ... → unix socket → srt-launcher relay (host) → 127.0.0.1:<port>

macOS: unchanged (TCP, seatbelt).

There is one namespace layer instead of two: PID 1 and the relays are forked from srt-launcher with PR_SET_DUMPABLE=0 from the start, so the seccomp'd worker can't ptrace or write /proc/N/mem against them regardless of kernel.yama.ptrace_scope. The nested PID namespace from apply-seccomp existed to hide bwrap's init, which we don't have anymore.

Breaking config changes

  • bwrapPath, socatPath, seccomp config fields are removed. New: launcher: { path?, argv0? } for overriding the binary location or invoking it as a multicall sub-binary.
  • SandboxManager.getProxyPort() / getSocksProxyPort() return undefined on Linux when the internal proxy is used (it listens on a Unix socket, not a TCP port). They still return the configured port when network.httpProxyPort/socksProxyPort is set.

Runtime dependencies

         before                        after
Linux    bubblewrap, socat, ripgrep    ripgrep

srt-launcher is statically linked against musl; no host libc requirement.

Build dependencies (CI / npm run build:launcher)

rustup + <arch>-unknown-linux-musl target, gcc + libseccomp-dev (for the build-time BPF generator), musl-tools.

What didn't change

The bind-mount flag surface (--bind/--ro-bind/--tmpfs/--proc/--dev) and per-path mount semantics match bwrap's, so generateFilesystemArgs() and the symlink/ancestor/mandatory-deny handling that feed it are unchanged. macOS path is untouched.

A single statically-linked binary that provides the Linux sandbox
primitives we currently get from three external programs:

  bwrap         -> srt-launcher run     namespace + filesystem isolation + exec
  apply-seccomp -> srt-launcher run     unix-socket-block seccomp before exec
  socat         -> srt-launcher run     in-sandbox TCP<->Unix relays
                   srt-launcher relay   host-side bridge to an external proxy
                   srt-launcher connect ssh ProxyCommand HTTP CONNECT helper

`run` does one unshare(USER|PID|NS[|NET]), forks PID 1, sets up the mount
namespace via the same pivot_root + bind-remount sequence bubblewrap uses,
forks the proxy relays (with PR_SET_DUMPABLE=0, so the seccomp'd worker
can't ptrace them), forks the worker, applies the baked-in BPF filter,
and execs. There is no nested PID namespace: PID 1 and the relays are our
own forks with DUMPABLE=0 from the start, which is what apply-seccomp's
nested layer existed to achieve when bwrap's init was the thing to hide.

The mount module is written against raw libc so it reads side-by-side
with bubblewrap's bind-mount.c / bubblewrap.c for review. The BPF filter
is the existing seccomp-unix-block.c output, embedded via include_bytes!.

Builds with `npm run build:launcher` (rustup + musl target + gcc +
libseccomp-dev for the build-time BPF generator). ~425 KB stripped per
arch.

apply-seccomp.c and vendor/seccomp/build.ts are removed; the BPF
generator (seccomp-unix-block.c) stays as a build-time dependency.

The Linux sandbox no longer depends on bubblewrap or socat. The
vendored srt-launcher binary (previous commit) covers all three roles:
namespace/filesystem isolation, the in-sandbox proxy relays, and the
unix-socket-block seccomp filter.

Functional changes:

- The HTTP and SOCKS proxy servers listen on a Unix socket on Linux
  (in a private mode-0700 mkdtemp directory). The in-sandbox relay
  connects to that socket directly via a bind-mount; there is no
  host-side TCP loopback hop or bridge process for the internal-proxy
  case. macOS keeps its TCP listener.

- When network.httpProxyPort / socksProxyPort points at an external
  proxy, `srt-launcher relay` runs on the host as a Unix-to-TCP bridge
  to that port. The relay sets PR_SET_PDEATHSIG and signals readiness
  on stdout, so cleanup is `kill('SIGKILL')` with no exit-event
  monitoring.

- wrapCommandWithSandboxLinux emits a single
  `srt-launcher run [opts] -- bash -c '<cmd>'` argv. No shell glue
  between the sandbox layer and the user command.

- GIT_SSH_COMMAND uses `srt-launcher connect` as ssh's ProxyCommand
  (replaces `socat - PROXY:...`). The launcher validates host/port
  before building the CONNECT request line.
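
The host/port check in the last bullet can be pictured roughly as follows. This is a minimal sketch, not the launcher's code: the function names and the exact rules (hostname characters only, non-zero 16-bit port) are assumptions made for illustration. The point is that a host string containing CR/LF or spaces never reaches the request line.

```rust
/// Hypothetical validator: accepts only characters that can appear in a
/// hostname or IPv4 literal, so CR/LF and spaces are rejected up front.
fn validate_target(host: &str, port: &str) -> Result<(String, u16), String> {
    if host.is_empty() || host.len() > 253 {
        return Err(format!("invalid host length: {}", host.len()));
    }
    if !host
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || b == b'.' || b == b'-')
    {
        return Err("host contains a disallowed character".to_string());
    }
    let port: u16 = port
        .parse()
        .map_err(|_| format!("invalid port: {port:?}"))?;
    if port == 0 {
        return Err("port must be non-zero".to_string());
    }
    Ok((host.to_string(), port))
}

/// Build the HTTP CONNECT request only from validated parts.
fn connect_request(host: &str, port: &str) -> Result<String, String> {
    let (host, port) = validate_target(host, port)?;
    Ok(format!(
        "CONNECT {host}:{port} HTTP/1.1\r\nHost: {host}:{port}\r\n\r\n"
    ))
}
```

With this shape, `connect_request("github.com", "22")` yields a request, while a host such as `"evil.com\r\nX-Injected: 1"` is rejected before any bytes are written to the proxy.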

Config / API:

- `bwrapPath`, `socatPath`, and `seccomp` config fields are removed.
  A `launcher: { path?, argv0? }` field covers the override and
  multicall cases.
- `getProxyPort()` / `getSocksProxyPort()` return undefined on Linux
  when the internal proxy is used (it has no TCP port).
- `cleanupBwrapMountPoints` -> `cleanupSandboxMountPoints` (the
  mechanism is intrinsic to bind-mount deny, not to bwrap).

CI: build:seccomp -> build:launcher; install rustup + musl target +
musl-tools instead of bubblewrap + socat.

The bind-mount flag surface (--bind / --ro-bind / --tmpfs / --proc /
--dev) and the per-path mount semantics are unchanged from the bwrap
path, so generateFilesystemArgs() and the symlink/ancestor handling
that feed it are kept as-is.
…_char; macOS test gate

- mount.rs: in a userns, mounts inherited from the parent ns are
  MNT_LOCKED and MS_REMOUNT returns EPERM even when only adding
  restrictions. Tolerate that for submounts (the kernel won't let the
  sandbox loosen them either, so skipping is no weaker than the host);
  EPERM on the bind root itself stays fatal. Surfaced on the GitHub
  runner via /proc/sys/fs/binfmt_misc — bwrap didn't hit this because
  the apt-packaged bwrap is setuid and ran as real root.

- mount.rs: ttyname_r buffer uses libc::c_char (u8 on aarch64, i8 on
  x86_64) instead of i8.

- srt-launcher.test.ts: gate the vendored-binary-resolution test on
  Linux (the binary isn't built or shipped on macOS).

CLI surface: --unshare-pid, --new-session, --die-with-parent, and
--chdir are removed. The first three were always passed and gate
correctness properties (PID-1/reaper architecture, TIOCSTI defense,
orphan prevention) that are not tunables; they're now applied
unconditionally. --chdir was never emitted (the launcher captures and
restores the spawn-time cwd itself).

unsafe reductions:
- die_errno! delegates to die! via io::Error::last_os_error;
  errno_str() and the raw __errno_location() reads are gone
- FORWARD_TARGET: static mut -> AtomicI32 (Relaxed)
- libc::isatty -> stdin().is_terminal()
- CStr::from_ptr on the ttyname buffer -> CStr::from_bytes_until_nul
- libc::setenv loop -> env::set_var
- ready-fd write+close -> File::from_raw_fd().write_all()
- mount() returns Result<(), i32> so callers don't re-read errno
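
Two of the reductions above, sketched with the standard library only. The helper names and the "0 means unset" convention are assumptions for illustration; only `FORWARD_TARGET`, `AtomicI32` with `Relaxed` ordering, and `CStr::from_bytes_until_nul` come from the list.

```rust
use std::ffi::CStr;
use std::sync::atomic::{AtomicI32, Ordering};

// Stand-in for the launcher's FORWARD_TARGET: the PID a signal handler
// forwards to. An atomic needs no `unsafe`, unlike `static mut`.
static FORWARD_TARGET: AtomicI32 = AtomicI32::new(0);

// Hypothetical setter; 0 is treated as "no target yet" in this sketch.
fn set_forward_target(pid: i32) {
    FORWARD_TARGET.store(pid, Ordering::Relaxed);
}

// Hypothetical getter that maps the 0 sentinel to None.
fn forward_target() -> Option<i32> {
    match FORWARD_TARGET.load(Ordering::Relaxed) {
        0 => None,
        pid => Some(pid),
    }
}

/// Decode a NUL-terminated name from a fixed-size byte buffer (such as one
/// filled by ttyname_r) without `CStr::from_ptr`. Returns None if the
/// buffer has no NUL terminator or is not valid UTF-8.
fn name_from_buffer(buf: &[u8]) -> Option<&str> {
    CStr::from_bytes_until_nul(buf).ok()?.to_str().ok()
}
```

The safe `CStr` constructor checks for the terminator itself, so an unterminated buffer becomes a `None` rather than an out-of-bounds read.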

Dead code:
- bitflags_lite! macro replaced with enum BindKind (the variants are
  never combined)
- parse_tcp_target's "localhost" arm (callers always pass 127.0.0.1)
- connect's --proxy default (caller always passes it; now required)
- relay_fork return value, Clone derives on RelaySpec/MountOp
- relay_main trailing ExitCode::SUCCESS -> unreachable!()

Also: dedupe wait-status decode, merge geteuid/getegid into one
unsafe block, use io::Error from write_all instead of re-reading
errno, fix outer-stub waitpid to not decode status on error return.
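
A single wait-status decoder might look like the sketch below. It is an assumption about shape, not the launcher's code: the function name is hypothetical, and mapping a fatal signal to `128 + signo` is the usual shell convention, which the commit message does not confirm.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

/// Turn a raw waitpid() status into a process exit code.
/// Assumption: signals map to 128 + signo; anything else maps to 1.
fn exit_code_from_wait_status(raw: i32) -> i32 {
    let status = ExitStatus::from_raw(raw);
    if let Some(code) = status.code() {
        code // normal exit: WEXITSTATUS
    } else if let Some(sig) = status.signal() {
        128 + sig // killed by a signal: WTERMSIG
    } else {
        1 // stopped/continued or otherwise unexpected
    }
}
```

Keeping the decode in one function means a caller only invokes it after waitpid has actually succeeded, which is the case the "don't decode status on error return" fix is about.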

Net: -44 LOC, ~-13 unsafe blocks. clippy -D warnings clean on x64
and arm64.
…x pidns

The relays now fork from the stub between the NET unshare and the NS|PID
unshare, landing them in {host pidns, host mountns, sandbox netns}:

  stub             unshare(USER?|NET) → lo up → fork relays → unshare(NS|PID)
  ├─ relay :3128   host pidns/mountns, sandbox netns — invisible to workload
  ├─ relay :1080   (same)
  ══════════════════ mount + PID ns boundary
  └─ PID 1         setup_filesystem → fork worker → reap (unchanged)
     └─ worker     seccomp → exec

The workload has no PID for the relays — kill, ptrace, and /proc
enumeration all return ESRCH/ENOENT — which is structurally stronger
than the previous DUMPABLE=0 protection and closes the `kill -9 -1`
self-DoS that DUMPABLE=0 doesn't.

The relay also now opens an O_PATH fd to the bridge socket at startup
(before the workload exists) and connects via /proc/self/fd/N, pinning
the inode. A workload with rw access to the socket's directory (via a
bind mount of /tmp) can no longer redirect the relay to a different
host unix socket by swapping the path.
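
The inode-pinning idea can be sketched with the standard library alone. The helper names are hypothetical, and the `O_PATH` constant is hardcoded to its x86_64/aarch64 Linux value only to keep the example dependency-free; real code would take it from the `libc` crate.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;
use std::os::unix::io::AsRawFd;
use std::os::unix::net::UnixStream;
use std::path::Path;

/// O_PATH on x86_64 and aarch64 Linux (assumption: not portable to every
/// architecture; use libc::O_PATH in real code).
const O_PATH: i32 = 0o10000000;

/// Hypothetical helper: open the socket with O_PATH so the fd refers to
/// the inode that exists at this moment, whatever later happens to the path.
fn pin_socket(path: &Path) -> io::Result<File> {
    OpenOptions::new().read(true).custom_flags(O_PATH).open(path)
}

/// Connect through the pinned fd via /proc/self/fd/N instead of
/// re-resolving the original path.
fn connect_pinned(pinned: &File) -> io::Result<UnixStream> {
    UnixStream::connect(format!("/proc/self/fd/{}", pinned.as_raw_fd()))
}
```

If the path is renamed away and a different socket is bound in its place after `pin_socket` returns, `connect_pinned` still reaches the original listener, because the kernel resolves the `/proc` link to the pinned inode rather than to the directory entry.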

Relays drop all caps and set PR_SET_PDEATHSIG immediately after fork
(they no longer get the kernel's pidns-teardown SIGKILL); the stub
reaps them and explicitly kill(-pgid, SIGKILL)s them after PID 1 exits.

The stub now also drops caps after forking PID 1 (it only waits and
forwards signals from there). DUMPABLE=0 is set on the stub after the
userns map writes (writing /proc/self/{uid,gid}_map needs the files to
be owned by us, which DUMPABLE=0 prevents) and inherited by PID 1;
relays set it themselves.

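Three processes (stub → PID 1 → worker) is the floor: unshare(NEWPID)
requires a fork to enter, and execve resets signal handlers to their
defaults. The kernel only delivers a signal to a PID namespace's init
if init has installed a handler for it (SIGKILL and SIGSTOP sent from an
ancestor namespace excepted), so a PID 1 that is the workload can't
receive SIGTERM.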

mount.rs is unchanged; PID 1 still does all of setup_filesystem.

Tests added: relay survives `kill -9 -1` from inside the pidns; relay
ignores a workload-side socket-path swap.

@LeoniePhiline


Why, though?

Why try and reinvent (even vibe code) niche security sandbox tooling when widely used, trusted, reviewed industry standards exist?

Reconcile Windows proxy port-range support with Linux unix-socket
listen targets in sandbox-manager; keep WindowsConfigSchema, drop
bwrapPath/socatPath/SeccompConfig.
…untinfo lookup

Resolve bind sources on host root before pivot; capture ttyname(1)
before unshare; refuse to run setuid; die on root remount failure
and on missing mountinfo entry; tolerate only EACCES on submounts.
…inherited fds before exec

Re-arm PDEATHSIG and set TCP keepalive in host-relay children; retry
read on EINTR in splice; skip kill(-pgid) for already-reaped relays;
const-assert BPF blob size; cwd $HOME fallback + $PWD export.
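
A compile-time size check on an embedded BPF blob could look like this sketch. The two-instruction blob is a hand-written stand-in for the `include_bytes!` output, not the real filter, and the exact condition the launcher asserts is not shown in this PR; the only fact relied on is that a classic BPF instruction (`struct sock_filter`) is 8 bytes.

```rust
/// Stand-in for `include_bytes!(...)`: two little-endian sock_filter
/// instructions, NOT the real unix-socket-block filter.
const BPF_BLOB: &[u8] = &[
    0x20, 0x00, 0, 0, 0x00, 0x00, 0x00, 0x00, // ld  [0]  (BPF_LD|BPF_W|BPF_ABS)
    0x06, 0x00, 0, 0, 0x00, 0x00, 0xff, 0x7f, // ret 0x7fff0000 (SECCOMP_RET_ALLOW)
];

/// sizeof(struct sock_filter): u16 code, u8 jt, u8 jf, u32 k.
const SOCK_FILTER_SIZE: usize = 8;

// Fails the build, not the sandbox launch, if the blob is empty or is not
// a whole number of instructions.
const _: () = assert!(!BPF_BLOB.is_empty() && BPF_BLOB.len() % SOCK_FILTER_SIZE == 0);

/// Number of instructions to put in sock_fprog.len.
const fn instruction_count() -> usize {
    BPF_BLOB.len() / SOCK_FILTER_SIZE
}
```

A truncated or corrupted blob is then caught when the crate is compiled rather than when the filter is installed.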