[research] Supporting alternative container runtimes (gVisor, Kata Containers) in AWF #3264

@lpcox

Summary

Research findings and recommendations for supporting alternative OCI-compliant container runtimes (gVisor/runsc, Docker SBX/Kata Containers) in AWF, while maintaining security guarantees for network isolation, volume mounts, and syscall filtering.

Current Architecture

AWF currently uses the default Docker runtime (runc) with extensive security hardening:

| Layer | Mechanism | Files |
|---|---|---|
| Syscall filtering | Seccomp deny-by-default profile (~350 allowed syscalls) | containers/agent/seccomp-profile.json |
| Capabilities | Agent: cap_add: SYS_CHROOT, SYS_ADMIN (dropped before user code), cap_drop: NET_RAW, SYS_PTRACE, SYS_MODULE, SYS_RAWIO, MKNOD | src/services/agent-service.ts:97-105 |
| Network isolation | iptables DNAT via init container sharing agent network namespace | containers/agent/setup-iptables.sh, src/host-iptables-rules.ts |
| Filesystem isolation | chroot to /host with selective bind mounts (system dirs RO, workspace RW) | containers/agent/entrypoint.sh, src/services/agent-volumes.ts |
| Privilege escalation | no-new-privileges:true, UID/GID remapping, capability drop via capsh | src/services/agent-service.ts:111, containers/agent/entrypoint.sh:356-365 |
| Process limits | pids_limit: 1000, mem_limit: 6g | src/services/agent-service.ts:119-122 |
| AppArmor | Set to unconfined (required for procfs mount, safe because SYS_ADMIN dropped before user code) | src/services/agent-service.ts:113 |
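
For orientation, the hardening settings in the table can be written down as a typed object. This is only a sketch: the `HardenedService` interface and the `overlappingCaps` helper are hypothetical names, the values are copied from the table, and the seccomp entry is left out because the table does not show how the profile path is passed to Docker.

```typescript
// Hypothetical shape for illustration; AWF's real service type is not shown in this issue.
interface HardenedService {
  cap_add: string[];
  cap_drop: string[];
  security_opt: string[];
  pids_limit: number;
  mem_limit: string;
}

// Values taken from the table above (seccomp profile reference omitted).
const agentHardening: HardenedService = {
  cap_add: ['SYS_CHROOT', 'SYS_ADMIN'],
  cap_drop: ['NET_RAW', 'SYS_PTRACE', 'SYS_MODULE', 'SYS_RAWIO', 'MKNOD'],
  security_opt: ['no-new-privileges:true', 'apparmor:unconfined'],
  pids_limit: 1000,
  mem_limit: '6g',
};

// Returns capabilities listed in both cap_add and cap_drop, which would be contradictory.
function overlappingCaps(service: HardenedService): string[] {
  return service.cap_add.filter((cap) => service.cap_drop.includes(cap));
}
```

A check like `overlappingCaps(agentHardening)` is the kind of sanity test that could run against each runtime profile once capability sets start to differ per runtime.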

Docker-Specific Dependencies

AWF is tightly coupled to Docker in these areas:

  1. Docker Compose v3+ — Orchestrates all containers (docker compose up/down/logs/wait)
  2. Docker CLI commands — docker inspect, docker logs, docker network create/rm, docker rm -f
  3. Docker bridge networking — Fixed subnet 172.30.0.0/24 with custom bridge name fw-bridge
  4. Docker socket — Optional DinD support via /var/run/docker.sock mount
  5. network_mode: service:agent — iptables-init shares agent's network namespace
  6. Docker healthchecks — Service dependency ordering (Squid → Agent → iptables-init)
  7. security_opt — Seccomp profile, no-new-privileges, AppArmor configuration
  8. tmpfs overlays — Hide sensitive files (docker-compose.yml, MCP logs)

Runtime Analysis

gVisor (runsc)

gVisor interposes a user-space kernel (the "Sentry") between the container and the host kernel, intercepting all syscalls. This provides stronger isolation than seccomp alone.

Compatibility Assessment

| AWF Feature | gVisor Support | Impact |
|---|---|---|
| iptables | ⚠️ Partial — Only supports featureset for Docker-in-gVisor. DNAT rules may not work. | CRITICAL — AWF's entire network security relies on iptables DNAT to Squid |
| chroot | ✅ Full support (syscall 161) | Compatible |
| Seccomp profiles | ⚠️ Redundant — gVisor already intercepts all syscalls. Seccomp applied to the Sentry, not the sandbox. | Need to verify AWF's seccomp profile doesn't conflict |
| Capabilities | capget/capset fully supported | Compatible |
| network_mode: service: | ⚠️ Requires all containers sharing a network namespace to use the same runtime | Must apply --runtime=runsc consistently |
| Bind mounts / volumes | ✅ Supported, but no block device filesystems (ext4 etc.) inside sandbox | AWF only uses bind mounts — compatible |
| tmpfs | ✅ Fully supported | Compatible |
| procfs | ✅ Supported (gVisor provides its own /proc) | May need adjustment — AWF mounts fresh /host/proc |
| Healthchecks | ✅ Supported via Docker | Compatible |
| Resource limits | ⚠️ cgroups for accounting only, not enforcement within sandbox | mem_limit and pids_limit may not be enforced |
| AppArmor | ⚠️ Not applicable inside gVisor sandbox | Harmless — already unconfined |
| OCI image format | ✅ Fully compatible | No image changes needed |

Critical Issue: iptables in gVisor

From gVisor docs: "iptables are only partially supported. The general goal is to support the featureset necessary to be able to run Docker in gVisor, but not necessarily further."

AWF's iptables rules in setup-iptables.sh include:

  • iptables -t nat -A OUTPUT ... -j DNAT --to-destination 172.30.0.10:3128 (redirect HTTP/HTTPS to Squid)
  • iptables -A OUTPUT ... -j DROP (block dangerous ports)
  • iptables -A OUTPUT ... -j LOG (audit logging)
  • ip6tables rules (IPv6 blocking)

If gVisor's iptables doesn't support DNAT rules, the entire AWF network security model breaks. Traffic would bypass Squid and reach the internet directly.

Mitigation Strategies for iptables

  1. Split runtimes (iptables-init on runc, agent on gVisor): Not workable. The iptables-init container shares the agent's network namespace via network_mode: service:agent, and containers that share a namespace must use the same runtime. iptables-init would therefore also have to run under runsc, where it only gets gVisor's partial iptables support instead of the full iptables it needs.

  2. Host-level network isolation instead: Move ALL iptables rules to the host's DOCKER-USER chain (AWF already does this partially in src/host-iptables-rules.ts). The host runs native Linux, so iptables always works. This would make the in-container iptables-init redundant when using gVisor.

  3. Use gVisor's network passthrough mode: Configure --network=host for runsc so gVisor uses the host network stack. But this defeats gVisor's network isolation benefits.

  4. Use gVisor's netstack with proxy env vars only: Rely entirely on HTTP_PROXY/HTTPS_PROXY env vars (which AWF already sets) plus gVisor's netstack isolation, dropping iptables DNAT as defense-in-depth. Acceptable if gVisor's syscall interposition is considered sufficient to prevent proxy bypass.

Docker SBX / Kata Containers

Kata Containers runs each container inside a lightweight VM (using QEMU, Cloud Hypervisor, or Firecracker). This provides VM-level isolation with OCI compatibility.

Compatibility Assessment

| AWF Feature | Kata Support | Impact |
|---|---|---|
| iptables | ✅ Full — runs a real Linux kernel inside the VM | Compatible |
| chroot | ✅ Full — real Linux kernel | Compatible |
| Seccomp | ✅ Applied inside the guest VM | Compatible |
| Capabilities | ✅ Full Linux capability model | Compatible |
| Network namespace sharing | ⚠️ Complex — each Kata container is a separate VM | CRITICAL — network_mode: service:agent won't work natively |
| Bind mounts | ⚠️ File sharing between host and VM uses virtio-fs or 9pfs — performance overhead | Works but slower I/O |
| tmpfs | ✅ Supported | Compatible |
| Resource limits | ✅ Enforced at VM level | Better isolation than cgroups |
| OCI image format | ✅ Fully compatible | No image changes needed |
| Docker socket | ⚠️ Mounting host Docker socket into a VM is complex | DinD support may break |

Critical Issue: Network Namespace Sharing

AWF's iptables-init pattern (network_mode: service:agent) requires containers to share a network namespace. Kata Containers run each container in a separate VM, making namespace sharing fundamentally incompatible.

Mitigation: Move iptables setup into the agent entrypoint itself (remove the separate init container) or use Kata's sandbox concept where multiple containers share a single VM.


Recommended Architecture

Option A: Runtime Abstraction Layer (Recommended)

Add a --container-runtime CLI flag that selects a runtime profile:

awf --container-runtime gvisor|kata|runc|auto ...

Each profile adjusts the security model:

| Aspect | runc (default) | gvisor | kata |
|---|---|---|---|
| Syscall filtering | Seccomp profile | gVisor Sentry (seccomp optional) | Seccomp inside VM |
| Network isolation | iptables DNAT + Squid proxy | Host iptables + Squid proxy (no in-container iptables) | iptables inside VM + Squid proxy |
| iptables-init | Separate init container | Removed — host-level rules only | Merged into agent entrypoint |
| network_mode: service: | Used | Not used (unnecessary) | Not used (same VM sandbox) |
| Capability grants | SYS_CHROOT, SYS_ADMIN | Minimal (gVisor handles isolation) | SYS_CHROOT, SYS_ADMIN |
| Runtime flag | (none) | runtime: runsc in compose | runtime: kata in compose |
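
One way to keep these per-runtime differences in a single place is a lookup table keyed by runtime. The sketch below assumes hypothetical names (`RuntimeProfile`, `RUNTIME_PROFILES`, and all field names); the values mirror the profile table, except that the gVisor capability list is left empty because the table only says "Minimal".

```typescript
type ContainerRuntime = 'runc' | 'gvisor' | 'kata';

// Hypothetical profile shape; field names are illustrative only.
interface RuntimeProfile {
  composeRuntime?: string;    // value for the compose `runtime:` field, if any
  useIptablesInit: boolean;   // whether the separate iptables-init container is generated
  useSeccompProfile: boolean; // whether AWF's seccomp profile is applied
  capAdd: string[];           // capabilities granted before being dropped for user code
}

const RUNTIME_PROFILES: Record<ContainerRuntime, RuntimeProfile> = {
  runc: {
    useIptablesInit: true,
    useSeccompProfile: true,
    capAdd: ['SYS_CHROOT', 'SYS_ADMIN'],
  },
  gvisor: {
    composeRuntime: 'runsc',
    useIptablesInit: false,   // host-level rules only
    useSeccompProfile: false, // "seccomp optional": modelled as off here
    capAdd: [],               // placeholder: the exact "minimal" set is undecided
  },
  kata: {
    composeRuntime: 'kata',
    useIptablesInit: false,   // merged into the agent entrypoint
    useSeccompProfile: true,
    capAdd: ['SYS_CHROOT', 'SYS_ADMIN'],
  },
};
```

Compose generation could then read `RUNTIME_PROFILES[config.containerRuntime]` rather than branching on the runtime name in several places.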

Implementation Changes Required

1. Docker Compose Generation (src/compose-generator.ts, src/services/agent-service.ts)

// Add runtime field to DockerService interface (src/types/docker.ts)
interface DockerService {
  // ... existing fields
  runtime?: string; // 'runsc', 'kata-runtime', etc.
}

// In compose generation, conditionally set:
if (config.containerRuntime === 'gvisor') {
  agentService.runtime = 'runsc';
  // Remove iptables-init service entirely
  // Remove network_mode: service:agent
  // Adjust security_opt (no seccomp — gVisor handles it)
}
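
The three commented steps in the gVisor branch could be fleshed out roughly as follows. This is a sketch under assumptions: `ServiceSketch`, `Services`, and `applyGvisor` are hypothetical names, the service names `agent` and `iptables-init` are taken from this issue, and the real generator in src/compose-generator.ts may be structured differently.

```typescript
// Hypothetical, trimmed-down service shape (not AWF's actual DockerService).
interface ServiceSketch {
  runtime?: string;
  network_mode?: string;
  security_opt?: string[];
}

type Services = Record<string, ServiceSketch>;

// Returns a new service map adjusted for gVisor; the input is not mutated.
function applyGvisor(services: Services): Services {
  const result: Services = {};
  for (const name of Object.keys(services)) {
    // Remove the iptables-init service entirely.
    if (name === 'iptables-init') continue;

    const svc: ServiceSketch = { ...services[name] };

    // Remove network_mode: service:agent from anything that still references it.
    if (svc.network_mode === 'service:agent') delete svc.network_mode;

    if (name === 'agent') {
      svc.runtime = 'runsc';
      // Drop seccomp entries; other security_opt values are kept.
      svc.security_opt = (svc.security_opt ?? []).filter(
        (opt) => !opt.startsWith('seccomp'),
      );
    }
    result[name] = svc;
  }
  return result;
}
```

Returning a copy keeps the runc path untouched, so the same base service map can feed every runtime profile.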

2. Network Security Refactoring

Move iptables rules to host level (src/host-iptables-rules.ts):

  • Current: Host rules in DOCKER-USER chain + container rules in iptables-init
  • Proposed: Host rules handle ALL filtering for gVisor/Kata; container iptables-init only for runc

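The host-level DOCKER-USER chain rules already exist and work regardless of container runtime. They need to be expanded to cover the DNAT-to-Squid functionality currently handled by the init container. Note that DOCKER-USER is a filter-table chain and the DNAT target is only valid in the nat table, so the redirect itself would need a nat-table rule (for example in PREROUTING) alongside the DOCKER-USER filtering.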
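
A host-only rule set might be generated along these lines. Everything here is a sketch: `HostRuleConfig` and `buildHostRules` are hypothetical names, the agent IP is an assumed input (this issue only gives the Squid address 172.30.0.10:3128), and placing the DNAT in the nat table's PREROUTING chain is an assumption that has not been tested against AWF's bridge setup.

```typescript
// Hypothetical config; only the Squid address and port come from this issue.
interface HostRuleConfig {
  agentIp: string;        // assumed: source address of the agent container
  squidIp: string;
  squidPort: number;
  blockedPorts: number[]; // e.g. SSH, SMTP
}

// Builds iptables argument lists (one array per rule) without executing anything.
function buildHostRules(cfg: HostRuleConfig): string[][] {
  const target = `${cfg.squidIp}:${cfg.squidPort}`;

  // Redirect HTTP/HTTPS from the agent to Squid (nat table).
  const redirects = [80, 443].map((port) => [
    '-t', 'nat', '-A', 'PREROUTING',
    '-s', cfg.agentIp, '-p', 'tcp', '--dport', String(port),
    '-j', 'DNAT', '--to-destination', target,
  ]);

  // Drop dangerous ports in the DOCKER-USER chain (filter table).
  const drops = cfg.blockedPorts.map((port) => [
    '-I', 'DOCKER-USER',
    '-s', cfg.agentIp, '-p', 'tcp', '--dport', String(port),
    '-j', 'DROP',
  ]);

  return [...redirects, ...drops];
}
```

Keeping rule construction separate from execution makes the generated rules easy to unit test and to log before they are applied.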

3. Volume Security

No changes needed — all runtimes support OCI bind mounts. For Kata:

  • Bind mounts use virtio-fs (transparent to AWF)
  • tmpfs overlays work inside the VM
  • Performance may be lower for heavy I/O workloads

4. Image Compatibility

No image changes needed. All AWF images are standard OCI images:

  • ubuntu:22.04 (agent)
  • ubuntu/squid:latest (Squid)
  • node:22-alpine (API proxy, CLI proxy)

All runtimes (runc, runsc, kata) consume OCI images identically.

5. CLI and Configuration

// src/cli-options.ts - new option
.option('--container-runtime <runtime>',
  'Container runtime to use (runc, gvisor, kata, auto)',
  'runc')

// src/types/runtime-options.ts
containerRuntime?: 'runc' | 'gvisor' | 'kata' | 'auto';

auto mode would detect available runtimes and select the most secure option.
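
Detection for auto mode could be based on the runtime names Docker reports. The sketch below assumes the input is the JSON printed by `docker info --format '{{json .Runtimes}}'` (an object keyed by runtime name); the function name `detectRuntime`, the name matching for Kata (any runtime name starting with `kata`), and the preference order kata, then gvisor, then runc are assumptions drawn from the escape-risk ranking in this issue.

```typescript
type ContainerRuntime = 'runc' | 'gvisor' | 'kata';

// Picks a runtime from the JSON object of registered Docker runtimes.
// Assumed preference: kata > gvisor > runc.
function detectRuntime(runtimesJson: string): ContainerRuntime {
  const registered = Object.keys(
    JSON.parse(runtimesJson) as Record<string, unknown>,
  );

  // Kata installs register under varying names; matching on the prefix is a guess.
  if (registered.some((name) => name.startsWith('kata'))) return 'kata';
  if (registered.includes('runsc')) return 'gvisor';
  return 'runc';
}
```

Because the function takes the JSON as a string, the Docker call itself stays outside and the selection logic can be tested without a Docker daemon.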


Security Analysis

Security Properties by Runtime

| Property | runc + AWF hardening | gVisor | Kata |
|---|---|---|---|
| Kernel exploit protection | ❌ Shares host kernel | ✅ User-space kernel (Sentry) | ✅ Separate guest kernel |
| Syscall filtering | Seccomp (350 allowed) | Sentry intercepts all (~277 implemented) | Seccomp + VM boundary |
| Network isolation | iptables + Squid L7 proxy | Netstack + Squid L7 proxy | VM network + iptables + Squid L7 proxy |
| Filesystem isolation | chroot + bind mounts | gVisor overlay FS + bind mounts | virtio-fs + bind mounts |
| Container escape risk | Medium (kernel shared) | Low (user-space kernel) | Very low (VM boundary) |
| Performance overhead | Baseline | ~10-30% syscall overhead | ~20-50% startup, I/O overhead |

Non-Negotiable Security Requirements

Regardless of runtime, these must be maintained:

  1. All HTTP/HTTPS traffic MUST route through Squid — Domain ACL enforcement is the core security guarantee
  2. Proxy env vars (HTTP_PROXY, HTTPS_PROXY) MUST be set — For proxy-aware tools
  3. Dangerous ports MUST be blocked — SSH, SMTP, databases, Redis, etc.
  4. DNS MUST be restricted — Only whitelisted DNS servers
  5. Sensitive paths MUST NOT be mounted — No /etc/shadow, no unwhitelisted home dirs
  6. Capabilities MUST be dropped before user code — No NET_ADMIN, SYS_ADMIN at user code time
  7. OCI image format — All images must work across all runtimes without modification
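
Some of these requirements can be checked mechanically against the generated configuration before any container starts. The sketch below covers only requirements 2 and 5, and only the /etc/shadow case of the latter; `AgentConfigSketch` and `checkInvariants` are hypothetical names, and volumes are assumed to use the `host:container[:mode]` string form.

```typescript
// Hypothetical, trimmed-down view of the generated agent service.
interface AgentConfigSketch {
  environment: Record<string, string>;
  volumes: string[]; // assumed "hostPath:containerPath[:mode]" strings
}

// Returns a list of human-readable violations; empty means the checks passed.
function checkInvariants(cfg: AgentConfigSketch): string[] {
  const violations: string[] = [];

  // Requirement 2: proxy env vars must be set.
  for (const name of ['HTTP_PROXY', 'HTTPS_PROXY']) {
    if (!cfg.environment[name]) violations.push(`${name} is not set`);
  }

  // Requirement 5 (partial): /etc/shadow must never be mounted.
  for (const volume of cfg.volumes) {
    const hostPath = volume.split(':')[0];
    if (hostPath === '/etc/shadow') {
      violations.push(`sensitive path mounted: ${volume}`);
    }
  }
  return violations;
}
```

Running the same checks for every runtime profile gives a cheap regression guard while the network rules are being moved around.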

Implementation Phases

Phase 1: Refactor iptables to support host-only mode

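  • Move DNAT rules from the iptables-init container to the host (filtering in the DOCKER-USER chain; the DNAT redirect in a nat-table chain, since DOCKER-USER belongs to the filter table)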
  • Keep iptables-init as optional (for backward compatibility with runc)
  • This unblocks gVisor support without requiring gVisor-specific iptables

Phase 2: Add --container-runtime flag

  • Add CLI option and config file support
  • Add runtime: field to Docker Compose generation
  • Conditionally skip iptables-init for non-runc runtimes

Phase 3: Runtime-specific security profiles

  • gVisor: Simplified seccomp (or none), rely on Sentry
  • Kata: Full seccomp inside VM, adjust resource limits
  • Validation: Ensure all security tests pass with each runtime

Phase 4: CI/CD integration

  • Add smoke tests for each supported runtime
  • GitHub Actions runners with gVisor/Kata pre-installed
  • Performance benchmarking across runtimes

Open Questions

  1. gVisor iptables DNAT support: Need to empirically test whether iptables -t nat -A OUTPUT -p tcp --dport 443 -j DNAT --to-destination 172.30.0.10:3128 works inside a gVisor sandbox. The docs say "partial support" but don't enumerate supported features.

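  2. gVisor + Docker Compose runtime: field: The legacy Compose file format v3 doesn't have a runtime: field; it was only available in the v2 file format (2.3 and later). The current Compose Specification used by the docker compose plugin does accept runtime:. Need to verify compatibility with the Compose version AWF targets, or use docker run --runtime=runsc directly instead of Compose.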

  3. Kata + network namespace sharing: Can Kata's "sandbox" concept (multiple containers in one VM) replace network_mode: service:agent? Need to test.

  4. Performance impact: What's the real-world performance overhead for typical AWF workloads (npm install, git clone, curl) under each runtime?

  5. Docker SBX: "Docker SBX" appears to refer to Docker's sandbox mode using gVisor internally. Need to clarify whether this is a distinct product or just Docker + gVisor.
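
For the first open question, a probe command could be assembled like this. It is a sketch only: `buildDnatProbe` is a hypothetical helper, the image argument must be some image that ships iptables (none is named in this issue), and granting NET_ADMIN is assumed to be needed for rule insertion. A zero exit status would show that gVisor accepted the rule, not that packets are actually rewritten.

```typescript
// Builds the argv for a one-off container that tries to install the DNAT rule
// under runsc and then lists the nat OUTPUT chain. Nothing is executed here.
function buildDnatProbe(image: string): string[] {
  const rule =
    'iptables -t nat -A OUTPUT -p tcp --dport 443 ' +
    '-j DNAT --to-destination 172.30.0.10:3128';

  return [
    'docker', 'run', '--rm',
    '--runtime=runsc',
    '--cap-add=NET_ADMIN',
    image,
    'sh', '-c', `${rule} && iptables -t nat -S OUTPUT`,
  ];
}
```

The resulting array can be passed to a process runner, with the exit code and printed rules recorded per gVisor version.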
