Claude provider always returns empty transcript: guest container runs as root, claude CLI rejects --permission-mode bypassPermissions, error is silently swallowed #22

@helei-ai

Summary

When running an agent with provider claude in deploy mode, the run finishes with status succeeded but transcript, finalText, sessionId and stderr are all empty. duration_ms is around 600.

The real root cause is:

  • agent-compose starts the guest container as root by default.
  • ClaudeRunner asks the SDK to use permissionMode = bypassPermissions, which the SDK forwards as the CLI flag --permission-mode bypassPermissions (equivalent to --dangerously-skip-permissions).
  • The bundled claude CLI refuses to run with that flag under root, exits with code 1, and prints the error only to stderr.
  • The SDK spawns the child with stderr = ignore unless DEBUG_CLAUDE_AGENT_SDK is set, so that stderr message is dropped.
  • The SDK then surfaces "Claude Code process exited with code 1" to the runner, but ClaudeRunner swallows the error in finally and returns an empty AgentResult.

Net result: every claude run silently produces an empty transcript on a default agent-compose deploy, with no diagnostic output anywhere.

This issue replaces my earlier description that focused on pathToClaudeCodeExecutable. After deeper investigation, the path argument is not the trigger; running as root + bypassPermissions is.

Environment

  • agent-compose: latest main, version v2606.2.0
  • deploy mode (docker compose), Linux aarch64 guest container
  • guest image: agent-compose-guest:latest, default USER is root
  • LLM gateway: Anthropic compatible HTTP gateway (verified independently with curl /v1/messages returning a valid response)

Reproduction

agent-compose.yml content:

name: hello-llm
variables:
  LLM_API_KEY: { value: ${LLM_API_KEY}, secret: true }
  LLM_API_ENDPOINT: { value: ${LLM_API_ENDPOINT}, secret: true }
  LLM_MODEL: { value: ${LLM_MODEL} }
agents:
  reviewer:
    provider: claude
    image: agent-compose-guest:latest
    driver: { docker: {} }
    env:
      LLM_API_KEY: { value: ${LLM_API_KEY}, secret: true }
      LLM_API_ENDPOINT: { value: ${LLM_API_ENDPOINT}, secret: true }
      LLM_MODEL: { value: ${LLM_MODEL} }
network: { mode: default }

Commands:

agent-compose --file /data/agent-compose.yml up
agent-compose --file /data/agent-compose.yml run reviewer --prompt "Reply with single word: pong"

Result:

  • status = succeeded
  • duration_ms is around 600
  • finalText, transcript, sessionId and stderr are all empty strings

Evidence (proven by direct experiments)

  1. The gateway itself is fine. A direct curl to /v1/messages returns the expected pong.

  2. Inside the guest container running as root, invoking the CLI manually fails the same way:

    /usr/lib/node_modules/agent-compose-runtime-js/node_modules/@anthropic-ai/claude-agent-sdk-linux-arm64/claude
    --output-format stream-json --verbose --input-format stream-json
    --permission-mode bypassPermissions --model glm-5.2

    stderr:

    --dangerously-skip-permissions cannot be used with root/sudo privileges for security reasons

    exit code: 1

  3. The same binary works as a non-root user:

    docker run --rm -u 1000:1000 ... agent-compose-guest:latest
    claude --print --permission-mode bypassPermissions --model glm-5.2 "Reply pong"

    prints: pong (exit 0)

  4. The same binary works under root if bypassPermissions is not used:

    claude --print --model glm-5.2 "Reply pong"

    prints: pong (exit 0)

  5. The agent-compose docker driver does not set HostConfig.User. pkg/driver/docker_runtime.go contains no User assignment, so the guest container always runs as the image's default USER, which is root in the published agent-compose-guest image.

  6. ClaudeRunner.queryOptions in runtime/javascript/src/runners/claude.ts unconditionally requests:

    permissionMode: "bypassPermissions",
    allowDangerouslySkipPermissions: true,

    so the bypass flag is always sent.

  7. The SDK process transport spawns the child with stdio = ["pipe","pipe", DEBUG_CLAUDE_AGENT_SDK || options.stderr ? "pipe" : "ignore"]. ClaudeRunner does not pass options.stderr and does not set DEBUG_CLAUDE_AGENT_SDK, so the actual stderr line ("--dangerously-skip-permissions cannot be used with root...") is dropped.

  8. ClaudeRunner.runPrompt wraps the SDK stream in a try / for-await / finally block with no explicit catch. The SDK throws "Claude Code process exited with code 1" from the iterator, and finally only calls stream.close(). The error is never recorded on AgentResult, so finalText, transcript and stderr stay empty and the run is reported as succeeded.
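
To make item 7 concrete, here is a minimal sketch of how that stdio expression resolves. The helper name stderrMode and the StderrCallback type are made up for illustration and are not SDK API; only the shape of the condition comes from the expression quoted above.

```typescript
// Hypothetical helper mirroring the stdio expression quoted in item 7.
// stderrMode, debugEnv and StderrCallback are illustrative names only.
type StderrCallback = (chunk: string) => void;

function stderrMode(
  debugEnv: string | undefined,
  stderrCallback: StderrCallback | undefined,
): "pipe" | "ignore" {
  // `a || b ? x : y` parses as `(a || b) ? x : y`, so either a non-empty
  // debug variable or a provided callback switches child stderr to a pipe.
  return debugEnv || stderrCallback ? "pipe" : "ignore";
}
```

With neither input set, which is the ClaudeRunner case described above, the result is "ignore" and the CLI's error line has nowhere to go.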

Root cause

Three independent layers combine to produce a silent failure:

  1. Default privilege mismatch: guest container runs as root.
  2. Forced unsafe flag: ClaudeRunner forces permissionMode = bypassPermissions, which the upstream claude CLI explicitly refuses to honor under root.
  3. Silent swallow: SDK ignores child stderr by default; ClaudeRunner swallows the resulting iterator error in finally.

Any single one of these would be tolerable. Together they produce a "succeeded with empty output" state that is extremely hard to diagnose.

Impact

  • Every provider: claude agent on a default deploy returns empty results.
  • status = succeeded, stderr = "", no log line, no exit reason exposed to the user.
  • All failed runs share the exact same fingerprint: duration_ms around 600, transcript empty.
  • This effectively blocks adopting agent-compose for claude-based agents on a stock install, while looking like a configuration issue to the user.
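
Until a fix lands, callers could flag this fingerprint themselves. The sketch below is only an illustration: the RunSummary shape is assumed from the field names in this report (status, finalText, transcript, sessionId, stderr), not taken from the agent-compose source, and looksLikeSilentFailure is a hypothetical helper.

```typescript
// Assumed shape, based only on the fields named in this report.
interface RunSummary {
  status: string;
  finalText: string;
  transcript: string;
  sessionId: string;
  stderr: string;
}

// Returns true for the "succeeded with empty output" fingerprint:
// a successful status with every output field empty.
function looksLikeSilentFailure(run: RunSummary): boolean {
  return (
    run.status === "succeeded" &&
    run.finalText === "" &&
    run.transcript === "" &&
    run.sessionId === "" &&
    run.stderr === ""
  );
}
```

I left duration_ms out of the check on purpose, since the 600 ms figure is an observation from my setup and not a reliable threshold.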

Suggested fixes

Pick at least one, ideally a combination:

  1. Stop running guest containers as root.

    • Add a non-root user (e.g. uid 1000) to the agent-compose-guest image and set USER to it, or
    • Make the docker driver set HostConfig.User to a non-root uid by default and expose it as a config knob.
  2. Stop forcing bypassPermissions in ClaudeRunner.

    • Make permissionMode configurable via agent spec, defaulting to a mode that also works when the process runs as root.
      (Evidence item 4 shows the CLI runs as root when bypassPermissions is not requested.)
    • Or detect uid == 0 at runtime and fall back to a safer permission mode.
  3. Surface SDK errors.

    • In ClaudeRunner, catch errors from the for-await loop and write them to AgentResult.stderr (and mark stopReason accordingly).
    • Optionally pass a stderr callback to the SDK query so child stderr is captured into AgentResult.stderr instead of being dropped.
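
A minimal sketch of what fix (3) could look like, assuming a simplified AgentResult and a stream of plain strings. The real types in agent-compose and the SDK are richer, and collectRun and failingStream are hypothetical names. The real runner would also keep its finally block with stream.close().

```typescript
// Hypothetical, simplified shape; the real AgentResult is not shown here.
interface AgentResult {
  finalText: string;
  transcript: string;
  stderr: string;
  stopReason: "completed" | "error";
}

async function collectRun(stream: AsyncIterable<string>): Promise<AgentResult> {
  const result: AgentResult = {
    finalText: "",
    transcript: "",
    stderr: "",
    stopReason: "completed",
  };
  try {
    for await (const message of stream) {
      result.transcript += message + "\n";
      result.finalText = message; // last message wins in this sketch
    }
  } catch (err) {
    // Record the iterator error instead of dropping it.
    result.stderr = err instanceof Error ? err.message : String(err);
    result.stopReason = "error";
  }
  return result;
}

// Stand-in for an SDK iterator that fails before yielding anything.
async function* failingStream(): AsyncGenerator<string> {
  throw new Error("Claude Code process exited with code 1");
}
```

With this shape, a run against failingStream() ends with stopReason "error" and the exit message in stderr, so the caller can report failed and not succeeded.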

Doing only (3) already turns this from a silent failure into a diagnosable one. Doing (1) or (2) actually unblocks claude runs on a default deploy.
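
For the runtime fallback in (2), a rough sketch follows. choosePermissionMode is a hypothetical helper, and using "default" as the fallback mode is my assumption; maintainers may prefer a different mode, and tools that need approval may then be blocked in a non-interactive run.

```typescript
// Only the two modes relevant to this issue are modelled here.
type PermissionMode = "default" | "bypassPermissions";

// uid is passed in so the helper stays testable; in Node the caller could
// obtain it from process.getuid?.(), which is undefined on some platforms.
function choosePermissionMode(uid: number | undefined): PermissionMode {
  // uid 0 is root, where the CLI rejects bypassPermissions.
  return uid === 0 ? "default" : "bypassPermissions";
}
```

This keeps today's behaviour for non-root users and only changes the mode in the case that currently fails.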

Notes

  • LLM_API_KEY -> ANTHROPIC_API_KEY and LLM_API_ENDPOINT -> ANTHROPIC_BASE_URL bridging in ClaudeRunner already works.
  • LLM_MODEL is not bridged to ANTHROPIC_MODEL, but model is also passed via SDK options when set, so this is not the trigger of the empty-transcript symptom.
  • The codex runner has a separate, related gap (no LLM_* to OPENAI_* bridging). That is worth tracking in a different issue.
  • Earlier in my investigation I attributed the failure to pathToClaudeCodeExecutable pointing at /usr/bin/claude. That observation was a side effect: under root, both the path provided by the runner and the SDK's bundled native binary fail the same way for the same reason (bypassPermissions under root). This issue supersedes that earlier description.
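
For reference, the bridging described in the first two notes could be sketched like this. bridgeLlmEnv is a hypothetical helper, not the actual ClaudeRunner code, and the LLM_MODEL to ANTHROPIC_MODEL line is the mapping that does not exist today.

```typescript
// Hypothetical sketch of LLM_* to ANTHROPIC_* bridging.
function bridgeLlmEnv(
  env: Record<string, string | undefined>,
): Record<string, string> {
  const out: Record<string, string> = {};
  const key = env["LLM_API_KEY"];
  const endpoint = env["LLM_API_ENDPOINT"];
  const model = env["LLM_MODEL"];
  if (key) out["ANTHROPIC_API_KEY"] = key;
  if (endpoint) out["ANTHROPIC_BASE_URL"] = endpoint;
  // Not bridged today according to the note above; shown as a possible addition.
  if (model) out["ANTHROPIC_MODEL"] = model;
  return out;
}
```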
