fix(fs): cross-worker snapshot loading for session fork#172

Merged
r33drichards merged 3 commits into main from fix/fork-cross-worker-snapshot-loading on Jun 16, 2026

@r33drichards

Owner

Loading a heap + filesystem snapshot created by a different worker failed in three ways. This is the session-fork case: read a source session's snapshots, then seed a fresh worker from them. This PR makes it work: a worker can restore another worker's heap and fs from the shared S3 blob store.

Fixes

  1. Isolate runtime. execute_module drove the V8 event loop on the ambient multi-thread runtime. deno_core's op driver schedules every pending async op via deno_unsync::spawn, which asserts a current-thread runtime; any op that stayed pending (e.g. an fs blob fetched from S3 on a cold cache) aborted the process. The first commit moved the isolate onto a dedicated current-thread runtime; the third commit reverted that change because it deadlocked mcp.callTool, and relies on the prefetch in fix 3 so that no op stays pending on remote I/O.

  2. S3 client runtime affinity. The S3 client's connection pool / IO reactor lives on the runtime it was built on (the server's main runtime). Calls issued from the isolate's current-thread runtime never progressed. S3HeapStorage now captures a handle to that runtime and dispatches every call onto it.

  3. Lazy in-op fs fetch. fs reads pull chunks on demand from inside the op, on the isolate runtime, which cannot await the blob backend's remote I/O. Added HeapStorage::warm/contains and FsStore::prefetch; build_fs_mount warms the mounted tree's blobs into the node-local cache on the main runtime before the isolate runs, so in-op reads are pure local-cache hits. warm() skips already-local blobs cheaply, so same-worker sessions are unaffected.
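
The warm-then-read pattern in fix 3 can be sketched roughly as below. This is a minimal, std-only illustration and not the code in this PR: `RemoteStore`, `LocalCache`, and the signatures of `contains`, `warm`, and `read` are hypothetical stand-ins (the PR's own items are HeapStorage::warm/contains and FsStore::prefetch, whose signatures are not shown here), the remote store is an in-memory map rather than S3, and the real calls are presumably async.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the shared remote blob store (S3 in the PR).
struct RemoteStore {
    blobs: HashMap<String, Vec<u8>>,
    fetches: usize, // counts remote round-trips, for illustration only
}

impl RemoteStore {
    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        self.fetches += 1;
        self.blobs.get(key).cloned()
    }
}

/// Hypothetical node-local cache that in-op reads are served from.
#[derive(Default)]
struct LocalCache {
    blobs: HashMap<String, Vec<u8>>,
}

impl LocalCache {
    fn contains(&self, key: &str) -> bool {
        self.blobs.contains_key(key)
    }

    /// Copy every missing blob into the local cache before the isolate runs.
    /// Returns how many blobs had to be fetched remotely.
    fn warm(&mut self, remote: &mut RemoteStore, keys: &[&str]) -> Result<usize, String> {
        let mut fetched = 0;
        for &key in keys {
            if self.contains(key) {
                continue; // already local: no remote call at all
            }
            let bytes = remote
                .get(key)
                .ok_or_else(|| format!("missing blob: {key}"))?;
            self.blobs.insert(key.to_string(), bytes);
            fetched += 1;
        }
        Ok(fetched)
    }

    /// Purely local read; never touches the remote store.
    fn read(&self, key: &str) -> Option<&[u8]> {
        self.blobs.get(key).map(Vec::as_slice)
    }
}
```

Calling `warm` a second time with the same keys performs no remote fetches, which mirrors the claim that same-worker sessions (whose blobs are already local) are unaffected.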

Verification

Coordinator + 2 learners + MinIO. A learner restores another learner's heap and fs from S3; a fork accumulates its own changes across runs (VAR=heap-7+mod FILE=fs-7+more on re-run) while the source is unchanged (VAR=heap-7 FILE=fs-7) — copy-on-write isolation. 99 lib tests pass.
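
The copy-on-write behaviour observed above (the fork diverges, the source stays put) follows naturally from a content-addressed layout. The sketch below is a hedged illustration under assumptions, not the project's storage code: `Blobs` and `Snapshot` are hypothetical types, and `DefaultHasher` stands in for a real content hash only to keep the example std-only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

/// Blob store keyed by content hash. DefaultHasher is a stand-in for a real
/// cryptographic content hash; it is used only to keep the sketch std-only.
#[derive(Default)]
struct Blobs {
    by_hash: HashMap<u64, Vec<u8>>,
}

impl Blobs {
    /// Store `bytes` (if not already present) and return its content id.
    fn put(&mut self, bytes: &[u8]) -> u64 {
        let mut hasher = DefaultHasher::new();
        bytes.hash(&mut hasher);
        let id = hasher.finish();
        self.by_hash.entry(id).or_insert_with(|| bytes.to_vec());
        id
    }

    fn get(&self, id: u64) -> Option<&[u8]> {
        self.by_hash.get(&id).map(Vec::as_slice)
    }
}

/// A snapshot is an immutable manifest: path -> blob id.
#[derive(Clone, Default)]
struct Snapshot {
    files: BTreeMap<String, u64>,
}

impl Snapshot {
    /// Returns a new snapshot with `path` set; `self` is left untouched.
    fn with_file(&self, blobs: &mut Blobs, path: &str, bytes: &[u8]) -> Snapshot {
        let mut next = self.clone();
        next.files.insert(path.to_string(), blobs.put(bytes));
        next
    }

    fn read<'a>(&self, blobs: &'a Blobs, path: &str) -> Option<&'a [u8]> {
        blobs.get(*self.files.get(path)?)
    }
}
```

Forking is just cloning the manifest: a write in the fork adds a new blob and repoints only the fork's entry, so the source snapshot keeps resolving to its original bytes.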

Note on debug builds

A debug-only assertion in the deno_core fork's serialize_for_snapshotting (by_name.len() == handles.len()) fires when re-snapshotting a cross-loaded isolate. It is compiled out in release, and the resulting snapshot is valid (verified end-to-end on a release build), so production is unaffected. Run a release binary for local fork testing, or relax that assertion in the fork for debug builds.
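
For readers unfamiliar with why the check disappears in release, here is a small sketch of how debug-only assertions behave in Rust. It assumes the check is written with the standard `debug_assert*` family (the description above only says it is debug-only), and `check_lengths` is a hypothetical stand-in, not the function in the deno_core fork.

```rust
/// Hypothetical stand-in for the length invariant described above.
/// Returns whether the invariant holds so callers can inspect it in any build.
fn check_lengths(by_name_len: usize, handles_len: usize) -> bool {
    // Checked only when `debug_assertions` is on (the default for debug
    // builds); in release builds the check is skipped entirely.
    debug_assert_eq!(by_name_len, handles_len, "by_name and handles differ");
    by_name_len == handles_len
}

/// True when this binary was compiled with debug assertions enabled.
fn debug_assertions_enabled() -> bool {
    cfg!(debug_assertions)
}
```

In a debug build, `check_lengths(3, 4)` panics at the `debug_assert_eq!`; in a release build it simply returns `false`.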

🤖 Generated with Claude Code

r33drichards and others added 3 commits June 16, 2026 00:14
Loading a heap + filesystem snapshot created by a *different* worker (the
session-fork case: read source snapshots, seed a new worker) hit three issues.
Fixed so a worker can restore another worker's heap and fs from shared S3.

1. Isolate runtime. execute_module drove the V8 event loop on the ambient
   multi-thread runtime. deno_core's op driver schedules every pending async op
   via deno_unsync::spawn, which asserts a current-thread runtime; any op that
   stayed pending (e.g. an fs blob fetched from S3 on a cold cache) aborted the
   process. The isolate now always runs on a dedicated current-thread runtime.

2. S3 client affinity. The S3 client's connection pool / IO reactor lives on the
   runtime it was built on (the server's main runtime). Calls issued from the
   isolate's current-thread runtime never progressed. S3HeapStorage now captures
   that runtime handle and dispatches every call onto it.

3. Lazy in-op fs fetch. fs reads pull chunks on demand from inside the op, on
   the isolate runtime, which cannot await the blob backend's remote I/O. Added
   HeapStorage::warm/contains and FsStore::prefetch; build_fs_mount now warms the
   mounted tree's blobs into the node-local cache on the main runtime before the
   isolate runs, so in-op reads are pure local-cache hits. warm() skips
   already-local blobs cheaply, so same-worker sessions are unaffected.
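
As an illustration of item 2 only: the commit dispatches S3 calls onto the runtime that owns the client. The sketch below swaps in a different, std-only technique (a dedicated owner thread plus channels) to show the same ownership idea; `StoreHandle` and `Job` are hypothetical names, the "client" is a plain map, and none of this is the S3HeapStorage code.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread;

/// A request key plus a one-shot channel for the reply.
type Job = (String, Sender<Option<Vec<u8>>>);

/// Cloneable handle that forwards every call to the thread that owns the client.
#[derive(Clone)]
struct StoreHandle {
    jobs: Sender<Job>,
}

impl StoreHandle {
    /// Spawn the owner thread; the "client" (here a plain map) never leaves it.
    fn spawn(blobs: HashMap<String, Vec<u8>>) -> StoreHandle {
        let (jobs, inbox) = mpsc::channel::<Job>();
        thread::spawn(move || {
            // Runs until every StoreHandle clone has been dropped.
            for (key, reply) in inbox {
                // Ignore the error if the caller stopped waiting.
                let _ = reply.send(blobs.get(&key).cloned());
            }
        });
        StoreHandle { jobs }
    }

    /// Callable from any thread: dispatch to the owner and wait for the answer.
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        let (reply, answer) = mpsc::channel();
        self.jobs.send((key.to_string(), reply)).ok()?;
        answer.recv().ok()?
    }
}
```

The caller never drives the client's I/O itself; it only hands work to the owner and waits, which is the property the fix relies on.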

Verified on a coordinator + 2 learners + MinIO: a learner restores another
learner's heap and fs from S3; a fork accumulates its own changes across runs
while the source is unchanged (copy-on-write isolation).

Note: a debug-only assertion in the deno_core fork's serialize_for_snapshotting
fires when re-snapshotting a cross-loaded isolate. It is compiled out in release
(the snapshot is valid — verified end-to-end on a release build), so production
is unaffected; run a release binary for local fork testing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…O) setup

PR #171 added a configurable S3 endpoint + path-style addressing so the S3
backend can target S3-compatible stores, but only AWS_ENDPOINT_URL was
documented. Document AWS_S3_FORCE_PATH_STYLE (required by MinIO et al.), add a
"Use an S3-compatible store" how-to, and correct the client-init description
(no longer plain load_from_env).
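
A rough sketch of how the two variables named above might be read is shown here. It is an assumption-laden illustration, not the project's client-init code: `S3Settings` and `s3_settings_from` are hypothetical, and the accepted truthy spellings ("true" or "1") are a guess, since the commit message does not list them. Path-style addressing puts the bucket in the URL path (`http://host/bucket/key`) instead of in the hostname.

```rust
/// Hypothetical settings struct; field names are illustrative.
#[derive(Debug, PartialEq)]
struct S3Settings {
    endpoint_url: Option<String>,
    force_path_style: bool,
}

/// Build settings from a lookup function. Pass `|k| std::env::var(k).ok()`
/// to read the real process environment.
fn s3_settings_from(lookup: impl Fn(&str) -> Option<String>) -> S3Settings {
    // An unset or blank endpoint means "use the default AWS endpoint".
    let endpoint_url = lookup("AWS_ENDPOINT_URL").filter(|v| !v.trim().is_empty());
    // Assumption: "true" or "1" (case-insensitive) enable path-style addressing.
    let force_path_style = lookup("AWS_S3_FORCE_PATH_STYLE")
        .map(|v| matches!(v.trim().to_ascii_lowercase().as_str(), "true" | "1"))
        .unwrap_or(false);
    S3Settings { endpoint_url, force_path_style }
}
```

Taking a lookup closure rather than reading `std::env` directly keeps the parsing testable without mutating process-wide state.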

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ss-worker fs

Reverts the per-execution current-thread runtime change. Running the isolate on
its own runtime broke async ops that await resources bound to the server's main
runtime: notably, mcp.callTool (child MCP clients) deadlocked, which hung the
MCP Tool Calling E2E. The prefetch (build_fs_mount warms the mounted tree into
the local cache on the main runtime) already makes in-op fs reads local, so no
pending remote I/O happens inside the isolate and the original deno_unsync abort
no longer triggers, all without changing the isolate's runtime.

Verified: mcp.callTool round-trips again, and cross-worker fork still restores
heap + fs (VAR=heap-7 FILE=fs-7) on a release build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@r33drichards r33drichards merged commit 34fa496 into main Jun 16, 2026
13 of 22 checks passed
r33drichards added a commit to r33drichards/open-agents that referenced this pull request Jun 16, 2026
Add a "Duplicate" action to mcp-js sessions that creates a new session seeded
from the source's V8 heap AND content-addressed filesystem, so the fork starts
with the source's accumulated state and then diverges copy-on-write.

- API: POST /api/sessions/[id]/fork reads the source's latest heap+fs snapshot
  ids from its running worker and creates a new session carrying them as a
  `forkSource` marker; optional `copyMessages` clones the source's latest chat
  history.
- Provisioning: buildMcpJsSandboxState seeds a forked worker from forkSource
  once (a no-op run mounting the source heap+fs), then clears the marker so
  later restores don't reset the fork. baseUrl is left empty initially so
  isSandboxActive doesn't skip provisioning.
- DB: sessions.parentSessionId (fork lineage) + forkSessionWithChat; the sidebar
  list now carries a lightweight sandboxType (from sandbox_state JSON).
- UI: a Duplicate dropdown (sandbox-only / with chat history), shown only for
  mcp-js sessions, wired through useSessions.duplicateSession.
- mcp-js client: run_js now sends/returns `fs`; McpJsState gains `forkSource`;
  McpJsSandbox.getState() now includes the `type` discriminant (it was dropping
  it, corrupting the persisted sandbox state).

Requires the mcp-js fork fixes (r33drichards/mcp-js#172) for cross-worker
snapshot loading; verified end-to-end against a local cluster + MinIO (release
binary): a fork inherits the source's heap+fs, accumulates its own changes
across runs, and the source is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>