Skip to content

docs(reborn): subagent thread harness — foundational design + PR0/PR1 plan#5176

Open
henrypark133 wants to merge 1 commit into
mainfrom
docs/subagent-thread-harness-design
Open

docs(reborn): subagent thread harness — foundational design + PR0/PR1 plan#5176
henrypark133 wants to merge 1 commit into
mainfrom
docs/subagent-thread-harness-design

Conversation

@henrypark133

Copy link
Copy Markdown
Collaborator

Summary

Docs-only. Foundational design + implementation plan for the subagent thread harness (Reborn). Reframes subagents from one-shot computations into first-class, addressable, resumable threads that both the parent agent and a human can attach to — driven by four product requirements:

  1. Parent inspects a child's work anytime.
  2. Human opens a child's thread in the console and continues conversing.
  3. Parent re-activates a finished child with an extension task (no cold start).
  4. Clean-slate, best-in-class harness (informed by a survey of Claude Code/Codex, LangGraph, Temporal, Devin, OpenAI Assistants).

Supersedes the 2026-06-08 subagent-durability sub-spec (removed here) and closes #4656: the SQL gate-resolution stack (3 tables + settlement log + 2-phase ledger + bucketed counter + replay reconciler, ~2,700 LOC) is replaced by a single ScopedFilesystem await-edge file per parent-waits-on-child, with the child's own run record as the source of truth and CAS-based idempotent delivery (no ledger). terminal becomes per-run; threads park and re-activate.

What's here

  • docs/superpowers/specs/2026-06-23-subagent-thread-harness-design.md — the single source of truth for the durable delivery/thread design (hardened over 3 adversarial review rounds).
  • docs/superpowers/plans/2026-06-23-subagent-thread-harness-pr0-pr1.md — bite-sized TDD implementation plan for PR 0 (prereqs) + PR 1 (durable await-edge walking skeleton, blocking-only), converged over 7 thermo-nuclear review rounds (clean: no BLOCKER/MAJOR).
  • Removes the superseded durability sub-spec. Live-code docs (subagent-spawn, compaction, planner) intentionally kept.

Key design decisions

  • Child = first-class thread + lineage; terminal is per-run, thread parks and re-activates via one activate(thread, input, provenance) primitive.
  • Delivery = one filesystem await-edge file; lifecycle open → settled → drained (+ abandoned for parent-dies-first), CAS transitions = idempotency (no ledger).
  • Blocking reuses TurnStatus::BlockedDependentRun with a real await-edge GateRef (no new status variant) — a subagent wait reads as an ordinary dependent-run wait.
  • Safety: child→parent content is framed-as-untrusted + capped (no semantic scan); approvals always bubble to a human (no LLM auto-approves) — the catastrophe backstop.
  • Reborn-only; everything gated behind subagent.v2_enabled (default off).

Rollout

Vertical walking skeleton: PR 0 (flag, helper visibility, config) → PR 1 (await-edge store + resolver + crash-recovery proof + flag-gated blocking delivery) → later PRs add inspect / extend / human-console / cancel.

No code changes in this PR.

🤖 Generated with Claude Code

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5176 June 24, 2026 05:10 Destroyed
@github-actions github-actions Bot added scope: docs Documentation size: XL 500+ changed lines risk: low Changes to docs, tests, or low-risk modules labels Jun 24, 2026
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • Documentation
    • Added a new design spec for durable subagent threads, including the updated lifecycle, single-active-run behavior, and restart recovery flow.
    • Added an implementation plan for a flag-gated rollout of durable blocking support, with clear phases for enabling persistence, recovery, and end-to-end validation.
    • Clarified the expected behavior for edge settlement, resume handling, and boot-time recovery so the new subagent experience is defined end to end.

Walkthrough

Adds two new documentation files: a foundational design spec (2026-06-23-subagent-thread-harness-design.md) superseding prior subagent durability specs and PR #4656, and an implementation plan (2026-06-23-subagent-thread-harness-pr0-pr1.md) decomposing the work into PR 0 (flag plumbing) and PR 1 (FilesystemAwaitEdgeStore, resolver, and runtime wiring). No source code is changed.

Changes

Durable subagent thread harness

Layer / File(s) Summary
Thread-first model, await-edge semantics, and blocking wake
docs/superpowers/specs/2026-06-23-subagent-thread-harness-design.md
Establishes scope and superseded artifacts; ratifies core semantic decisions (terminal-per-run, activate reactivation, one-run-per-thread, CAS idempotency, depth floor, subagent.v2_enabled); introduces two-layer architecture with blocking/background delivery semantics; specifies GateRef-based TurnStatus::BlockedDependentRun reuse and boot-recovery for blocking parents.
Thread lifecycle, capability surfaces, and durability storage detail
docs/superpowers/specs/2026-06-23-subagent-thread-harness-design.md
Defines Active/Parked/Archived lifecycle and reactivation invariants; specifies inspect/extend/spawn/cancel tool surfaces; details ScopedFilesystem layout, CAS state machine (open→settled→drained/abandoned), resolver settlement/recovery algorithm, capacity enforcement, and explicit delta from PR #4656.
Safety/trust model, rollout plan, test plan, and references
docs/superpowers/specs/2026-06-23-subagent-thread-harness-design.md
Defines gate-by-gate approval-bubbling to a human surface; lists open verification items; provides sequential vertical-slice rollout steps behind subagent.v2_enabled; enumerates PR-by-PR acceptance criteria; collects references to superseded specs and relevant crates.
PR 0 — flag plumbing and wrap_untrusted_subagent_text visibility
docs/superpowers/plans/2026-06-23-subagent-thread-harness-pr0-pr1.md
Defines global constraints (durable CAS await-edge, no production unwrap/expect, strict flag-gating); enumerates PR 0 file changes: pub(crate) on wrap_untrusted_subagent_text, subagent.v2_enabled TOML/CLI/config/builder wiring, and builder/default tests.
PR 1 tasks 1.1–1.3 — await-edge types, FilesystemAwaitEdgeStore, and durability tests
docs/superpowers/plans/2026-06-23-subagent-thread-harness-pr0-pr1.md
Defines await-edge/gate-ref types; FilesystemAwaitEdgeStore interface/semantics (open, get, CAS transitions, unclosed-edge listing, boot-recovery enumeration); durability/idempotency and scope-isolation integration tests; spawn depth floor regression test.
PR 1 tasks 1.4–1.6 — resolver, settler, spawn hook, observer, and runtime wiring
docs/superpowers/plans/2026-06-23-subagent-thread-harness-pr0-pr1.md
Specifies AwaitEdgeResolver and object-safe AwaitEdgeSettler; flag-gated await-edge write at spawn with parent blocking via GateRef; completion observer settle→resume→drain sequence with crash-recovery ordering constraints; composition/runtime wiring; boot-time recovery routine; end-to-end integration test expectations (flag-on delivery, restart delivery, flag-off no-write).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related issues

  • #4656 — This spec explicitly supersedes the durable gate resolution store from PR #4656, replacing its SQL-backed DurableSubagentGateResolutionStore with a ScopedFilesystem CAS await-edge and eliminating the settlement log, capacity-counter buckets, and gate/result indirections defined there.
  • #4474 — Both cover the same durable subagent completion-delivery machinery (await-edge/state persistence, terminal-event settlement, parent resume/recovery), making this a direct design successor.

Poem

An edge is opened, a parent waits in place,
CAS transitions guard each file with grace.
Settled, drained — the thread wakes again,
Boot recovery rescues what the crash couldn't chain.
v2_enabled: false keeps the old path clean. 🦀

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning This PR only adds docs and plans; it does not deliver the durable store, migration, or reconciler code required by #4656. Implement the gate-resolution store, capacity counter, migration, and security tests specified in #4656, or split the docs into a separate non-closing PR.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title is conventional, specific, and accurately summarizes the docs-only subagent thread harness design and PR0/PR1 plan.
Description check ✅ Passed The description is detailed and covers the design, rollout, and linked issue, though several template sections are left unfilled.
Out of Scope Changes check ✅ Passed No code or unrelated functional changes were introduced; the diff stays within documentation for the subagent thread harness.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands.

@github-actions github-actions Bot added the contributor: core 20+ merged PRs label Jun 24, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/plans/2026-06-10-triggered-run-delivery-owner-scope.md`:
- Around line 398-404: The validation steps in this section don’t match the
repo’s required pre-review gate. Update the commands in this plan to mirror the
contribution checklist by using the strict fmt/clippy variants, adding the
required build step, and keeping the full test run; reference the existing “Step
2: Full test suite” block and the surrounding quality-gate commands so they
align with the mandated pre-PR checks.
- Around line 313-323: Remove the contradictory fallback claim from the commit
message text in the triggered-run delivery owner scope plan and keep the
description aligned with the stated scope. Update the wording around the
trigger/thread ownership change so it only says new first-fire bindings persist
the creator as owner and existing Direct-route bindings are not backfilled,
without asserting that any delivery-side fallback covers them. Use the commit
message block in the plan document as the target for the edit.

In `@docs/plans/2026-06-18-triggers-on-rootfilesystem.md`:
- Around line 188-194: The thread identity derivation is inconsistent: one path
includes creator_user_id while other paths already rely on
fire.identity.route_thread_id() using tenant_id, trigger_id, and fire_slot only.
Update the canonical thread_id hashing rules so every use of ensure_thread and
TriggerFireIdentity.route_thread_id() shares the same input set, and remove any
alternate derivation in the trusted-trigger fire path. Make sure all
recovery/idempotency-related lookups use that single canonical hash definition.

In `@docs/plans/2026-06-22-reborn-runtime-hang-investigation.md`:
- Around line 5-7: The incident note in the reborn-runtime-hang-investigation
document exposes a full user UUID, so redact or pseudonymize that identifier
while preserving the troubleshooting context. Update the text in the affected
incident summary to use a masked user reference instead of the raw UUID, keeping
the same details about the tenant and outage but removing the direct user
identifier.

In `@docs/reborn/2026-06-16-memory-how-it-works.md`:
- Around line 13-14: Update the memory search docs so they match the runtime
behavior in dispatch_search: the default Reborn build disables vector search
explicitly via with_vector(false), not because embeddings are missing, and the
defaults are limit 5 with pre_fusion_limit set to max(limit, 20). Adjust the
affected description in memory-how-it-works.md to reflect these actual defaults
and the fact that only FTS runs by default.

In `@docs/superpowers/plans/2026-06-16-persistent-user-context-profile.md`:
- Around line 924-933: The integration test for profile rendering is asserting
the old compact location format, but the runtime now uses sanitized
untrusted-user-data framing for location. Update the assertions in
profile_set_then_runtime_context_renders_local_time to match the current
rendered contract by checking the explicit untrusted-user-data output around
location while still verifying the Asia/Tokyo local time and locale fields,
using the same capability-dispatch and loop-driver host flow to locate the
behavior.
- Around line 758-764: The existing profile merge logic in the retry loop is
swallowing corrupt or non-JSON stored documents by using
serde_json::from_slice(...).unwrap_or_default(), which can overwrite bad data
with an empty object. Update the merge path in this block to fail closed: when
deserialization of current bytes fails, return operation_error() instead of
defaulting, and keep the normal flow only for valid JSON or missing documents.
Use the current/read_document handling and the existing content_bytes_sha256
helper as the anchor points for locating the fix.
- Around line 261-263: The plan currently mixes two profile contracts: one part
says to use only UserProfileContext.timezone with no ResolvedUserProfile, while
later steps still reference ResolvedUserProfile and user_timezone. Normalize the
entire document to a single model by updating all affected steps/snippets to use
UserProfileContext directly, removing any remaining ResolvedUserProfile
mentions, and keeping the LoopRunContext-based trait shape consistent with the
intended load_identity_candidates pattern.

In `@docs/superpowers/plans/2026-06-23-subagent-thread-harness-pr0-pr1.md`:
- Around line 278-279: The boot-recovery design is inconsistent between the
recovery driver and the spec, so reconcile it to a single canonical flow. Update
the recovery plan around list_parents_with_unclosed_edges and the boot-recovery
section to match the spec’s active-parent-run enumeration approach, or
explicitly revise the spec if the edge-store scan is intended to be the source
of truth. Make sure the description of TurnStateStore and any recovery
orchestration steps use one consistent driver so there is no split
implementation path.

In `@docs/superpowers/specs/2026-06-16-persistent-user-context-profile-design.md`:
- Around line 82-83: The profile-rendering spec is treating free-text location
as safe inline context, which weakens the prompt-injection boundary. Update the
user profile examples and guidance to frame location as untrusted user-provided
data, with explicit “not instructions” language and quoted rendering instead of
`location=...` style inline context. Keep the timezone behavior unchanged in the
existing time-rendering branch, and make sure the spec wording aligns with the
tested contract for the profile rendering logic.

In `@docs/superpowers/specs/2026-06-23-subagent-thread-harness-design.md`:
- Around line 113-114: Update the Subagent Thread Harness design spec so the
awaited edge uses the gate-ref-compatible hyphenated format, not the
colon-separated placeholder. In the section describing the real GateRef, replace
the `gate:subagent-await:<child_run_id>` example with the same naming convention
used by the implementation plan, and keep the wording aligned with the resume
precondition behavior for blocked parents.

In `@scripts/loadtest/parse_concurrency.py`:
- Around line 174-177: The throughput calculation in the concurrency parsing
logic can divide by a zero duration when `max(finished_times) -
min(started_times)` is 0, so update the branch in `parse_concurrency.py` to
mirror the existing duration guard used later in the file. In the throughput
section that uses `n_finished`, `n_started`, `finished_times`, and
`started_times`, compute throughput only when the duration is greater than 0;
otherwise fall back to 0.0 to avoid `ZeroDivisionError`.

In `@scripts/loadtest/stub_llm_server.py`:
- Line 93: Bind the stub LLM server to loopback instead of all interfaces by
updating the ThreadingHTTPServer setup in stub_llm_server.py so the server
listens on 127.0.0.1 for args.port; keep the change localized to the server
initialization path and ensure any related startup or logging messages in
stub_llm_server reflect the loopback-only binding.

In `@scripts/loadtest/sweep.sh`:
- Around line 15-16: The prerequisites comment in sweep.sh is stale because the
load test Python tools are stdlib-only, so remove the mention of httpx and pip
install -r requirements.txt from the header. Update the prerequisite text near
the sweep.sh comment to list only the actual required tools (python3, cargo,
curl) and keep it consistent with the headers in loadgen.py and
stub_llm_server.py.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: e6626008-9618-4ed8-b4a3-6e32956864fc

📥 Commits

Reviewing files that changed from the base of the PR and between e754655 and f0780ac.

📒 Files selected for processing (18)
  • docs/plans/2026-06-10-triggered-run-delivery-owner-scope.md
  • docs/plans/2026-06-14-trusted-trigger-classification-as-type-4851.md
  • docs/plans/2026-06-15-google-oauth-account-fork-fix.md
  • docs/plans/2026-06-18-triggers-on-rootfilesystem.md
  • docs/plans/2026-06-22-reborn-runtime-hang-investigation.md
  • docs/plans/2026-06-22-turn-state-lock-wedge-fix.md
  • docs/plans/2026-06-23-github-pat-auth-bugs-1-2.md
  • docs/reborn/2026-06-08-subagent-durability-spec.md
  • docs/reborn/2026-06-16-memory-how-it-works.md
  • docs/superpowers/plans/2026-06-16-persistent-user-context-profile.md
  • docs/superpowers/plans/2026-06-23-subagent-thread-harness-pr0-pr1.md
  • docs/superpowers/specs/2026-06-16-persistent-user-context-profile-design.md
  • docs/superpowers/specs/2026-06-23-subagent-thread-harness-design.md
  • scripts/loadtest/loadgen.py
  • scripts/loadtest/parse_concurrency.py
  • scripts/loadtest/run_serve.sh
  • scripts/loadtest/stub_llm_server.py
  • scripts/loadtest/sweep.sh

Comment on lines +313 to +323
git commit -m "fix(conversations): thread trigger creator into trusted binding resolution

Trigger fires previously resolved conversation bindings with
trusted_owner_user_id=None, so triggered-run TurnScopes carried
ActorFallback ownership. Slack delivery then reconstructed a thread
scope with owner None while the thread was stored with Some(actor),
silently skipping the outbound push. New first-fire bindings now
persist the creator as owner. Existing Direct-route bindings are
deliberately not backfilled (scope-equality compat for in-flight
runs); the delivery-side fallback covers them."
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Remove the fallback claim from the commit message.

Line 322 says “the delivery-side fallback covers them,” but Line 17 explicitly declares that fallback out of scope. This contradiction will mislead implementation/review.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/2026-06-10-triggered-run-delivery-owner-scope.md` around lines 313
- 323, Remove the contradictory fallback claim from the commit message text in
the triggered-run delivery owner scope plan and keep the description aligned
with the stated scope. Update the wording around the trigger/thread ownership
change so it only says new first-fire bindings persist the creator as owner and
existing Direct-route bindings are not backfilled, without asserting that any
delivery-side fallback covers them. Use the commit message block in the plan
document as the target for the edit.

Comment on lines +398 to +404
Run: `cargo fmt && cargo clippy --all --benches --tests --examples --all-features 2>&1 | tail -5 && cargo test -p ironclaw_conversations -p ironclaw_reborn_composition -p ironclaw_turns -p ironclaw_triggers --features slack-v2-host-beta`
Expected: fmt clean, clippy zero warnings, all tests pass (`ironclaw_turns` included because the fix consumes `TurnScope::new_with_owner`)

- [ ] **Step 2: Full test suite**

Run: `cargo test`
Expected: PASS (catches any workspace-wide fallout from the `TrustedInboundTurnRequest` signature change — the type is `pub(crate)`, so fallout should be zero outside `ironclaw_conversations`)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Align the quality gate commands with repo-required pre-PR checks.

This block skips the required strict flags and cargo build listed in repo contribution rules; it should mirror the mandated pre-review gate command set.

As per coding guidelines: “Run the local validation checks required before requesting a review… cargo fmt --all -- --check, cargo clippy ... -- -D warnings, cargo build, cargo test.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/2026-06-10-triggered-run-delivery-owner-scope.md` around lines 398
- 404, The validation steps in this section don’t match the repo’s required
pre-review gate. Update the commands in this plan to mirror the contribution
checklist by using the strict fmt/clippy variants, adding the required build
step, and keeping the full test run; reference the existing “Step 2: Full test
suite” block and the surrounding quality-gate commands so they align with the
mandated pre-PR checks.

Source: Coding guidelines

Comment on lines +188 to +194
The trusted-trigger fire path supplies a deterministic `thread_id` derived from
`TriggerFireIdentity` to `ensure_thread`
(`crates/ironclaw_threads/src/contract.rs:300`,
`filesystem_service.rs:697-724`, `CasExpectation::Absent`), instead of the binding
layer minting a random UUID. The canonical `thread_id` for a fire is
`H(tenant_id, creator_user_id, trigger_id, fire_slot)`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Use one canonical hash input set for thread identity.

Line 193 adds creator_user_id to thread_id derivation, but Lines 84-89 and 385-387 rely on fire.identity.route_thread_id() (tenant/trigger/fire_slot). Mixed derivation rules will break idempotent lookup/recovery.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/2026-06-18-triggers-on-rootfilesystem.md` around lines 188 - 194,
The thread identity derivation is inconsistent: one path includes
creator_user_id while other paths already rely on
fire.identity.route_thread_id() using tenant_id, trigger_id, and fire_slot only.
Update the canonical thread_id hashing rules so every use of ensure_thread and
TriggerFireIdentity.route_thread_id() shares the same input set, and remove any
alternate derivation in the trusted-trigger fire path. Make sure all
recovery/idempotency-related lookups use that single canonical hash definition.

Comment on lines +5 to +7
A local-dev `ironclaw-reborn serve` instance (tenant `reborn-cli`, single user
`e3aa23b0-e23e-4e1f-9b14-b8954e543200`) became **completely unresponsive to all
users for ~30 minutes**. The user has no triggers of their own, but other users'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Redact direct user identifiers in incident docs.

Line 6 publishes a full user UUID. Keep incident utility, but mask or pseudonymize user identifiers to avoid persisting sensitive identifiers in repo history.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/2026-06-22-reborn-runtime-hang-investigation.md` around lines 5 -
7, The incident note in the reborn-runtime-hang-investigation document exposes a
full user UUID, so redact or pseudonymize that identifier while preserving the
troubleshooting context. Update the text in the affected incident summary to use
a masked user reference instead of the raw UUID, keeping the same details about
the tenant and outage but removing the direct user identifier.

Comment on lines +13 to +14
- Search is **hybrid FTS + vector with RRF fusion**, but in the default Reborn build **no embedding provider is wired**, so vector search is inactive and only FTS runs.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Search behavior docs are out of sync with runtime implementation.

Line 13 and Lines 179-182 currently describe vector as disabled due to missing embeddings and list defaults (limit 20, pre_fusion_limit 50, both branches on). In crates/ironclaw_host_runtime/src/first_party_tools/memory.rs (dispatch_search), vector is explicitly disabled via .with_vector(false), default limit is 5, and pre_fusion_limit is max(limit, 20).

Suggested doc correction
-- Search is **hybrid FTS + vector with RRF fusion**, but in the default Reborn build **no embedding provider is wired**, so vector search is inactive and only FTS runs.
+- Search infrastructure supports **hybrid FTS + vector with RRF fusion**, but the current runtime path (`dispatch_search`) explicitly sets `.with_vector(false)`, so only FTS executes today.
...
-`MemorySearchResult` = `{ path, score, snippet, full_text_rank, vector_rank }`. Search defaults: `limit` 20 (capped per capability schema to ≤20), `pre_fusion_limit` 50, both branches on, RRF k=60.
+`MemorySearchResult` = `{ path, score, snippet, full_text_rank, vector_rank }`. In the current runtime capability path, default `limit` is 5 (clamped to 1..20), `pre_fusion_limit` is `max(limit, 20)`, and vector is disabled for this call path.
 
-**Caveat:** the default Reborn `build_backend` wiring passes **no embedding provider**, so the vector branch is inert; only FTS contributes. Hybrid search is built but runs single-modality until an embedding provider is wired at the host boundary.
+**Caveat:** hybrid search exists in the backend, but this capability path currently forces FTS-only by request construction (`with_vector(false)`).

Also applies to: 179-182

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reborn/2026-06-16-memory-how-it-works.md` around lines 13 - 14, Update
the memory search docs so they match the runtime behavior in dispatch_search:
the default Reborn build disables vector search explicitly via
with_vector(false), not because embeddings are missing, and the defaults are
limit 5 with pre_fusion_limit set to max(limit, 20). Adjust the affected
description in memory-how-it-works.md to reflect these actual defaults and the
fact that only FTS runs by default.

Comment on lines +113 to +114
**real** `GateRef` that names the edge (e.g. `gate:subagent-await:<child_run_id>`)
— not a synthetic placeholder, but a genuine reference to the awaited edge — so

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

GateRef format conflicts with the PR plan and can break resume preconditions.

Line 113 uses gate:subagent-await:<child_run_id>, but the implementation plan requires gate:subagent-await-<child_run_id> (hyphen suffix) for gate-ref compatibility. This mismatch can cause blocked-parent resume to miss its expected gate key.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-06-23-subagent-thread-harness-design.md` around
lines 113 - 114, Update the Subagent Thread Harness design spec so the awaited
edge uses the gate-ref-compatible hyphenated format, not the colon-separated
placeholder. In the section describing the real GateRef, replace the
`gate:subagent-await:<child_run_id>` example with the same naming convention
used by the implementation plan, and keep the wording aligned with the resume
precondition behavior for blocked parents.

Comment on lines +244 to +250
**Path identity & access patterns.** The edge is keyed by
`(parent_run_id, child_run_id)`. Two access patterns, both cheap:
- **Live settle:** a child knows its own `parent_run_id` (lineage), so the edge
is a direct path lookup — no by-child index.
- **Parent drain:** the parent lists its own directory
`/await-edges/<parent_run_id>/` (single-level `list_dir`).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Await-edge directory path is inconsistent within this spec.

Line 219 defines /turns/subagent-await-edges/..., but Line 249 says parent drain lists /await-edges/<parent_run_id>/. Please normalize to one canonical path; otherwise store/list implementations will diverge.

Comment thread scripts/loadtest/parse_concurrency.py Outdated
Comment on lines +174 to +177
if n_finished > 1 and n_started > 0:
throughput = n_finished / (max(finished_times) - min(started_times))
else:
throughput = 0.0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Guard against zero-duration division.

max(finished_times) - min(started_times) can be 0 (single instantaneous run, or all timestamps colliding at parser resolution), raising ZeroDivisionError. The bucket path at Line 185 already guards duration > 0; mirror it here.

Proposed fix
-    if n_finished > 1 and n_started > 0:
-        throughput = n_finished / (max(finished_times) - min(started_times))
-    else:
-        throughput = 0.0
+    span = max(finished_times) - min(started_times) if (n_finished > 1 and n_started > 0) else 0.0
+    throughput = n_finished / span if span > 0 else 0.0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if n_finished > 1 and n_started > 0:
throughput = n_finished / (max(finished_times) - min(started_times))
else:
throughput = 0.0
span = max(finished_times) - min(started_times) if (n_finished > 1 and n_started > 0) else 0.0
throughput = n_finished / span if span > 0 else 0.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/loadtest/parse_concurrency.py` around lines 174 - 177, The throughput
calculation in the concurrency parsing logic can divide by a zero duration when
`max(finished_times) - min(started_times)` is 0, so update the branch in
`parse_concurrency.py` to mirror the existing duration guard used later in the
file. In the throughput section that uses `n_finished`, `n_started`,
`finished_times`, and `started_times`, compute throughput only when the duration
is greater than 0; otherwise fall back to 0.0 to avoid `ZeroDivisionError`.

Comment thread scripts/loadtest/stub_llm_server.py Outdated
latency_s = args.latency_ms / 1000.0
handler = make_handler(latency_s)

server = ThreadingHTTPServer(("0.0.0.0", args.port), handler)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win

Bind the stub to loopback, not 0.0.0.0.

The harness already pins the real server to 127.0.0.1 (run_serve.sh Line 120) and the stub URL is http://127.0.0.1:9999/v1. Binding to all interfaces needlessly exposes the canned-response endpoint on the LAN with no auth.

Proposed fix
-    server = ThreadingHTTPServer(("0.0.0.0", args.port), handler)
+    server = ThreadingHTTPServer(("127.0.0.1", args.port), handler)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
server = ThreadingHTTPServer(("0.0.0.0", args.port), handler)
server = ThreadingHTTPServer(("127.0.0.1", args.port), handler)
🧰 Tools
🪛 Ruff (0.15.18)

[error] 93-93: Possible binding to all interfaces

(S104)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/loadtest/stub_llm_server.py` at line 93, Bind the stub LLM server to
loopback instead of all interfaces by updating the ThreadingHTTPServer setup in
stub_llm_server.py so the server listens on 127.0.0.1 for args.port; keep the
change localized to the server initialization path and ensure any related
startup or logging messages in stub_llm_server reflect the loopback-only
binding.

Comment thread scripts/loadtest/sweep.sh Outdated
Comment on lines +15 to +16
# Prerequisites: python3, httpx installed (pip install -r requirements.txt),
# cargo in PATH, curl in PATH.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Stale prerequisite: no httpx/requirements.txt needed.

Both Python tools state stdlib-only (loadgen.py Line 14, stub_llm_server.py header), so httpx installed (pip install -r requirements.txt) is wrong and sends users on a dead-end setup. Drop it.

Proposed fix
-# Prerequisites: python3, httpx installed (pip install -r requirements.txt),
-#                cargo in PATH, curl in PATH.
+# Prerequisites: python3 (stdlib only), cargo in PATH, curl in PATH.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Prerequisites: python3, httpx installed (pip install -r requirements.txt),
# cargo in PATH, curl in PATH.
# Prerequisites: python3 (stdlib only), cargo in PATH, curl in PATH.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/loadtest/sweep.sh` around lines 15 - 16, The prerequisites comment in
sweep.sh is stale because the load test Python tools are stdlib-only, so remove
the mention of httpx and pip install -r requirements.txt from the header. Update
the prerequisite text near the sweep.sh comment to list only the actual required
tools (python3, cargo, curl) and keep it consistent with the headers in
loadgen.py and stub_llm_server.py.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0780accb7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/loadtest/sweep.sh Outdated
Comment on lines +71 to +73
if ! curl -sf "http://127.0.0.1:9999/v1/models" >/dev/null 2>&1; then
echo "ERROR: stub LLM server did not start" >&2
exit 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail when the stub port is already occupied

When port 9999 is already held by a previous stub or another OpenAI-compatible service that answers /v1/models, the newly spawned stub_llm_server.py exits with EADDRINUSE but this readiness curl succeeds against the old process. The sweep then records a dead STUB_PID and runs against whatever latency/model behavior the stale server has, corrupting the throughput results; check that the spawned PID is still alive and/or fail on an occupied port before accepting readiness.

Useful? React with 👍 / 👎.

@railway-app

railway-app Bot commented Jun 24, 2026

Copy link
Copy Markdown

🚅 Deployed to the ironclaw-pr-5176 environment in ironclaw-ci-preview

Service Status Web Updated (UTC)
ironclaw ✅ Success (View Logs) Web Jun 24, 2026 at 6:13 am

Foundational design + bite-sized TDD plan for the Reborn subagent thread
harness: subagents become first-class, addressable, resumable threads (parent
and human can attach), with durable delivery via a single ScopedFilesystem
await-edge per parent-waits-on-child (no SQL gate stack). Design hardened over
3 adversarial review rounds; PR0+PR1 plan converged over 7.

Removes the superseded 2026-06-08 subagent-durability sub-spec (replaced by this
design). Other subagent docs (spawn mechanics, compaction, planner) describe
live shipped code and are intentionally kept.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@henrypark133 henrypark133 force-pushed the docs/subagent-thread-harness-design branch from f0780ac to 613a307 Compare June 24, 2026 06:01
@github-actions github-actions Bot added size: XS < 10 changed lines (excluding docs) and removed size: XL 500+ changed lines labels Jun 24, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/plans/2026-06-23-subagent-thread-harness-pr0-pr1.md`:
- Around line 684-689: The await-edge abstraction is being defined in the
higher-level reborn crate while SubagentSpawnDeps lives in loop_support, which
creates a reverse dependency risk or cycle. Move AwaitEdgeWriter into a
lower/shared crate alongside SubagentSpawnDeps, then have
FilesystemAwaitEdgeStore in await_edge.rs implement that trait and keep the
optional await_edge_store field typed against the shared trait. Update the
SubagentSpawnDeps wiring and any finish_spawn call sites to use the relocated
trait without introducing a reborn dependency back into loop_support.
- Around line 690-691: The runtime path reference in the Task 1.6c note is
inconsistent with the rest of the plan and points to the wrong crate. Update the
production-site reference from ironclaw_reborn/src/runtime.rs to
ironclaw_reborn_composition/src/runtime.rs, and keep the surrounding
SubagentSpawnDeps guidance unchanged so contributors implement the wiring in the
correct runtime module.

In `@docs/superpowers/specs/2026-06-23-subagent-thread-harness-design.md`:
- Around line 257-260: The boot recovery description conflicts with the PR1
await-edge store contract by relying on active parent runs instead of the
store-driven recovery entrypoint. Update the recovery spec around the boot
algorithm to use list_parents_with_unclosed_edges() as the source of truth, then
for each returned parent inspect its edge records and reconcile terminal or
abandoned children via get_run_state(child) and edge closure logic. Keep the
wording aligned with the existing await-edge store symbols so the boot flow is
clearly derived from the store contract rather than a separate active-run query.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: f075e771-bba5-4c30-8d34-24b0c63d4cc2

📥 Commits

Reviewing files that changed from the base of the PR and between f0780ac and 613a307.

📒 Files selected for processing (3)
  • docs/reborn/2026-06-08-subagent-durability-spec.md
  • docs/superpowers/plans/2026-06-23-subagent-thread-harness-pr0-pr1.md
  • docs/superpowers/specs/2026-06-23-subagent-thread-harness-design.md

Comment on lines +684 to +689
- Consumes: `FilesystemAwaitEdgeStore::write_open` + `AwaitEdge` (Tasks 1.1/1.2), `await_edge_gate_ref` (Task 1.1), `RebornRuntimeInput.subagent_v2_enabled` threaded into `SubagentSpawnDeps` (add a `subagent_v2_enabled: bool` field + optional `await_edge_store: Option<Arc<dyn AwaitEdgeWriter>>` where `AwaitEdgeWriter` is a tiny object-safe trait wrapping `write_open` — keeps `SubagentSpawnDeps` non-generic).
- Produces: when `subagent_v2_enabled`, `finish_spawn` writes an `open` await-edge (in addition to the existing `record_awaited_child`) and parks the parent in `BlockedDependentRun` with `BlockedReason::AwaitDependentRun { gate_ref: await_edge_gate_ref(child_run_id) }`.

- [ ] **Step 1: Read** `finish_spawn` (~847) and how the parent is currently blocked (`completion_observer.rs:3050` sets `BlockedDependentRun`). Note the exact block-construction call.
- [ ] **Step 2: Define `AwaitEdgeWriter`** (object-safe) in `await_edge.rs`: `#[async_trait] pub trait AwaitEdgeWriter: Send + Sync { async fn write_open(&self, scope: &TurnScope, parent: &TurnRunId, child: &TurnRunId, edge: &AwaitEdge) -> Result<(), AwaitEdgeError>; }` and impl it for `FilesystemAwaitEdgeStore<F>`. (This thin erasure trait exists solely to keep `SubagentSpawnDeps` non-generic — add a one-line comment saying so.) Add `subagent_v2_enabled: bool` and `await_edge_store: Option<Arc<dyn AwaitEdgeWriter>>` to `SubagentSpawnDeps`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | 🏗️ Heavy lift

Avoid a reverse crate dependency for AwaitEdgeWriter.

SubagentSpawnDeps is in crates/ironclaw_loop_support, but Line 688 defines AwaitEdgeWriter in crates/ironclaw_reborn/src/subagent/await_edge.rs and then threads it into SubagentSpawnDeps. That implies loop_support depends on reborn (or creates a cycle). Put the trait in loop_support (or another lower/shared crate) and implement it in reborn.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-06-23-subagent-thread-harness-pr0-pr1.md` around
lines 684 - 689, The await-edge abstraction is being defined in the higher-level
reborn crate while SubagentSpawnDeps lives in loop_support, which creates a
reverse dependency risk or cycle. Move AwaitEdgeWriter into a lower/shared crate
alongside SubagentSpawnDeps, then have FilesystemAwaitEdgeStore in await_edge.rs
implement that trait and keep the optional await_edge_store field typed against
the shared trait. Update the SubagentSpawnDeps wiring and any finish_spawn call
sites to use the relocated trait without introducing a reborn dependency back
into loop_support.

Comment on lines +690 to +691
**Architecture-rule compliance (B2):** `Option<Arc<dyn AwaitEdgeWriter>>` is a flag-gated optional dependency, which `.claude/rules/architecture.md` rule 2 flags. Annotate the field: `// arch-exempt: optional_arc — flag-gated v2 path, set iff subagent.v2_enabled; removed when v2 becomes default (plan 2026-06-23 cleanup PR)`. **Ripple:** `SubagentSpawnDeps` is a struct-literal with ~15 test construction sites + 1 production site (`ironclaw_reborn/src/runtime.rs:556`). Every site must add `subagent_v2_enabled: false, await_edge_store: None,`. Enumerate them first: `rg -ln "SubagentSpawnDeps" crates/` then grep each file for the struct-literal construction (rustfmt may break `SubagentSpawnDeps {` across lines, so don't rely on a brace-adjacent pattern); update each in this task (mechanical two-field additions: `subagent_v2_enabled: false, await_edge_store: None,`). The production site (`ironclaw_reborn/src/runtime.rs:556`) is wired for real in Task 1.6c.
- [ ] **Step 3: Write the failing test** — spawn with `subagent_v2_enabled=true` + a captured in-memory `AwaitEdgeWriter`; assert an `open` edge is written for the child. With `false`, assert none.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fix the runtime path inconsistency in the task instructions.

Line 690 points to crates/ironclaw_reborn/src/runtime.rs:556, but this plan’s file map and Task 1.6c both target crates/ironclaw_reborn_composition/src/runtime.rs. This mismatch can send implementation to the wrong crate.

Suggested doc fix
-- The production site (`ironclaw_reborn/src/runtime.rs:556`) is wired for real in Task 1.6c.
+- The production site (`ironclaw_reborn_composition/src/runtime.rs`) is wired for real in Task 1.6c.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**Architecture-rule compliance (B2):** `Option<Arc<dyn AwaitEdgeWriter>>` is a flag-gated optional dependency, which `.claude/rules/architecture.md` rule 2 flags. Annotate the field: `// arch-exempt: optional_arc — flag-gated v2 path, set iff subagent.v2_enabled; removed when v2 becomes default (plan 2026-06-23 cleanup PR)`. **Ripple:** `SubagentSpawnDeps` is a struct-literal with ~15 test construction sites + 1 production site (`ironclaw_reborn/src/runtime.rs:556`). Every site must add `subagent_v2_enabled: false, await_edge_store: None,`. Enumerate them first: `rg -ln "SubagentSpawnDeps" crates/` then grep each file for the struct-literal construction (rustfmt may break `SubagentSpawnDeps {` across lines, so don't rely on a brace-adjacent pattern); update each in this task (mechanical two-field additions: `subagent_v2_enabled: false, await_edge_store: None,`). The production site (`ironclaw_reborn/src/runtime.rs:556`) is wired for real in Task 1.6c.
- [ ] **Step 3: Write the failing test** — spawn with `subagent_v2_enabled=true` + a captured in-memory `AwaitEdgeWriter`; assert an `open` edge is written for the child. With `false`, assert none.
**Architecture-rule compliance (B2):** `Option<Arc<dyn AwaitEdgeWriter>>` is a flag-gated optional dependency, which `.claude/rules/architecture.md` rule 2 flags. Annotate the field: `// arch-exempt: optional_arc — flag-gated v2 path, set iff subagent.v2_enabled; removed when v2 becomes default (plan 2026-06-23 cleanup PR)`. **Ripple:** `SubagentSpawnDeps` is a struct-literal with ~15 test construction sites + 1 production site (`ironclaw_reborn/src/runtime.rs:556`). Every site must add `subagent_v2_enabled: false, await_edge_store: None,`. Enumerate them first: `rg -ln "SubagentSpawnDeps" crates/` then grep each file for the struct-literal construction (rustfmt may break `SubagentSpawnDeps {` across lines, so don't rely on a brace-adjacent pattern); update each in this task (mechanical two-field additions: `subagent_v2_enabled: false, await_edge_store: None,`). The production site (`ironclaw_reborn_composition/src/runtime.rs`) is wired for real in Task 1.6c.
- [ ] **Step 3: Write the failing test** — spawn with `subagent_v2_enabled=true` + a captured in-memory `AwaitEdgeWriter`; assert an `open` edge is written for the child. With `false`, assert none.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-06-23-subagent-thread-harness-pr0-pr1.md` around
lines 690 - 691, The runtime path reference in the Task 1.6c note is
inconsistent with the rest of the plan and points to the wrong crate. Update the
production-site reference from ironclaw_reborn/src/runtime.rs to
ironclaw_reborn_composition/src/runtime.rs, and keep the surrounding
SubagentSpawnDeps guidance unchanged so contributors implement the wiring in the
correct runtime module.

Comment on lines +257 to +260
- *Boot:* enumerate active (non-terminal) parent runs; for each, `list_dir` its
edge directory; for each open edge, `get_run_state(child)`; settle the terminal
ones, close `abandoned` ones whose parent run is terminal. This is driven by
active runs — **no global recursive scan**.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Boot recovery algorithm conflicts with the PR1 store contract.

Line 257 says boot recovery enumerates active parent runs, but the PR1 contract defines recovery from list_parents_with_unclosed_edges() in the await-edge store (no active-run query dependency). This mismatch can cause implementation drift and missed recovery cases.

Suggested doc fix
- - *Boot:* enumerate active (non-terminal) parent runs; for each, `list_dir` its
-   edge directory; for each open edge, `get_run_state(child)`; settle the terminal
-   ones, close `abandoned` ones whose parent run is terminal. This is driven by
-   active runs — **no global recursive scan**.
+ - *Boot:* enumerate parents with unclosed await-edges from the await-edge store
+   (e.g., `list_parents_with_unclosed_edges`), then `list_dir` each parent edge
+   directory; for each open edge, `get_run_state(child)`; settle terminal ones,
+   and close `abandoned` when parent is terminal. This is driven by the edge
+   store — **no global recursive scan**.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- *Boot:* enumerate active (non-terminal) parent runs; for each, `list_dir` its
edge directory; for each open edge, `get_run_state(child)`; settle the terminal
ones, close `abandoned` ones whose parent run is terminal. This is driven by
active runs — **no global recursive scan**.
- *Boot:* enumerate parents with unclosed await-edges from the await-edge store
(e.g., `list_parents_with_unclosed_edges`), then `list_dir` each parent edge
directory; for each open edge, `get_run_state(child)`; settle terminal ones,
and close `abandoned` when parent is terminal. This is driven by the edge
store — **no global recursive scan**.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-06-23-subagent-thread-harness-design.md` around
lines 257 - 260, The boot recovery description conflicts with the PR1 await-edge
store contract by relying on active parent runs instead of the store-driven
recovery entrypoint. Update the recovery spec around the boot algorithm to use
list_parents_with_unclosed_edges() as the source of truth, then for each
returned parent inspect its edge records and reconcile terminal or abandoned
children via get_run_state(child) and edge closure logic. Keep the wording
aligned with the existing await-edge store symbols so the boot flow is clearly
derived from the store contract rather than a separate active-run query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules scope: docs Documentation size: XS < 10 changed lines (excluding docs)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant