fix: error-output sanitization + process-lifecycle hardening (#111) #118

Merged: dtzp555-max merged 1 commit into main from fix/111-error-output-lifecycle on May 31, 2026

@dtzp555-max

Owner

Summary

Fixes the four error-output / process-lifecycle findings from the 2026-05-31 audit (#111, 3×P2 + 1×P3).

  1. sanitizeError() helper: the streaming error paths sent raw claude error_message/stderr (including home-dir and credential-file paths) to clients, while the non-streaming path already redacted them. One helper is now applied at all 9 client-facing jsonResponse/sendSSE error emits, and the 3 pre-existing inline .replace() sites are de-duplicated. Operator logs and admin-gated endpoints are left raw by design.
  2. res.on("close") SIGKILL escalation: a client disconnect sent only SIGTERM, so a SIGTERM-resistant child held its slot until the request timeout (a narrow case of #37, "ops: Mac OCP recurring daily hang - 500 concurrency (8/8) on v3.10.0", on the hottest exit path). The handler now escalates to SIGKILL 5s after SIGTERM, and the timer is cleared on exit. Per review, this is gated on the child still being alive (exitCode===null && signalCode===null), so the normal-success close no longer fires a spurious SIGTERM or leaks the timer; killTimer.unref() is also called.
  3. Quota TOCTOU: documented as best-effort (inline + README). Overshoot is bounded by MAX_CONCURRENT, and cache hits are uncounted. Documentation was chosen over an in-flight counter to avoid a decrement-on-all-paths liability (the #37 class) on a low-blast-radius internal family limiter.
  4. [P3] overallTimer cleared on semantic completion: new clearOverallTimer() (clears the timer ONLY and never touches the `cleaned` slot flag, so no slot leak) is called in the streaming stop-success path; cleanup() on exit still clears the timer and decrements the slot.
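
As an illustration of item 1, a single redaction helper might look roughly like the sketch below. The regex, the placeholder string, and the covered path roots are assumptions made for this example; the actual pattern in server.mjs is not shown in this description.

```javascript
// Hypothetical pattern: absolute paths under common home-directory roots.
// The real path-strip regex in server.mjs may differ.
const HOME_PATH_RE = /\/(?:Users|home|root)\/[^\s"':]+/g;

// Redact home-dir / credential-file paths before an error string is sent
// to a client. Operator logs would keep the raw message instead.
function sanitizeError(message) {
  return String(message ?? "").replace(HOME_PATH_RE, "[redacted-path]");
}

// Example: the same helper would wrap every client-facing error emit.
console.log(sanitizeError("ENOENT: open /Users/alice/.claude/credentials.json failed"));
```

Routing every client-facing emit through one function means a later fix to the pattern (for example, the over-redaction noted by the reviewer) only has to be made in one place.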

ALIGNMENT.md (server.mjs hard requirements)

  1. cli.js citation: N/A — error-shaping/process-lifecycle/rate-limit-doc forward no Anthropic operation. Rule 2.
  2. CI blacklist: no blacklisted tokens / port literals introduced; alignment.yml passes.
  3. Independent reviewer (Iron Rule 10): a fresh-context opus reviewer verified the two CRITICAL concerns clean. First, clearOverallTimer does not reintroduce the #37 slot-leak class: it only clears the timer and never sets `cleaned`, and the slot is still decremented via the untouched proc.once("exit", cleanup). Second, there is no SIGKILL double-kill hazard. The review also confirmed full sanitizeError coverage, quota-doc honesty, and ALIGNMENT N/A. Verdict: APPROVE WITH MINOR. MINOR #1 (success-path kill-timer leak) was folded in; MINOR #2 (pre-existing path-regex over-redaction of ratios/URLs) was left out of scope. npm test: 152 passed, 0 failed.
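
The slot-leak argument in the reviewer note can be sketched as follows. This is a minimal model, not the server.mjs code: createRequestLifecycle, releaseSlot, and onTimeout are hypothetical names introduced for the example, and only clearOverallTimer, cleanup, and the `cleaned` flag come from the description above.

```javascript
// Hypothetical per-request lifecycle state, assuming one timer and one slot.
function createRequestLifecycle(timeoutMs, onTimeout, releaseSlot) {
  let cleaned = false;
  let overallTimer = setTimeout(onTimeout, timeoutMs);

  // Clears the timer ONLY. If this also set `cleaned`, cleanup() would
  // return early on child exit and the slot would never be released.
  function clearOverallTimer() {
    if (overallTimer !== null) {
      clearTimeout(overallTimer);
      overallTimer = null;
    }
  }

  // Runs on child exit. Idempotent: releases the slot exactly once.
  function cleanup() {
    if (cleaned) return;
    cleaned = true;
    clearOverallTimer();
    releaseSlot();
  }

  return { clearOverallTimer, cleanup };
}
```

In this model the streaming stop-success path would call clearOverallTimer(), and the child's exit handler would still call cleanup(), so the slot count drops once regardless of which happens first.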

Closes #111.

🤖 Generated with Claude Code

Four findings from the 2026-05-31 audit (3×P2 + 1×P3), all in the error/process
lifecycle layer:

1. sanitizeError() helper — streaming error paths sent raw claude error_message /
   stderr to the client, leaking home-dir / credential-file paths that the
   non-streaming path already redacted. Factored the path-strip regex into one
   helper and applied it at all 9 client-facing jsonResponse/sendSSE error emits;
   de-duped the 3 pre-existing inline .replace() sites. Operator-log calls
   (logEvent/trackError) and admin-gated endpoints left raw by design.

2. res.on("close") SIGKILL escalation — a client disconnect sent only SIGTERM; a
   SIGTERM-resistant child held its concurrency slot until the request timeout
   (narrow #37 on the hottest exit path). Now escalates to SIGKILL 5s after SIGTERM,
   cleared on proc exit. Per review: gated on the child still being alive
   (exitCode===null && signalCode===null) so the normal-success close no longer
   fires a spurious SIGTERM or leaks the 5s timer; killTimer.unref() so a genuine
   disconnect timer never delays graceful shutdown.

3. Per-key quota TOCTOU — documented as best-effort/eventually-consistent (inline
   comment + README note): concurrent requests at the boundary can overshoot by up
   to MAX_CONCURRENT and cache hits are uncounted. Chose documentation over an
   in-flight counter to avoid a decrement-on-all-paths liability (the #37 class) on
   a low-blast-radius internal family rate-limiter — not a payment boundary.

4. [P3] overallTimer cleared on semantic completion — the request timer was cleared
   only on proc exit, so a streamed response that res.end()'d before the child
   exited could record a spurious post-success timeout. New clearOverallTimer()
   (clears the timer ONLY, never touches the `cleaned` slot-accounting flag — no
   slot leak) is called in the streaming stop-success path; cleanup() on exit still
   clears it idempotently and decrements the slot.
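
The disconnect handling described in item 2 can be sketched as below, under the assumption that `res` is a Node.js HTTP response and `proc` a spawned ChildProcess. The function name escalateKillOnClose and the graceMs parameter are hypothetical; the real handler is inline in server.mjs and may be structured differently.

```javascript
// Hypothetical wrapper around the res.on("close") handler.
function escalateKillOnClose(res, proc, graceMs = 5000) {
  res.on("close", () => {
    // Gate on the child still being alive, so a normal-success close
    // neither sends a spurious SIGTERM nor leaves a timer behind.
    if (proc.exitCode !== null || proc.signalCode !== null) return;

    proc.kill("SIGTERM");

    // Escalate if the child ignores SIGTERM for graceMs milliseconds.
    const killTimer = setTimeout(() => {
      if (proc.exitCode === null && proc.signalCode === null) {
        proc.kill("SIGKILL");
      }
    }, graceMs);

    // Do not let this timer keep the event loop alive during shutdown.
    killTimer.unref();

    // A child that exits in time cancels the escalation.
    proc.once("exit", () => clearTimeout(killTimer));
  });
}
```

The liveness check inside the timer callback is a defensive addition in this sketch; with the exit listener clearing the timer, it should rarely matter.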

ALIGNMENT.md: error-shaping / process-lifecycle / rate-limit documentation forward
no Anthropic operation, so a cli.js citation is N/A under Rule 2. No blacklisted
tokens or port literals introduced; alignment.yml passes.

Independent fresh-context reviewer (opus): APPROVE WITH MINOR (Iron Rule 10) — the
two critical concerns (clearOverallTimer slot-leak class, SIGKILL double-kill) were
verified clean; MINOR #1 (kill-timer leak on success path) folded in; MINOR #2
(pre-existing regex over-redaction of ratios/URLs) left as out-of-scope.

Closes #111.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dtzp555-max dtzp555-max merged commit c3b1f32 into main May 31, 2026
5 checks passed
@dtzp555-max dtzp555-max deleted the fix/111-error-output-lifecycle branch May 31, 2026 12:34

Development

Successfully merging this pull request may close these issues.

[P2] Server error-output sanitization + process-lifecycle reliability (sanitizeError, res-close SIGKILL, quota TOCTOU, overallTimer)

2 participants