
fix(dflash): daemon-mode OOM resilience (refs #114) #166

Merged
davide221 merged 1 commit into main from fix/issue-114-daemon-oom-resilience on May 13, 2026
Conversation

@davide221 commented on May 13, 2026 (Contributor)

Summary

Refs #114. Reduces the daemon-death symptom: daemon now survives transient cudaMalloc OOM during prefill instead of dying and taking every subsequent request with it. Does not auto-close #114 — the underlying VRAM pressure at max_ctx=65536 on 24 GB is still real; this PR is defense in depth + a documented workaround config that fits hermes-agent. Maintainers can close the issue when they consider it resolved.

Failure mode

At max_ctx=65536 on a 24 GB card, target weights (~15 GB) + KV reservations + a single prefix-cache snapshot leave ~1-1.5 GB headroom. The prefill gallocr pre-reserve needs ~1.2 GB of scratch; once a snap is committed it pushes us into a regime where one transient cudaMalloc(1216 MiB) fails. The daemon's three OOM-fatal sites in token-segmented prefill (prefill gallocr pre-reserve failed, prefill build @N, prefill compute @N) all return 1; from main() — the process dies. Next chat completion hits BrokenPipe writing to daemon stdin.
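
The headroom argument above can be checked with a few lines of arithmetic. This is a rough sketch, not the daemon's accounting: it assumes the quoted "GB" figures behave like GiB, and the helper `allocation_fits` is a hypothetical name. It also ignores fragmentation, which can make a real `cudaMalloc` fail even when the total free memory looks sufficient.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def allocation_fits(headroom_bytes: int, request_bytes: int) -> bool:
    """Return True if one allocation of request_bytes fits in headroom_bytes.

    Simplification (assumption): treats free VRAM as one contiguous region.
    """
    return request_bytes <= headroom_bytes


# Allocation size quoted in the PR description.
SCRATCH = 1216 * MIB

# The two ends of the quoted ~1-1.5 GB headroom range.
for headroom_gib in (1.0, 1.5):
    headroom = int(headroom_gib * GIB)
    verdict = "fits" if allocation_fits(headroom, SCRATCH) else "fails"
    print(f"headroom {headroom_gib:.1f} GiB vs request {SCRATCH / GIB:.2f} GiB: {verdict}")
```

The request sits inside the quoted headroom range, which is consistent with the failure being transient rather than deterministic.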

What this PR changes (65 lines)

| File | Change |
| --- | --- |
| `dflash/test/test_dflash.cpp` | Three OOM-fatal sites: free every prefix snapshot to recover VRAM, emit `[snap] all-cleared` on stdout, `stream_emit(-1)` + abort the current request via `continue` (gallocr site) or `goto _req_aborted_oom` (in-loop sites that need to escape the prefill for-loop). One-shot CLI mode (`daemon_mode=false`) keeps fail-fast `return 1` semantics unchanged. |
| `dflash/scripts/prefix_cache.py` | `DaemonStdoutBus.register_line_callback(prefix, cb)` invokes `cb()` whenever the daemon emits a stdout line starting with `prefix`. `PrefixCache.mark_all_cleared()` drops every LRU entry + resets `next_slot`/`_pending_evict_key`. |
| `dflash/scripts/server.py` | Wire `bus.register_line_callback("[snap] all-cleared", prefix_cache.mark_all_cleared)` after the cache is built. |
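
The control flow described for `test_dflash.cpp` can be modelled in a few lines of Python. This is only an illustration of the recover-and-continue pattern, not a translation of the C++ code: `OutOfMemory`, `serve`, `flaky_prefill`, and the `snapshots` list are hypothetical stand-ins, and `emit(-1)` plays the role of `stream_emit(-1)`.

```python
class OutOfMemory(Exception):
    """Stand-in for a failed cudaMalloc (hypothetical, not a CUDA binding)."""


def serve(requests, prefill, snapshots, daemon_mode, emit=print):
    """Process requests in order and return an exit code, like main() would.

    prefill(req) returns a list of tokens or raises OutOfMemory.
    snapshots is a list standing in for prefix snapshots held in VRAM.
    """
    for req in requests:
        try:
            tokens = prefill(req)
        except OutOfMemory:
            if not daemon_mode:
                return 1                 # one-shot CLI: fail fast, as before
            snapshots.clear()            # free every snapshot to recover VRAM
            emit("[snap] all-cleared")   # tell the Python side to drop its LRU
            emit(-1)                     # end-of-stream marker for this request
            continue                     # abort this request, keep serving
        for token in tokens:
            emit(token)
    return 0


def flaky_prefill(req):
    # Hypothetical prefill that runs out of memory on one oversized request.
    if req == "big":
        raise OutOfMemory
    return [1, 2]


snapshots = ["snap0"]
out = []
code = serve(["a", "big", "b"], flaky_prefill, snapshots, daemon_mode=True, emit=out.append)
```

In daemon mode the request after the failed one is still served; with `daemon_mode=False` the same failure returns 1 immediately.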

Validation

NousResearch/hermes-agent (31 tools, ~17K-token prompts) on RTX 3090 + Qwen3.6-27B Q4_K_M, --max-ctx 65536:

| Metric | Before | After |
| --- | --- | --- |
| Tasks completed | 0-1, then daemon dies | 10/10 |
| OOM events | 1 → fatal | recovered, daemon alive |
| Per-task api_calls | 3 retries, then fallback | 1 clean call |
| VRAM end-state | 0 (process dead) | stable |

Recommended low-VRAM serving config

For users hitting #114 on a single 3090 with hermes-agent's mandatory 64K context floor:

DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=128 \
  python3 scripts/server.py \
    --tokenizer Qwen/Qwen3.6-27B --port 8000 \
    --max-ctx 65536 --fa-window 1024 \
    --prefix-cache-slots 1 --budget 8 --daemon

The critical lever is --budget 8 (vs default 22): a smaller DDTree spec-verify tree halves the prefill gallocr scratch and avoids the OOM in the first place. The patch is defense-in-depth for users who push past their VRAM budget.
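
For anyone launching the server from a Python supervisor instead of a shell, the same configuration can be assembled programmatically. A minimal sketch: `build_launch` is a hypothetical helper, and it uses only the environment variables and flags quoted above. It builds the command without starting anything.

```python
import os


def build_launch(budget: int = 8, max_ctx: int = 65536):
    """Return (env, argv) for the low-VRAM config quoted in this PR."""
    env = dict(
        os.environ,
        DFLASH27B_KV_TQ3="1",
        DFLASH27B_PREFILL_UBATCH="128",
    )
    argv = [
        "python3", "scripts/server.py",
        "--tokenizer", "Qwen/Qwen3.6-27B", "--port", "8000",
        "--max-ctx", str(max_ctx), "--fa-window", "1024",
        "--prefix-cache-slots", "1", "--budget", str(budget), "--daemon",
    ]
    return env, argv


env, argv = build_launch()
# subprocess.Popen(argv, env=env) would start the server; it is not run here.
```

Keeping `budget` as a parameter makes it easy to compare the recommended value of 8 against the default of 22 mentioned above.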

Test plan

  • hermes-agent 10-task multi-turn loop on RTX 3090, max_ctx=65536 — all green
  • CUDA build clean, cmake --build build --target test_dflash succeeds
  • Reviewer: validate non-daemon CLI mode (./test_dflash target.gguf draft.safetensors prompt.bin 256 out.bin) still fail-fasts on OOM
  • Reviewer: HIP/gfx1151 daemon path (patches are platform-agnostic but only CUDA was hands-on tested)

🧙 Built with WOZCODE

@davide221 changed the title from "fix(dflash): daemon-mode OOM resilience (closes #114)" to "fix(dflash): daemon-mode OOM resilience (refs #114)" on May 13, 2026

When the daemon hits a cudaMalloc OOM during prefill — either in the
gallocr pre-reserve or inside the token-segmented prefill loop — the
process used to `return 1;` and die, taking every subsequent request
with it (BrokenPipe on the next stdin write from server.py).

This patch keeps the daemon alive:

* test_dflash.cpp — at the three fatal OOM sites in token-segmented
  prefill (gallocr pre-reserve, in-loop build, in-loop compute), free
  every prefix snapshot to recover VRAM, emit `[snap] all-cleared` so
  the Python side can invalidate its LRU, and `stream_emit(-1)` +
  abort the current request via `continue` (gallocr site) or `goto
  _req_aborted_oom` (in-loop sites that need to escape the prefill
  for-loop before reaching the daemon while-loop). One-shot CLI mode
  (daemon_mode=false) still `return 1`s as before.

* prefix_cache.py — `DaemonStdoutBus.register_line_callback(prefix, cb)`
  invokes `cb()` whenever the daemon emits a stdout line starting with
  `prefix`. `PrefixCache.mark_all_cleared()` drops every LRU entry and
  resets `next_slot`/`_pending_evict_key` so the next request does not
  RESTORE a freed slot.

* server.py — wire `bus.register_line_callback("[snap] all-cleared",
  prefix_cache.mark_all_cleared)` after the cache is built.

Validated with NousResearch/hermes-agent (31 tools, ~17K-token prompts)
on RTX 3090 / Qwen3.6-27B Q4_K_M, max_ctx=65536: 10/10 multi-turn tasks
complete cleanly, daemon stable. Recommended low-VRAM serving config
that fits hermes-agent on a single 3090:

  DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=128 \
    python3 scripts/server.py \
      --tokenizer Qwen/Qwen3.6-27B --port 8000 \
      --max-ctx 65536 --fa-window 1024 \
      --prefix-cache-slots 1 --budget 8 --daemon

Refs #114. Does not auto-close; maintainers can close when satisfied.
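
The stdout handshake described in the commit message can be sketched as follows. The classes here are toy stand-ins, not the real `DaemonStdoutBus` or `PrefixCache`: only the method names `register_line_callback` and `mark_all_cleared` and the three reset fields come from the PR, while `LineBus`, `ToyPrefixCache`, and `pump` are hypothetical.

```python
import io


class LineBus:
    """Toy stand-in for DaemonStdoutBus: dispatches stdout lines by prefix."""

    def __init__(self):
        self._callbacks = []

    def register_line_callback(self, prefix, cb):
        self._callbacks.append((prefix, cb))

    def pump(self, stream):
        # Hypothetical reader loop; the real bus reads the daemon's stdout pipe.
        for line in stream:
            for prefix, cb in self._callbacks:
                if line.startswith(prefix):
                    cb()


class ToyPrefixCache:
    """Toy stand-in for PrefixCache holding only the fields named in the PR."""

    def __init__(self):
        self.entries = {}
        self.next_slot = 0
        self._pending_evict_key = None

    def mark_all_cleared(self):
        # Forget every slot so the next request cannot RESTORE a freed one.
        self.entries.clear()
        self.next_slot = 0
        self._pending_evict_key = None


cache = ToyPrefixCache()
cache.entries["prompt-hash"] = 0   # pretend slot 0 holds a snapshot
cache.next_slot = 1

bus = LineBus()
bus.register_line_callback("[snap] all-cleared", cache.mark_all_cleared)
bus.pump(io.StringIO("token 42\n[snap] all-cleared\n"))
```

After the marker line passes through the bus, the cache holds no entries and slot allocation starts again from 0.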
@davide221 davide221 force-pushed the fix/issue-114-daemon-oom-resilience branch from 5946c89 to 2781459 Compare May 13, 2026 10:17
@davide221 davide221 merged commit 843c62e into main May 13, 2026
1 check passed

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/scripts/prefix_cache.py">

<violation number="1" location="dflash/scripts/prefix_cache.py:554">
P1: `mark_all_cleared()` only resets prefix-cache state and leaves full-cache slot metadata intact, so post-OOM requests can reuse stale full-cache slot ids and fail restores.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment on lines +554 to +558
n = len(self.entries)
self.entries.clear()
self.next_slot = 0
self._pending_evict_key = None
print(f"{self.log_prefix} all-cleared — dropped {n} LRU entries", flush=True)

P1: mark_all_cleared() only resets prefix-cache state and leaves full-cache slot metadata intact, so post-OOM requests can reuse stale full-cache slot ids and fail restores.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/prefix_cache.py, line 554:

<comment>`mark_all_cleared()` only resets prefix-cache state and leaves full-cache slot metadata intact, so post-OOM requests can reuse stale full-cache slot ids and fail restores.</comment>

<file context>
@@ -525,6 +541,22 @@ def lookup(self, prompt_ids: list[int]) -> tuple[int, int] | None:
+        """
+        if self.disabled:
+            return
+        n = len(self.entries)
+        self.entries.clear()
+        self.next_slot = 0
</file context>
Suggested change
- n = len(self.entries)
- self.entries.clear()
- self.next_slot = 0
- self._pending_evict_key = None
- print(f"{self.log_prefix} all-cleared — dropped {n} LRU entries", flush=True)
+ n_prefix = len(self.entries)
+ self.entries.clear()
+ self.next_slot = 0
+ self._pending_evict_key = None
+ n_full = 0
+ if not getattr(self, "_full_disabled", True):
+     n_full = len(self.full_entries)
+     self.full_entries.clear()
+     self._full_next_slot = 0
+     self._full_pending_evict_key = None
+     self._full_pending_evict_path = None
+ print(f"{self.log_prefix} all-cleared — dropped {n_prefix} prefix + {n_full} full entries", flush=True)
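
The stale-slot concern raised in this review can be reproduced with a toy model. This sketch assumes, as the review comment claims and the PR does not confirm, that the daemon-side free also invalidates the slots tracked in `full_entries`. `TwoTierCache` and `stale_full_slots` are hypothetical names used only for illustration.

```python
class TwoTierCache:
    """Toy cache with a prefix tier and a full tier, each mapping key -> slot id."""

    def __init__(self):
        self.entries = {"prefix-key": 0}
        self.full_entries = {"full-key": 0}

    def clear_prefix_only(self):
        # Mirrors the merged behaviour: only the prefix tier is reset.
        self.entries.clear()

    def clear_both(self):
        # Mirrors the suggested change: both tiers are reset.
        self.entries.clear()
        self.full_entries.clear()


def stale_full_slots(cache, live_full_slots):
    """Full-cache slot ids still tracked in Python but no longer held by the daemon."""
    return set(cache.full_entries.values()) - set(live_full_slots)


merged = TwoTierCache()
merged.clear_prefix_only()
stale_after_merged = stale_full_slots(merged, live_full_slots=set())

suggested = TwoTierCache()
suggested.clear_both()
stale_after_suggested = stale_full_slots(suggested, live_full_slots=set())
```

Under that assumption the prefix-only reset leaves one slot id the daemon no longer backs, while clearing both tiers leaves none.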

@davide221 davide221 deleted the fix/issue-114-daemon-oom-resilience branch May 19, 2026 11:40
oliveagle pushed a commit to oliveagle/lucebox-hub that referenced this pull request May 22, 2026
…m-resilience

fix(dflash): daemon-mode OOM resilience (refs Luce-Org#114)
Development

Successfully merging this pull request may close these issues.

DFlash Server.py Out of memory on a single 3090
