fix(dflash): daemon-mode OOM resilience (refs #114) by davide221 · Pull Request #166 · Luce-Org/lucebox-hub

davide221 · 2026-05-13T10:13:42Z

Summary

Refs #114. Reduces the daemon-death symptom: the daemon now survives a transient cudaMalloc OOM during prefill instead of dying and taking every subsequent request with it. Does not auto-close #114: the underlying VRAM pressure at max_ctx=65536 on 24 GB is still real. This PR is defense in depth plus a documented workaround config that fits hermes-agent. Maintainers can close the issue when they consider it resolved.

Failure mode

At max_ctx=65536 on a 24 GB card, target weights (~15 GB) + KV reservations + a single prefix-cache snapshot leave ~1-1.5 GB headroom. The prefill gallocr pre-reserve needs ~1.2 GB of scratch; once a snapshot is committed, it pushes us into a regime where one transient cudaMalloc(1216 MiB) fails. The daemon's three OOM-fatal sites in token-segmented prefill (prefill gallocr pre-reserve failed, prefill build @N, prefill compute @N) all `return 1;` from main(), so the process dies. The next chat completion then hits BrokenPipe when writing to the daemon's stdin.
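
The headroom arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the "GB" figures in the description are GiB, and the `fits` helper is a hypothetical name, not part of the PR. It also ignores fragmentation, which can make a real cudaMalloc fail even when total free memory looks sufficient.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Scratch size quoted in the PR description: cudaMalloc(1216 MiB).
SCRATCH_BYTES = 1216 * MIB


def fits(free_bytes: int, request_bytes: int) -> bool:
    """Hypothetical helper: True if a single request fits in the free bytes."""
    return request_bytes <= free_bytes


# The description quotes roughly 1 to 1.5 GB of headroom (assumed GiB here).
for headroom_gib in (1.0, 1.25, 1.5):
    free = int(headroom_gib * GIB)
    verdict = "fits" if fits(free, SCRATCH_BYTES) else "OOM"
    print(f"headroom {headroom_gib:.2f} GiB, scratch {SCRATCH_BYTES / GIB:.4f} GiB: {verdict}")
```

Under the GiB assumption, 1216 MiB is 1.1875 GiB, so the request fails at the low end of the quoted headroom range and succeeds from 1.25 GiB upward. That is consistent with the OOM being transient rather than deterministic.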

What this PR changes (65 lines)

| File | Change |
| --- | --- |
| `dflash/test/test_dflash.cpp` | Three OOM-fatal sites: free every prefix snapshot to recover VRAM, emit `[snap] all-cleared` on stdout, `stream_emit(-1)` + abort the current request via `continue` (gallocr site) or `goto _req_aborted_oom` (in-loop sites that need to escape the prefill for-loop). One-shot CLI mode (`daemon_mode=false`) keeps fail-fast `return 1` semantics unchanged. |
| `dflash/scripts/prefix_cache.py` | `DaemonStdoutBus.register_line_callback(prefix, cb)` invokes `cb()` whenever the daemon emits a stdout line starting with `prefix`. `PrefixCache.mark_all_cleared()` drops every LRU entry + resets `next_slot`/`_pending_evict_key`. |
| `dflash/scripts/server.py` | Wire `bus.register_line_callback("[snap] all-cleared", prefix_cache.mark_all_cleared)` after the cache is built. |
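
The Python side of this change can be pictured with a minimal sketch. It is not the code from `prefix_cache.py`: `StdoutBusSketch`, `PrefixCacheSketch`, `feed_line`, and `remember` are hypothetical stand-ins, and only `register_line_callback`, `mark_all_cleared`, `entries`, `next_slot`, `_pending_evict_key`, and the `[snap] all-cleared` marker come from the PR.

```python
from __future__ import annotations

from collections import OrderedDict
from typing import Callable


class StdoutBusSketch:
    """Hypothetical stand-in for DaemonStdoutBus (prefix-matched callbacks only)."""

    def __init__(self) -> None:
        self._callbacks: list[tuple[str, Callable[[], None]]] = []

    def register_line_callback(self, prefix: str, cb: Callable[[], None]) -> None:
        self._callbacks.append((prefix, cb))

    def feed_line(self, line: str) -> None:
        # Hypothetical entry point: called once per daemon stdout line.
        for prefix, cb in self._callbacks:
            if line.startswith(prefix):
                cb()


class PrefixCacheSketch:
    """Hypothetical stand-in for PrefixCache, tracking only the fields the PR names."""

    def __init__(self) -> None:
        self.entries: OrderedDict[bytes, int] = OrderedDict()
        self.next_slot = 0
        self._pending_evict_key = None

    def remember(self, key: bytes) -> None:
        # Hypothetical helper: record that `key` lives in the next free slot.
        self.entries[key] = self.next_slot
        self.next_slot += 1

    def mark_all_cleared(self) -> None:
        # The daemon freed every snapshot, so forget every slot on this side too.
        self.entries.clear()
        self.next_slot = 0
        self._pending_evict_key = None


bus = StdoutBusSketch()
cache = PrefixCacheSketch()
cache.remember(b"prompt-a")
bus.register_line_callback("[snap] all-cleared", cache.mark_all_cleared)

bus.feed_line("an unrelated daemon line")  # no registered prefix matches
entries_before = len(cache.entries)
bus.feed_line("[snap] all-cleared")        # triggers mark_all_cleared()
entries_after = len(cache.entries)
```

Keying the invalidation on a stdout marker keeps the two processes loosely coupled: the daemon only has to print one line, and the server decides what that line invalidates.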

Validation

NousResearch/hermes-agent (31 tools, ~17K-token prompts) on RTX 3090 + Qwen3.6-27B Q4_K_M, --max-ctx 65536:

| Metric | Before | After |
| --- | --- | --- |
| Tasks completed | 0-1 then daemon dies | 10/10 |
| OOM events | 1 → fatal | recovered, daemon alive |
| Per-task `api_calls` | 3 retries then fallback | 1 clean call |
| VRAM end-state | 0 (process dead) | stable |

Recommended low-VRAM serving config

For users hitting #114 on a single 3090 with hermes-agent's mandatory 64K context floor:

DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=128 \
  python3 scripts/server.py \
    --tokenizer Qwen/Qwen3.6-27B --port 8000 \
    --max-ctx 65536 --fa-window 1024 \
    --prefix-cache-slots 1 --budget 8 --daemon

The critical lever is --budget 8 (vs default 22): a smaller DDTree spec-verify tree halves the prefill gallocr scratch and avoids the OOM in the first place. The patch is defense-in-depth for users who push past their VRAM budget.

Test plan

hermes-agent 10-task multi-turn loop on RTX 3090, max_ctx=65536 — all green
CUDA build clean, cmake --build build --target test_dflash succeeds
Reviewer: validate non-daemon CLI mode (./test_dflash target.gguf draft.safetensors prompt.bin 256 out.bin) still fail-fasts on OOM
Reviewer: HIP/gfx1151 daemon path (patches are platform-agnostic but only CUDA was hands-on tested)


When the daemon hits a cudaMalloc OOM during prefill, either in the gallocr pre-reserve or inside the token-segmented prefill loop, the process used to `return 1;` and die, taking every subsequent request with it (BrokenPipe on the next stdin write from server.py). This patch keeps the daemon alive:

* test_dflash.cpp: at the three fatal OOM sites in token-segmented prefill (gallocr pre-reserve, in-loop build, in-loop compute), free every prefix snapshot to recover VRAM, emit `[snap] all-cleared` so the Python side can invalidate its LRU, and `stream_emit(-1)` + abort the current request via `continue` (gallocr site) or `goto _req_aborted_oom` (in-loop sites that need to escape the prefill for-loop before reaching the daemon while-loop). One-shot CLI mode (daemon_mode=false) still `return 1`s as before.
* prefix_cache.py: `DaemonStdoutBus.register_line_callback(prefix, cb)` invokes `cb()` whenever the daemon emits a stdout line starting with `prefix`. `PrefixCache.mark_all_cleared()` drops every LRU entry and resets `next_slot`/`_pending_evict_key` so the next request does not RESTORE a freed slot.
* server.py: wire `bus.register_line_callback("[snap] all-cleared", prefix_cache.mark_all_cleared)` after the cache is built.

Validated with NousResearch/hermes-agent (31 tools, ~17K-token prompts) on RTX 3090 / Qwen3.6-27B Q4_K_M, max_ctx=65536: 10/10 multi-turn tasks complete cleanly, daemon stable.

Recommended low-VRAM serving config that fits hermes-agent on a single 3090:

DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=128 \
  python3 scripts/server.py \
    --tokenizer Qwen/Qwen3.6-27B --port 8000 \
    --max-ctx 65536 --fa-window 1024 \
    --prefix-cache-slots 1 --budget 8 --daemon

Refs #114. Does not auto-close; maintainers can close when satisfied.
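
The daemon-side control flow in this commit message can be modelled in Python as a rough sketch. The real change is C++ in `test_dflash.cpp` and uses `continue` and `goto _req_aborted_oom`; here an exception plays the role of escaping the prefill loop. `PrefillOOM`, `serve`, `fake_prefill`, and the `"ok:..."` output are hypothetical names used only for illustration.

```python
class PrefillOOM(Exception):
    """Hypothetical stand-in for a failed cudaMalloc during prefill."""


def serve(requests, prefill, snapshots, emit, daemon_mode=True):
    """Model of the request loop: abort one request on OOM, keep serving the rest."""
    for req in requests:
        try:
            prefill(req)
        except PrefillOOM:
            if not daemon_mode:
                return 1                 # one-shot CLI mode stays fail-fast
            snapshots.clear()            # free every prefix snapshot to recover VRAM
            emit("[snap] all-cleared")   # lets the Python side drop its LRU
            emit(-1)                     # stands in for stream_emit(-1)
            continue                     # abort this request only
        emit(f"ok:{req}")                # hypothetical success output
    return 0


def fake_prefill(req):
    # Hypothetical workload: only the "big" request runs out of memory.
    if req == "big":
        raise PrefillOOM(req)


out = []
snapshots = {0: "snapshot"}
rc_daemon = serve(["small", "big", "small"], fake_prefill, snapshots, out.append)

cli_out = []
rc_cli = serve(["big", "small"], fake_prefill, {0: "snapshot"}, cli_out.append,
               daemon_mode=False)
```

In the daemon run the request after the OOM is still served, while the one-shot run exits with status 1 on the first failure, mirroring the split between the two modes described above.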

cubic-dev-ai

1 issue found across 3 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/scripts/prefix_cache.py">

<violation number="1" location="dflash/scripts/prefix_cache.py:554">
P1: `mark_all_cleared()` only resets prefix-cache state and leaves full-cache slot metadata intact, so post-OOM requests can reuse stale full-cache slot ids and fail restores.</violation>
</file>


cubic-dev-ai · 2026-05-13T10:18:45Z

+        n = len(self.entries)
+        self.entries.clear()
+        self.next_slot = 0
+        self._pending_evict_key = None
+        print(f"{self.log_prefix} all-cleared — dropped {n} LRU entries", flush=True)


P1: mark_all_cleared() only resets prefix-cache state and leaves full-cache slot metadata intact, so post-OOM requests can reuse stale full-cache slot ids and fail restores.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/prefix_cache.py, line 554: <comment>`mark_all_cleared()` only resets prefix-cache state and leaves full-cache slot metadata intact, so post-OOM requests can reuse stale full-cache slot ids and fail restores.</comment> <file context> @@ -525,6 +541,22 @@ def lookup(self, prompt_ids: list[int]) -> tuple[int, int] | None: + """ + if self.disabled: + return + n = len(self.entries) + self.entries.clear() + self.next_slot = 0 </file context>

Suggested change

-        n = len(self.entries)
-        self.entries.clear()
-        self.next_slot = 0
-        self._pending_evict_key = None
-        print(f"{self.log_prefix} all-cleared — dropped {n} LRU entries", flush=True)
+        n_prefix = len(self.entries)
+        self.entries.clear()
+        self.next_slot = 0
+        self._pending_evict_key = None
+        n_full = 0
+        if not getattr(self, "_full_disabled", True):
+            n_full = len(self.full_entries)
+            self.full_entries.clear()
+            self._full_next_slot = 0
+            self._full_pending_evict_key = None
+            self._full_pending_evict_path = None
+        print(f"{self.log_prefix} all-cleared — dropped {n_prefix} prefix + {n_full} full entries", flush=True)


davide221 changed the title ~~fix(dflash): daemon-mode OOM resilience (closes #114)~~ fix(dflash): daemon-mode OOM resilience (refs #114) May 13, 2026

davide221 force-pushed the fix/issue-114-daemon-oom-resilience branch from 5946c89 to 2781459 Compare May 13, 2026 10:17

davide221 merged commit 843c62e into main May 13, 2026
1 check passed

cubic-dev-ai Bot reviewed May 13, 2026


davide221 deleted the fix/issue-114-daemon-oom-resilience branch May 19, 2026 11:40

oliveagle pushed a commit to oliveagle/lucebox-hub that referenced this pull request May 22, 2026

Merge pull request Luce-Org#166 from Luce-Org/fix/issue-114-daemon-oo…

bc6a5c2

…m-resilience fix(dflash): daemon-mode OOM resilience (refs Luce-Org#114)

